Simple WebPageCheck (Spider)

zeltera

Jan 10, 2007

CPOL

Small application that checks a list of websites for specified text

Download demo project - 32.18 KB

Introduction

Goal: an application that checks whether each site in a list exists and, if it does, whether it contains a specified text.

I designed the GUI shown in the picture above. The interface has three multiline text boxes:

txtImitialList (the list of URLs to check is pasted here, one URL per line)
txtGood (if the check succeeds, meaning the page exists and contains the text we are searching for, the URL appears here)
txtBad (bad pages, for which the check failed)

We also need a text box for the text we are looking for, txtMustContain, and a check box, chkCaseSensitive, that selects a case-sensitive or case-insensitive check.

Finally, btnStart is the button that starts the process. The main job is done by a class, WebFetchClass, that has only one static function:

public static string WebFetch(string url)

This function receives a string argument, the URL of the page, and returns the source of that page, as a string.

using System;
using System.Text;
using System.Net;
using System.IO;

namespace WindowsApplication1
{
    class WebFetchClass
    {
        public static string WebFetch(string url)
        {
            // used to build entire input
            StringBuilder sb = new StringBuilder();

            // used on each read operation
            byte[] buf = new byte[8192];

            // prepare the web page we will be asking for
            HttpWebRequest request = (HttpWebRequest)
                WebRequest.Create(url);

            // execute the request; the using blocks release the
            // connection even if reading fails
            using (HttpWebResponse response = (HttpWebResponse)
                request.GetResponse())
            // we will read data via the response stream
            using (Stream resStream = response.GetResponseStream())
            {
                string tempString = null;
                int count = 0;

                do
                {
                    // fill the buffer with data
                    count = resStream.Read(buf, 0, buf.Length);

                    // make sure we read some data
                    if (count != 0)
                    {
                        // translate from bytes to ASCII text
                        // (non-ASCII characters are replaced by '?')
                        tempString = Encoding.ASCII.GetString(buf, 0, count);

                        // continue building the string
                        sb.Append(tempString);
                    }
                }
                while (count > 0); // any more data to read?
            }

            // return page source
            return sb.ToString();
        }
    }
}

Don't forget to include the namespaces System.Net (for HttpWebResponse and HttpWebRequest) and System.IO (for stream functions):

using System.Net;
using System.IO;

In the form1.cs file, I wrote three functions to make the program easier to understand. Each function does almost the same job, with small differences.

The three functions are:

CheckPageLoad()          // checks only whether the specified page exists on the server

DoCheckCaseSensitive()   // case-sensitive check for the specified text

DoCheck()                // case-insensitive check for the specified text

The code for these functions is here:

private void CheckPageLoad()
        {
            int totalLinks;
            int count = 0;
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().Length > 10)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                        totalLinks.ToString();
                this.Update();
            }
        }
        private void DoCheckCaseSensitive()
        {
            int totalLinks;
            int count = 0;
            string to_check = txtMustContain.Text.Trim();
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    // IndexOf returns 0 for a match at the very start, so test >= 0
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().IndexOf(to_check) >= 0)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                    totalLinks.ToString();
                this.Update();
            }
        }
        private void DoCheck()
        {
            int totalLinks;
            int count = 0;
            string to_check = txtMustContain.Text.Trim().ToLower();
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    // IndexOf returns 0 for a match at the very start, so test >= 0
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().ToLower().IndexOf
                                (to_check) >= 0)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                    totalLinks.ToString();
                this.Update();
            }
        }
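
Since the three functions differ only in how they test the page text, the shared parts can be factored out. The sketch below is one possible way to do that, not the code from the demo project: CheckHelpers, ParseUrlList and PageContains are hypothetical names. It also skips blank lines and strips the '\r' that Windows line endings leave behind after splitting on '\n'.

```csharp
using System;
using System.Collections.Generic;

static class CheckHelpers
{
    // Splits the pasted list into URLs, one per line, dropping blank
    // lines and surrounding whitespace (including a trailing '\r').
    public static List<string> ParseUrlList(string text)
    {
        List<string> urls = new List<string>();
        foreach (string line in text.Split('\n'))
        {
            string url = line.Trim();
            if (url.Length > 0)
                urls.Add(url);
        }
        return urls;
    }

    // Returns true when page contains text. A match at index 0 counts,
    // so the comparison is >= 0.
    public static bool PageContains(string page, string text, bool caseSensitive)
    {
        StringComparison mode = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;
        return page.IndexOf(text, mode) >= 0;
    }
}
```

With helpers like these, a single loop could serve both text checks by passing chkCaseSensitive.Checked as the last argument, and no lowercase copy of each page is needed.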

Now let's start the application with the Start button:

private void btnStart_Click(object sender, EventArgs e)

This is the handler for the event raised when the Start button is pressed. Let's write some code for it:

private void btnStart_Click(object sender, EventArgs e)
        {
            // clear any previous results
            txtBad.Clear();
            txtGood.Clear();
            lblStatusCurrentPos.Text = "Starting...";

            if (txtMustContain.Text.Trim() == "")
            {
                // no search text given: only check that each page loads
                Thread t = new Thread(new ThreadStart(CheckPageLoad));
                t.IsBackground = true;
                t.Start();
                return;
            }

            if (chkCaseSensitive.Checked)
            {
                // case-sensitive search for the text
                Thread t = new Thread(new ThreadStart(DoCheckCaseSensitive));
                t.IsBackground = true;
                t.Start();
            }
            else
            {
                // case-insensitive search for the text
                Thread t = new Thread(new ThreadStart(DoCheck));
                t.IsBackground = true;
                t.Start();
            }
        }

As you can see, the function that accesses the Web runs in a separate thread (which requires using System.Threading;), because I don't want the main window to freeze while the process runs.
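
The background-thread pattern can be shown without the form. The following is a sketch under stated assumptions, not the article's code: BackgroundChecker is a hypothetical class, fetch stands in for WebFetchClass.WebFetch, and report is a callback invoked once per URL from the worker thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class BackgroundChecker
{
    private readonly Func<string, string> fetch;   // e.g. WebFetchClass.WebFetch
    private readonly Action<string, bool> report;  // called once per URL

    public BackgroundChecker(Func<string, string> fetch, Action<string, bool> report)
    {
        this.fetch = fetch;
        this.report = report;
    }

    // Starts the check on a background thread and returns immediately.
    public Thread Start(IList<string> urls, string mustContain)
    {
        Thread t = new Thread(() =>
        {
            foreach (string url in urls)
            {
                bool good;
                try
                {
                    good = fetch(url).IndexOf(mustContain, StringComparison.Ordinal) >= 0;
                }
                catch (Exception)
                {
                    good = false; // any failure counts as a bad page
                }
                report(url, good);
            }
        });
        t.IsBackground = true; // do not keep the process alive on exit
        t.Start();
        return t;
    }
}
```

Windows Forms controls are meant to be touched only from the thread that created them, so in the form the report callback would typically hand each result to the UI thread, for example through Control.BeginInvoke, rather than append to txtGood or txtBad directly from the worker.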

This is a very simple application with minimal error handling: any failure simply marks the URL as bad. It could be improved by checking several URLs in parallel on more threads, or by reporting why a check failed.
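
As one small example of the extra error checking mentioned above, each line could be validated before any request is sent. This is only a sketch: UrlValidator and IsHttpUrl are hypothetical names, and restricting the list to http and https URLs is an assumption about what the tool should accept.

```csharp
using System;

static class UrlValidator
{
    // Returns true when s parses as an absolute http or https URL.
    public static bool IsHttpUrl(string s)
    {
        Uri uri;
        return Uri.TryCreate(s, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Lines that fail this test could be listed in txtBad right away with a note such as "invalid URL", which separates typing mistakes from pages that really failed to load.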

History

10th January, 2007: Initial post