HTTP Data Client - Web Scraping

21 Jul 2011, CPOL, 12 min read
An HttpWebRequest-based library that abstracts how data is retrieved from web sources.

Introduction

The purpose of this article is to describe how you can implement a generic “HTTP Data Client” (I apologize if the name sounds fussy) in C# that lets you query any web-based resource in an elegant manner. I would like to mention from the beginning that this is not “the perfect solution” and that it can surely be improved in many ways, so please feel free to do so. The entire concept is based on the HttpWebRequest class offered by .NET in the System.Net namespace.

Prerequisites

Before I delve into the architecture and code, note that the “HTTP Data Client” project requires some extra libraries. Here is the list:

  • Db4Object (db4o, an object-oriented database; I use it mainly for embedded applications; two assembly files are referenced: Db4objects.Db4o.dll and Db4objects.Db4o.NativeQueries.dll; you can get it from http://www.db4o.com/DownloadNow.aspx).
  • HTML Agility Pack (a library for processing HTML content using various techniques; it is very handy when you want to convert an HTML DOM to XML; one assembly file is referenced: HtmlAgilityPack.dll; you can get the library from http://htmlagilitypack.codeplex.com).
  • Microsoft MSHTML (used to render and parse HTML and JavaScript content).

If you are wondering why I decided to use two different libraries to parse HTML content, the answer is straightforward. The HTML Agility Pack performs really well most of the time and its output is usually what you expect, but not always. So if one library fails to provide the expected results, I can switch to the other one. The major drawback of MSHTML, in my opinion, is its slow processing speed when it is integrated in a non-desktop application (e.g., web sites, web services). The role of Db4Object in this project is to store configuration settings and cached content. One important thing to mention about Db4Object is that the non-server version doesn’t support multi-threading (you can easily replace it with any other storage that suits you).
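
The "switch to the other one" idea can be sketched as a small fallback wrapper. This is only an illustration under stated assumptions: the HtmlToXmlFallback class and the two delegates are hypothetical stand-ins for wrappers around the HTML Agility Pack and MSHTML, and the article does not show how the project itself decides when to switch.

```csharp
using System;
using System.Xml;

public static class HtmlToXmlFallback
{
    // Tries the primary converter first and uses the secondary one when the
    // primary throws an XmlException or returns a document without a root element.
    // Both delegates are assumptions; they stand in for the two parsing libraries.
    public static XmlDocument Convert(
        string html,
        Func<string, XmlDocument> primary,
        Func<string, XmlDocument> secondary)
    {
        try
        {
            XmlDocument result = primary(html);
            if (result != null && result.DocumentElement != null)
                return result;
        }
        catch (XmlException)
        {
            // The primary converter could not produce well-formed XML.
        }

        return secondary(html);
    }
}
```

A caller would pass the faster converter as primary and the slower but more tolerant one as secondary, which matches the trade-off described above.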

Architecture

My solution contains four projects:

  • HtmlAgilityPack (the actual HTML Agility Pack project with source code)
  • HttpData.Client (the main project which implements the rules of HTML processing)
  • HttpData.Client.MsHtmlToXml (the wrapper project over MSHTML and some extensions of it)
  • HttpData.Client.Pdf (the project which implements some PDF processing using IFilter; not important for this article)

There is no point in discussing the HTML Agility Pack here, since you can find all the details and documentation at http://htmlagilitypack.codeplex.com. I will focus mainly on HttpData.Client and try to offer as many details and explanations as possible. The HTTP data client is designed to work in a similar way to the .NET SQL client (System.Data.SqlClient); you will notice that the classes in the project and their logic closely resemble it (I hope it is not just my imagination). I will go through the interfaces and classes and describe their logic and purpose.

IHDPAdapter and HDPAdapter

The purpose of the HDPAdapter class is to allow integration of XML data with other data objects such as DataTable and DataSet. The IHDPAdapter interface exposes two methods which convert XML data into either a DataTable or a DataSet. Currently, only the DataTable conversion method is implemented. Here are the code snippets for the interface and the class:

IHDPAdapter code:
C#
using System.Data;
using System.Xml;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for integration with data objects
    /// (DataTable, DataSet, etc). Is implemented by HDPAdapter.
    /// </summary>
    public interface IHDPAdapter
    {
        #region PROPERTIES
        /// <summary>
        /// Get or set the select HDPCommand object.
        /// </summary>
        IHDPCommand SelectCommand{ get; set; }
        #endregion

        #region METHODS
        /// <summary>
        /// Fill a data table with the content from a specified xml document object.
        /// </summary>
        /// <param name="table">Data table to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        int Fill(DataTable table, XmlDocument source, bool useNodes);

        /// <summary>
        /// (NOT IMPLEMENTED) Fill a data set with the content
        /// from a specified xml document object.
        /// </summary>
        /// <param name="dataset">Data set to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names should
        /// be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        int Fill(DataSet dataset, XmlDocument source, bool useNodes);
        #endregion
    }
}
HDPAdapter code:
C#
using System;
using System.Xml;
using System.Xml.XPath;
using System.Data;
using System.Text;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for integration
    /// with data objects (DataTable, DataSet, etc).
    /// </summary>
    public class HDPAdapter : IHDPAdapter 
    {
        #region PRIVATE VARIABLES
        private IHDPCommand _selectCommand;

        #endregion

        #region Properties
        /// <summary>
        /// Get or set the select IHDPCommand object.
        /// </summary>
        IHDPCommand IHDPAdapter.SelectCommand
        {
            get{ return _selectCommand; }
            set{ _selectCommand = value; }
        }

        /// <summary>
        /// Get or set the select HDPCommand object.
        /// </summary>
        public HDPCommand SelectCommand
        {
            get{ return (HDPCommand)_selectCommand; }
            set{ _selectCommand = value; }
        }

        /// <summary>
        /// Get or set the connection string.
        /// </summary>
        public string ConnectionString { get; set; }

        #endregion

        #region .ctor
        /// <summary>
        /// Create a new instance of HDPAdapter.
        /// </summary>
        public HDPAdapter()
        {
        }

        /// <summary>
        /// Create a new instance of HDPAdapter.
        /// </summary>
        /// <param name="connectionString">Connection string
        /// associated with HDPAdapter object.</param>
        public HDPAdapter(string connectionString)
        {
            this.ConnectionString = connectionString;
        }
        #endregion

        #region Public Methods
        /// <summary>
        /// Fill a data table with the content from a specified xml document object.
        /// </summary>
        /// <param name="table">Data table to be filled.</param>
        /// <param name="source">Xml document
        /// object of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        public int Fill(DataTable table, XmlDocument source, bool useNodes)
        {
            if (table == null || source == null)
                return 0;

            if (table.TableName.Length == 0)
                return 0;

            XPathNavigator xpNav = source.CreateNavigator();
            if (xpNav == null)
                return 0;

            // Every element named after the table is a container;
            // its child elements are the rows.
            XPathNodeIterator xniNode = xpNav.Select("//" + table.TableName);
            while (xniNode.MoveNext())
            {
                XPathNodeIterator xniRowNode =
                   xniNode.Current.SelectChildren(XPathNodeType.Element);
                while (xniRowNode.MoveNext())
                {
                    // First pass (row == null) adds any missing columns, the
                    // second pass copies the values. Every row node, including
                    // the first one or an only one, ends up in the table, and
                    // the iterator's Current navigator is never moved by hand.
                    ReadRow(table, null, xniRowNode.Current, useNodes);

                    DataRow row = table.NewRow();
                    ReadRow(table, row, xniRowNode.Current, useNodes);
                    table.Rows.Add(row);
                }
            }

            return table.Rows.Count;
        }

        // Walks the cells of a single row node: child elements when useNodes
        // is true, attributes otherwise. With row == null it only creates the
        // missing columns; otherwise it fills the row.
        private static void ReadRow(DataTable table, DataRow row,
            XPathNavigator rowNode, bool useNodes)
        {
            // Work on a clone so the caller's iterator position is not disturbed.
            XPathNavigator cell = rowNode.Clone();
            bool onCell = useNodes ? cell.MoveToFirstChild() : cell.MoveToFirstAttribute();
            while (onCell)
            {
                // In node mode, skip text, comments and other non-element children.
                if (!useNodes || cell.NodeType == XPathNodeType.Element)
                {
                    if (row == null)
                    {
                        if (!table.Columns.Contains(cell.Name))
                            table.Columns.Add(new DataColumn(cell.Name));
                    }
                    else
                        row[cell.Name] = cell.Value;
                }

                onCell = useNodes ? cell.MoveToNext() : cell.MoveToNextAttribute();
            }
        }

        /// <summary>
        /// (NOT IMPLEMENTED) Fill a data set with the
        /// content from a specified xml document object.
        /// </summary>
        /// <param name="dataset">Data set to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        public int Fill(DataSet dataset, XmlDocument source, bool useNodes)
        {
            throw new NotImplementedException();
        }
        #endregion

        #region Private Methods
        #endregion
    }
}
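
To make the expected XML layout concrete, here is a minimal, self-contained sketch of the attribute mode (useNodes set to false). It assumes the layout implied by the Fill method: an element named after the table, whose child elements are rows and whose attributes become columns. The FillShapeDemo class, the sample data and the ToTable helper are hypothetical and written only against System.Xml and System.Data; they are not part of the library.

```csharp
using System;
using System.Data;
using System.Xml;

public static class FillShapeDemo
{
    // Sample document: <Books> is the table element, each <book> is a row,
    // and the attributes of a row become the column values.
    public static XmlDocument BuildSample()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<root><Books>" +
            "<book title='Dune' year='1965' />" +
            "<book title='Neuromancer' year='1984' />" +
            "</Books></root>");
        return doc;
    }

    // Stand-in for the attribute mode of the adapter (an assumption, not the
    // library's own code): one DataRow per child element of the table element.
    public static DataTable ToTable(XmlDocument source, string tableName)
    {
        DataTable table = new DataTable(tableName);
        foreach (XmlNode container in source.SelectNodes("//" + tableName))
        {
            foreach (XmlNode rowNode in container.ChildNodes)
            {
                if (rowNode.NodeType != XmlNodeType.Element)
                    continue;

                // Create any column that does not exist yet.
                foreach (XmlAttribute attribute in rowNode.Attributes)
                    if (!table.Columns.Contains(attribute.Name))
                        table.Columns.Add(attribute.Name);

                DataRow row = table.NewRow();
                foreach (XmlAttribute attribute in rowNode.Attributes)
                    row[attribute.Name] = attribute.Value;
                table.Rows.Add(row);
            }
        }
        return table;
    }
}
```

With the article's adapter, the equivalent call would be along the lines of new HDPAdapter().Fill(new DataTable("Books"), doc, false), using the Fill signature listed above.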

IHDPConnection and HDPConnection

As the name says, this is the connection class, which manages in an abstract way how a connection behaves. The interface exposes a set of methods and properties relevant to it. There are only three methods exposed and implemented:

  • Open method (changes the connection state to open; this method has an overload which accepts as parameter the URL of the web resource to be opened)
  • Close method (changes the connection state to closed and, if a cache storage is in use, closes it)
  • CreateCommand method (creates a new HDPCommand object and assigns the current connection to it)

Now let us take a look at the properties exposed by IHDPConnection and implemented by HDPConnection:

  • ConnectionURL (the URL of the web resource which will be opened using the current connection)
  • KeepAlive (defines whether the connection should be kept open once the querying is done)
  • AutoRedirect (defines whether the connection allows auto-redirects to be performed)
  • MaxAutoRedirects (defines how many auto-redirects can be performed)
  • UserAgent (defines which user agent will be associated with the connection, e.g., Internet Explorer, Chrome, Opera)
  • ConnectionState (read-only property which tells whether the connection is open or closed)
  • Proxy (defines which proxy will be used when querying is performed)
  • Cookies (the cookies associated with the connection currently or when the querying takes place)
  • ContentType (defines the content type sent with the query, e.g., application/x-www-form-urlencoded, application/json)
  • Headers (the headers associated with the connection currently or when the querying takes place)
  • Referer (the referrer which is going to be used when querying the connection URL)
IHDPConnection code:
C#
using System.Collections.Generic;
using System.Net;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for connection management of different
    /// web sources. Is implemented by HDPConnection.
    /// </summary>
    public interface IHDPConnection
    {
        #region MEMBERS
        #region METHODS
        /// <summary>
        /// Open connection.
        /// </summary>
        void Open();

        /// <summary>
        /// Close connection.
        /// </summary>
        void Close();

        /// <summary>
        /// Create a new HDPCommand object associated with this connection.
        /// </summary>
        /// <returns>HDPCommand object associated with this connection.</returns>
        IHDPCommand CreateCommand();
        #endregion

        #region PROPERTIES
        /// <summary>
        /// Get or set connection url.
        /// </summary>
        string ConnectionURL { get; set; }

        /// <summary>
        /// Get or set the value which specifies
        /// if the connection should be kept open.
        /// </summary>
        bool KeepAlive { get; set; }

        /// <summary>
        /// Get or set the value which specifies if auto redirection is allowed.
        /// </summary>
        bool AutoRedirect { get; set; }

        /// <summary>
        /// Get or set the maximum number of auto redirections.
        /// </summary>
        int MaxAutoRedirects { get; set; }

        /// <summary>
        /// Get or set the value which specifies the user agent to be used.
        /// </summary>
        string UserAgent { get; set; }

        /// <summary>
        /// Get the value which specifies the state of the connection.
        /// </summary>
        HDPConnectionState ConnectionState { get; }

        /// <summary>
        /// Get or set the value which specifies the connection proxy.
        /// </summary>
        HDPProxy Proxy { get; set; }

        /// <summary>
        /// Get or set the value which specifies the cookies used by connection.
        /// </summary>
        CookieCollection Cookies { get; set; }

        /// <summary>
        /// Get or set the value which specifies the content type.
        /// </summary>
        string ContentType { get; set; }

        /// <summary>
        /// Get or set headers details used in HttpWebRequest operations.
        /// </summary>
        List<HDPConnectionHeader> Headers { get; set; }

        /// <summary>
        /// Get or set Http referer.
        /// </summary>
        string Referer { get; set; }
        #endregion
        #endregion
    }
}
HDPConnection code:
C#
using System.Collections.Generic;
using System.Net;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for connection management of different web sources.
    /// </summary>
    public class HDPConnection : IHDPConnection, System.IDisposable
    {
        #region Private Variables
        private HDPConnectionState _connectionState;
        private string _connectionURL;
        private HDPCache cache;
        private bool useCache;
        #endregion

        #region Properties
        /// <summary>
        /// Get the value which specifies if caching will be used.
        /// </summary>
        public bool UseCahe
        {
            get { return useCache; }
        }

        /// <summary>
        /// Get HDPCache object.
        /// </summary>
        public HDPCache Cache
        {
            get { return cache; }
        }
        #endregion

        #region .ctor
        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        public HDPConnection()
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = "";
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        public HDPConnection(string connectionURL)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = connectionURL;
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="cacheDefinitions">HDPCacheDefinition
        /// object used by caching mechanism.</param>
        public HDPConnection(HDPCacheDefinition cacheDefinitions)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = "";
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
            cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
            // Only report caching as enabled when a cache was actually created.
            useCache = cache != null;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        /// <param name="cacheDefinitions">HDPCacheDefinition
        /// object used by caching mechanism.</param>
        public HDPConnection(string connectionURL, HDPCacheDefinition cacheDefinitions)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = connectionURL;
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
            cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
            // Only report caching as enabled when a cache was actually created.
            useCache = cache != null;
        }
        #endregion

        #region Public Methods
        #endregion

        #region IHDPConnection Members
        #region Methods
        /// <summary>
        /// Open connection.
        /// </summary>
        public void Open()
        {
            _connectionState = HDPConnectionState.Open;
        }

        /// <summary>
        /// Open connection using a specific url.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        public void Open(string connectionURL)
        {
            _connectionURL = connectionURL;
            _connectionState = HDPConnectionState.Open;
        }

        /// <summary>
        /// Close connection.
        /// </summary>
        public void Close()
        {
            _connectionState = HDPConnectionState.Closed;

            if (cache != null)
                cache.CloseStorageConnection();
        }

        /// <summary>
        /// Create a new IHDPCommand object associated with this connection.
        /// </summary>
        /// <returns>IHDPCommand object associated with this connection.</returns>
        IHDPCommand IHDPConnection.CreateCommand()
        {
            HDPCommand command = new HDPCommand { Connection = this };
            return command;
        }

        /// <summary>
        /// Create a new HDPCommand object associated with this connection.
        /// </summary>
        /// <returns>HDPCommand object associated with this connection.</returns>
        public HDPCommand CreateCommand()
        {
            HDPCommand command = new HDPCommand { Connection = this };
            return command;
        }
        #endregion

        #region Properties
        /// <summary>
        /// Get or set connection url.
        /// </summary>
        public string ConnectionURL
        {
            get { return _connectionURL; }
            set { _connectionURL = value; }
        }

        /// <summary>
        /// Get or set the value which specifies if auto redirection is allowed.
        /// </summary>
        public bool AutoRedirect { get; set; }

        /// <summary>
        /// Get or set the maximum number of auto redirections.
        /// </summary>
        public int MaxAutoRedirects { get; set; }

        /// <summary>
        /// Get or set the value which specifies if the
        /// connection should be kept open.
        /// </summary>
        public bool KeepAlive { get; set; }

        /// <summary>
        /// Get or set the value which specifies the user agent to be used.
        /// </summary>
        public string UserAgent { get; set; }

        /// <summary>
        /// Get or set the value which specifies the content type.
        /// </summary>
        public string ContentType { get; set; }

        /// <summary>
        /// Get or set the value which specifies the cookies used by connection.
        /// </summary>
        public CookieCollection Cookies { get; set; }

        /// <summary>
        /// Get the value which specifies the state of the connection.
        /// </summary>
        public HDPConnectionState ConnectionState
        {
            get { return _connectionState; }
        }

        /// <summary>
        /// Get or set the value which specifies the connection proxy.
        /// </summary>
        public HDPProxy Proxy { get; set; }

        /// <summary>
        /// Get or set headers details used in HttpWebRequest operations.
        /// </summary>
        public List<HDPConnectionHeader> Headers { get; set; }

        /// <summary>
        /// Get or set Http referer.
        /// </summary>
        public string Referer { get; set; }
        #endregion
        #endregion

        #region IDisposable Members
        ///<summary>
        /// Dispose current object.
        ///</summary>
        public void Dispose()
        {
            this.dispose();
            System.GC.SuppressFinalize(this);
        }

        private void dispose()
        {
            if (_connectionState == HDPConnectionState.Open)
                this.Close();
        }
        #endregion
    }
}
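
As a rough sketch of what these connection properties stand for, the block below copies the same kind of settings onto a System.Net.HttpWebRequest without sending anything. The RequestFactory class and its parameter list are hypothetical; how HDPCommand actually applies the connection settings is in the downloadable source and is not shown here. Proxy and Headers are left out because the HDPProxy and HDPConnectionHeader types are not listed in this article. Note that WebRequest.Create is the API of the .NET Framework era and is flagged as obsolete in recent .NET versions.

```csharp
using System;
using System.Net;

public static class RequestFactory
{
    // Hypothetical helper: maps connection-style settings onto an HttpWebRequest.
    // No request is sent; the caller decides when to call GetResponse().
    public static HttpWebRequest Build(
        string connectionUrl, bool keepAlive, bool autoRedirect, int maxAutoRedirects,
        string userAgent, string contentType, string referer, CookieCollection cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(connectionUrl);

        request.KeepAlive = keepAlive;                            // KeepAlive
        request.AllowAutoRedirect = autoRedirect;                 // AutoRedirect
        request.MaximumAutomaticRedirections = maxAutoRedirects;  // MaxAutoRedirects (must be > 0)
        request.UserAgent = userAgent;                            // UserAgent
        request.ContentType = contentType;                        // ContentType
        request.Referer = referer;                                // Referer

        // Cookies: HttpWebRequest expects a CookieContainer scoped to the URL.
        request.CookieContainer = new CookieContainer();
        if (cookies != null)
            request.CookieContainer.Add(new Uri(connectionUrl), cookies);

        return request;
    }
}
```

The default MaxAutoRedirects value of 1 set in the HDPConnection constructors is a valid value for MaximumAutomaticRedirections, which rejects zero and negative numbers.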

IHDPCommand and HDPCommand

This is the engine that provides the functionality for querying web resources and processing the result (response). It offers a variety of ways to process the response content, such as XPath, RegEx, XSLT, and reflection. I will discuss in detail only the main methods; the rest build on those, and I assume the comments which accompany the methods will suffice to point you in the right direction. Before I get to the methods, let me present the properties. I will not post the content of the HDPCommand class here because of its length; you can analyze it in detail using the source code provided.

  • Connection (defines the connection object associated with this command)
  • Parameters (defines the parameters used in the querying process)
  • CommandType (defines the command type used in the querying process; it is either GET or POST)
  • CommandText (defines the content of the command which is going to be executed; for a GET command, the URL with its query parameters is stored; for a POST command, the body content of the POST action is stored)
  • CommandTimeout (defines the time period in which a response is expected from the web resource)
  • Response (contains the response string received from the web resource based on a query action)
  • Uri (contains the URI of the queried web resource)
  • Path (contains the path of the web resource queried)
  • LastError (contains the last error message encountered in the process)
  • ContentLength (contains the length of the content received from the web resource based on a query action)
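
The difference between GET and POST command text can be illustrated with a short sketch. It assumes that parameters are simple name/value string pairs encoded as application/x-www-form-urlencoded style text; the CommandTextBuilder class is hypothetical and does not reproduce how HDPCommand combines Parameters with CommandText.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CommandTextBuilder
{
    // Encodes name/value pairs as "a=1&b=2", escaping reserved characters.
    // Keys and values are assumed to be non-null.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> p in parameters)
        {
            if (sb.Length > 0)
                sb.Append('&');
            sb.Append(Uri.EscapeDataString(p.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(p.Value));
        }
        return sb.ToString();
    }

    // GET: the parameters are appended to the URL as a query string.
    public static string ForGet(string url, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        string query = Encode(parameters);
        if (query.Length == 0)
            return url;
        return url + (url.Contains("?") ? "&" : "?") + query;
    }

    // POST: the same encoded text travels in the request body instead.
    public static string ForPost(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return Encode(parameters);
    }
}
```

For a POST built this way, the matching ContentType on the connection would be application/x-www-form-urlencoded.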

We can now move to the exposed/implemented methods.

  • GetParametersCount (gets the number of parameters used in the query process)
  • CreateParameter (creates a new parameter to be used in the query process)
  • ExecuteNonQuery (executes a query on a web resource using either the GET or POST method and returns the number of results received; it has a parameter which specifies whether the collection of parameters used in the query should be cleared at the end)
  • Execute (executes a query on a web resource using either the GET or POST method, and it returns a boolean value: true if the query was executed with success, false if it failed)
  • ExecuteStream (executes a query on a web resource using either GET or POST method, and it returns the underlying HTTP response stream)
  • CloseResponse (it closes the HTTP response stream opened by the ExecuteStream method)
  • ExecuteNavigator (executes a query on a web resource using either the GET or POST method and returns an XPathNavigator object used to navigate through the response converted to XML; it has a parameter which specifies whether the collection of parameters used in the query should be cleared at the end)
  • ExecuteDocument (executes a query on a web resource using either the GET or POST method and returns an IXPathNavigable object used to navigate through the response converted to XML; it has an overload whose “expression” parameter is an XPath expression applied when processing the result)
  • ExecuteBinary (executes a query on a web resource using either the GET or POST method and returns the result as a byte array; this is mostly used when querying binary content from web resources, e.g., PDF files and images; one of its overloads takes a parameter which imposes a limit on the output buffer)
  • ExecuteBinaryConversion (executes a query on a web resource using either the GET or POST method and returns the result as a string; this is used when querying binary content from web resources, e.g., a PDF file whose content is converted from binary to string)
  • ExecuteString (executes a query on a web resource using either GET or POST method, and it returns result as plain string)
  • ExecuteValue (executes a query on a web resource using either GET or POST method, and it returns result as string which is a representation of an XPath expression applied; instead of an XPath expression, it can be a RegEx)
  • ExecuteCollection (executes a query on a web resource using either GET or POST method, and it returns result as a generic string collection; the result is a representation of either an XPath or RegEx expression applied on the result)
  • ExecuteArray (executes a query on a web resource using either GET or POST method, and it returns result as a string array; the result is a representation of either an XPath or RegEx expression applied on the result)
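
The XPath and RegEx processing that ExecuteValue and ExecuteCollection are described as performing can be sketched on a plain response string. This is a minimal illustration under stated assumptions: the response is already well-formed XML, and the ResponseQueries class with its two methods is hypothetical rather than the code inside HDPCommand.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;

public static class ResponseQueries
{
    // Returns the string value of the first node matched by the XPath
    // expression, or null when nothing matches.
    public static string ValueByXPath(string xml, string expression)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XPathNavigator node = doc.CreateNavigator().SelectSingleNode(expression);
        return node == null ? null : node.Value;
    }

    // Returns every regular-expression match found in the raw response text.
    public static List<string> CollectionByRegex(string response, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match match in Regex.Matches(response, pattern))
            results.Add(match.Value);
        return results;
    }
}
```

XPath suits responses that convert cleanly to XML, while a regular expression works directly on the raw text and can be a fallback when the conversion is unreliable.
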
IHDPCommand code:
C#
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for querying and processing data from
    /// different web sources. Is implemented by HDPCommand.
    /// </summary>
    public interface IHDPCommand
    {
        #region Members
        #region Properties
        /// <summary>
        /// Get or set the command connection object.
        /// </summary>
        IHDPConnection Connection { get; set; }

        /// <summary>
        /// Get the command parameters collection.
        /// </summary>
        IHDPParameterCollection Parameters { get; }

        /// <summary>
        /// Get or set the command type.
        /// </summary>
        HDPCommandType CommandType { get; set; }

        /// <summary>
        /// Get or set the command text. 
        /// </summary>
        string CommandText { get; set; }

        /// <summary>
        /// Get or set the command timeout.
        /// </summary>
        int CommandTimeout { get; set; }

        /// <summary>
        /// Get the response retrieved from the server.
        /// </summary>
        string Response { get; }

        /// <summary>
        /// Get web resource URI.
        /// </summary>
        string Uri { get; }

        /// <summary>
        /// Get web resource absolute path.
        /// </summary>
        string Path { get; }

        /// <summary>
        /// Get the last error that occurred.
        /// </summary>
        string LastError { get; }

        /// <summary>
        /// Get the content length of response.
        /// </summary>
        long ContentLength { get; }
        #endregion

        #region Methods
        /// <summary>
        /// Get the parameters number.
        /// </summary>
        /// <returns>Number of parameters.</returns>
        int GetParametersCount();

        /// <summary>
        /// Create a new IHDPParameter object.
        /// </summary>
        /// <returns>IHDPParameter parameter object.</returns>
        IHDPParameter CreateParameter();

        /// <summary>
        /// Execute an expression against the web server and return the number of results.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Number of results determined by the expression.</returns>
        int ExecuteNonQuery(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and does not read the response stream.
        /// </summary>
        /// <returns>True if the command executed successfully, otherwise false.</returns>
        bool Execute();

        /// <summary>
        /// Execute a query against the web server.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>True if the command executed successfully, otherwise false.</returns>
        bool Execute(bool clearParams);

        /// <summary>
        /// Execute a query against the web server.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Returns the underlying http response stream.</returns>
        Stream ExecuteStream(bool clearParams);

        /// <summary>
        /// Closes the http response object.
        /// Usable only with ExecuteStream method.
        /// </summary>
        void CloseResponse();

        /// <summary>
        /// Execute a query against the web server and return an XPathNavigator
        /// object used to navigate through the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>XPathNavigator object used
        /// to navigate through the query result.</returns>
        XPathNavigator ExecuteNavigator(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return
        /// an IXPathNavigable object used to navigate through the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>IXPathNavigable object used to navigate through the query result.</returns>
        IXPathNavigable ExecuteDocument(bool clearParams);

        /// <summary>
        /// Execute a query against the web server, apply an XPath expression
        /// to the query result, and return an IXPathNavigable
        /// object used to navigate through the query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>IXPathNavigable object used to navigate through the query result.</returns>
        IXPathNavigable ExecuteDocument(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a byte[] object which
        /// contains the binary query result. Used when querying
        /// binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Byte array object which contains the binary query result.</returns>
        byte[] ExecuteBinary(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a byte[] object which
        /// contains the binary query result. Used when querying
        /// binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="boundaryLimit">Specify the limit
        /// of the buffer which must be read.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Byte array object which contains the binary query result.</returns>
        byte[] ExecuteBinary(int boundaryLimit, bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a string object
        /// which contains the representation of the binary query result.
        /// Used when querying binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains the representation
        /// of the binary query result.</returns>
        string ExecuteBinaryConversion(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a string
        /// object which contains the representation of the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains
        /// the representation of the query result.</returns>
        string ExecuteString(bool clearParams);

        /// <summary>
        /// Execute a query against the web server, apply an XPath expression
        /// to the query result, and return a string object which contains
        /// the representation of the query result value.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains
        /// the representation of the query result value.</returns>
        string ExecuteValue(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server, apply a regular expression
        /// to the query result, and return a string object which
        /// contains the representation of the query result value.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specifies that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>String object which contains
        /// the representation of the query result value.</returns>
        string ExecuteValue(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, apply a regular expression
        /// to the query result, and return a List object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters collection
        /// should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specifies that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>List object which contains
        /// the representation of the query result.</returns>
        List<string> ExecuteCollection(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, apply an XPath expression
        /// to the query result, and return a List object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>List object which contains
        /// the representation of the query result.</returns>
        List<string> ExecuteCollection(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server, apply a regular expression
        /// to the query result, and return a string array object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specifies that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>String array object which contains
        /// the representation of the query result.</returns>
        string[] ExecuteArray(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, apply an XPath expression
        /// to the query result, and return a string array object
        /// which contains the representation of the query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String array object which contains
        /// the representation of the query result.</returns>
        string[] ExecuteArray(string expression, bool clearParams);
        #endregion
        #endregion
    }
}
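
The regular-expression overloads in the interface above (ExecuteValue, ExecuteCollection, and ExecuteArray with isRegEx) can be pictured with a small standalone sketch. The ExtractAll helper below is hypothetical and is not the library's implementation; it only assumes that the response content is already available as a string and that matches are collected in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexExtractionSketch
{
    // Hypothetical helper (not part of HttpData.Client): returns the first
    // capture group of every match, or the whole match when the pattern
    // defines no capture group.
    public static List<string> ExtractAll(string content, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match match in Regex.Matches(content, pattern, RegexOptions.IgnoreCase))
        {
            Group group = match.Groups.Count > 1 ? match.Groups[1] : match.Groups[0];
            results.Add(group.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Stand-in for the text of an HTTP response.
        string html = "<ul><li><b>Miami</b></li><li><b>Tampa</b></li></ul>";

        foreach (string city in ExtractAll(html, "<b>(.*?)</b>"))
            Console.WriteLine(city);
    }
}
```

An array variant would presumably differ only in the final conversion, for example by calling ToArray() on the list.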

HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage

HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage are the classes which handle the cache. I will not dwell on this subject since it is not central to the article. If you like, you can study those classes in more detail on your own; the code comments should help you grasp their purpose and functionality quickly. The class HDPCacheDefinition is straightforward; it contains a set of public fields which define the cache behavior. Here they are:

  • StorageActiveUntil (defines the date until which the cache is considered to be valid)
  • MemorySizeLimit (imposes a memory size limit of the cache)
  • ObjectsNumberLimit (imposes an objects number limit on the cache)
  • UseStorage (defines if the cache should be persisted on disk)
  • RetrieveFromStorage (defines if a specific value should be searched on the persisted cache on disk)
  • RealtimePersistance (defines if the cache will be persisted on disk in real time once a new value has been added to it)
  • StorageName (defines the file name of the persisted cache on disk)

HDPCacheObject is simply a key/value pair plus a time stamp field used to determine the age of the cached object. The caching system works in a simple manner: when a web resource is queried, its URL is the cache object key and the result of the query is the cache object value. If the cache is activated when using an HDPCommand object to query web resources, each query URL and its response content are stored in the memory cache. If the same web resource is queried again using the same URL, the HTTP request is not performed and the response content is retrieved from the memory cache. Extra options defined in HDPCacheDefinition let you control how the cache behaves. For example, if you impose a cache memory limit of 1024 KB, then every time a new value is added to the cache, its memory footprint is calculated. If the imposed limit is exceeded, the cache content is either stored on disk or deleted, depending on the other behavior definitions. Note that MemorySizeLimit and ObjectsNumberLimit are mutually exclusive: if you define a value greater than 0 for MemorySizeLimit, there is no point in defining a value for ObjectsNumberLimit because it will not be taken into consideration, and vice versa.
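
To make the behavior described above more concrete, here is a minimal standalone sketch of a URL-keyed memory cache. The UrlCacheSketch class and its members are hypothetical and do not reproduce HDPCache. As assumptions, the memory footprint is estimated from the string length, a memory limit greater than 0 takes priority over the object limit, and an exceeded limit simply clears the cache instead of persisting it to disk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's cache, for illustration only.
public class UrlCacheSketch
{
    private readonly Dictionary<string, string> entries = new Dictionary<string, string>();
    private readonly long memorySizeLimit;   // in bytes; 0 means "not set"
    private readonly int objectsNumberLimit; // consulted only when memorySizeLimit is 0
    private long currentSize;

    public UrlCacheSketch(long memorySizeLimit, int objectsNumberLimit)
    {
        this.memorySizeLimit = memorySizeLimit;
        this.objectsNumberLimit = objectsNumberLimit;
    }

    // Returns the cached content for the URL; calls fetch only on a cache miss.
    public string GetOrFetch(string url, Func<string, string> fetch)
    {
        string content;
        if (entries.TryGetValue(url, out content))
            return content;

        content = fetch(url);
        Add(url, content);
        return content;
    }

    private void Add(string url, string content)
    {
        // Rough footprint estimate: two bytes per character.
        long size = (long)content.Length * sizeof(char);

        // The two limits are mutually exclusive: only one of them is checked.
        bool overLimit = memorySizeLimit > 0
            ? currentSize + size > memorySizeLimit
            : objectsNumberLimit > 0 && entries.Count + 1 > objectsNumberLimit;

        if (overLimit)
        {
            // Assumption: drop everything. The article's cache may persist to disk instead.
            entries.Clear();
            currentSize = 0;
        }

        entries[url] = content;
        currentSize += size;
    }

    public static void Main()
    {
        int requests = 0;
        // Fake "HTTP request" that counts how often it is really performed.
        Func<string, string> fakeFetch = url => { requests++; return "<html>" + url + "</html>"; };

        UrlCacheSketch cache = new UrlCacheSketch(0, 2);
        cache.GetOrFetch("http://example.com/a", fakeFetch);
        cache.GetOrFetch("http://example.com/a", fakeFetch); // served from memory

        Console.WriteLine("Requests performed: " + requests);
    }
}
```

The second call with the same URL never reaches fakeFetch, which is the effect the article describes for repeated queries against the same web resource.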

HDPCacheDefinition code:
C#
using System;

namespace HttpData.Client
{
    ///<summary>
    /// Defines the cache options.
    ///</summary>
    public class HDPCacheDefinition
    {
        #region Public Variables
        /// <summary>
        /// Specifies the date until which the cache is valid.
        /// </summary>
        public DateTime StorageActiveUntil = DateTime.Now.AddDays(1);

        /// <summary>
        /// Specifies the limit size of the cache memory.
        /// </summary>
        public long MemorySizeLimit;

        /// <summary>
        /// Specifies the limit number of objects which can be stored in the cache.
        /// </summary>
        public int ObjectsNumberLimit = 10000;

        /// <summary>
        /// Specifies if disk storage will be used.
        /// </summary>
        public bool UseStorage = true;
        
        ///<summary>
        /// Specifies if the data should be retrieved from the disk storage.
        ///</summary>
        public bool RetrieveFromStorage;

        /// <summary>
        /// Specifies if the persistence of the cache on disk will be done in real time.
        /// </summary>
        public bool RealtimePersistance;
        
        /// <summary>
        /// Specifies the name of the file of the disk storage.
        /// </summary>
        public string StorageName = "HttpDataProcessorCahe.che";
        #endregion
    }
}
HDPCacheObject code:
C#
using System;

namespace HttpData.Client
{
    /// <summary>
    /// Container for the cached data based on key value pair.
    /// </summary>
    [Serializable]
    public class HDPCacheObject
    {
        #region Private Variables
        private string key;
        private object value;
        private DateTime cacheDate;
        #endregion

        #region Properties
        /// <summary>
        /// Get or set the cache object key.
        /// </summary>
        public string Key
        {
            get { return key; }
            set { key = value; }
        }

        /// <summary>
        /// Get or set the cache object value.
        /// </summary>
        public object Value
        {
            get { return value; }
            set { this.value = value; }
        }

        /// <summary>
        /// Get the cache object date (set when the object is created).
        /// </summary>
        public DateTime CacheDate
        {
            get { return cacheDate; }
        }
        #endregion

        #region .ctor
        /// <summary>
        /// Instantiate a new HDPCacheObject object.
        /// </summary>
        public HDPCacheObject()
        {
            cacheDate = DateTime.Now;
        }

        /// <summary>
        /// Instantiate a new HDPCacheObject object.
        /// </summary>
        /// <param name="key">Key for the cache object</param>
        /// <param name="value">Value for the cache object</param>
        public HDPCacheObject(string key, object value)
        {
            this.key = key;
            this.value = value;

            cacheDate = DateTime.Now;
        }
        #endregion

        #region Public Methods
        #endregion

        #region Private Methods
        #endregion
    }
}

Using the Code

I will provide a couple of examples so you can see how things work; I consider working code to be the best way to understand a library. Let us say, for example, that we would like to retrieve all Florida cities from the following page: http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34. Here is the code to achieve this task.

C#
using System;
using System.Collections.Generic;
using HttpData.Client;

namespace CityStates
{
    class Program
    {
        static void Main(string[] args)
        {
            //A local constant cannot have an access modifier.
            const string connectionUrl =
               "http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34";

            //Create a new instance of HDPCacheDefinition object.
            HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
            {
                UseStorage = false,
                StorageActiveUntil = DateTime.Now,
                ObjectsNumberLimit = 10000,
                RealtimePersistance = false,
                RetrieveFromStorage = false,
                //We will not use a disk storage
                StorageName = null
            };

            //Create a new instance of HDPConnection object.
            //Pass as parameters the initial connection URL and the cache definition object.
            HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
            {
                //Define the content type we would expect.
                ContentType = HDPContentType.TEXT,
                //We want to allow autoredirects
                AutoRedirect = true,
                //Do not perform more than 10 autoredirects
                MaxAutoRedirects = 10,
                //The user agent is FireFox 3
                UserAgent = HDPAgents.FIREFOX_3,
                //We do not want to use a proxy
                Proxy = null // If you want to use a proxy: Proxy = 
                  // new HDPProxy("http://127.0.0.1:999/"
                  //    /*This is your proxy address and its port*/, 
                  // "PROXY_USER_NAME", "PROXY_PASSWORD")
            };
            //Open the connection
            connection.Open();

            //Create a new instance of HDPCommand object.
            //Pass as parameter the HDPConnection object.
            HDPCommand command = new HDPCommand(connection)
            {
                //Activate the memory cache for fast access
                //on same  web resource multiple times
                ActivatePool = true,
                //We will perform a GET action
                CommandType = HDPCommandType.Get,
                //Set the time out period
                CommandTimeout = 60000,
                //Use MSHTML library instead of HtmlAgilityPack
                //(if the value is false then HtmlAgilityPack would be used)
                UseMsHtml = true
            };

            //Execute the query on the web resource. The received
            //HTTPWebResponse content will be converted to XML
            // and the XPath expression will be executed.
            //The method will return the list of Florida state cities.
            List<string> cities = 
              command.ExecuteCollection("//ul/li/b//text()[normalize-space()]", true);
            
            foreach (string city in cities)
                Console.WriteLine(city);

            connection.Close();
        }
    }
}
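
To see what the XPath expression in the example selects, it can help to run it against a small, well-formed snippet using only the classes in System.Xml.XPath. The sketch below is an illustration under assumptions: the markup is made up, the SelectValues helper is hypothetical, and it does not show how HDPCommand converts real HTML to XML.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathCollectionSketch
{
    // Hypothetical helper: applies an XPath expression to XML text and
    // returns the trimmed string value of every selected node.
    public static List<string> SelectValues(string xml, string expression)
    {
        List<string> values = new List<string>();
        XPathDocument document = new XPathDocument(new StringReader(xml));
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator iterator = navigator.Select(expression);

        while (iterator.MoveNext())
            values.Add(iterator.Current.Value.Trim());

        return values;
    }

    public static void Main()
    {
        // Made-up markup standing in for the converted page content.
        string xml =
            "<html><body><ul>" +
            "<li><b>Miami</b></li>" +
            "<li><b> Orlando </b></li>" +
            "<li><b> </b></li>" +
            "</ul></body></html>";

        // text()[normalize-space()] keeps only text nodes that contain
        // something other than whitespace, so the empty entry is skipped.
        foreach (string city in SelectValues(xml, "//ul/li/b//text()[normalize-space()]"))
            Console.WriteLine(city);
    }
}
```

The predicate [normalize-space()] is what filters out whitespace-only text nodes, which are common in HTML that has been converted to XML.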

Here is a different example. Let us say that we would like to log in to the LinkedIn network using a user name and password. Here is the code to achieve that:

C#
using System;
using System.Collections.Generic;
using HttpData.Client;

namespace CityStates
{
    class Program
    {
        static void Main(string[] args)
        {
            //A local constant cannot have an access modifier.
            const string connectionUrl =
              "https://www.linkedin.com/secure/login?trk=hb_signin";

            //Create a new instance of HDPCacheDefinition object.
            HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
            {
                UseStorage = false,
                StorageActiveUntil = DateTime.Now,
                ObjectsNumberLimit = 10000,
                RealtimePersistance = false,
                RetrieveFromStorage = false,
                //We will not use a disk storage
                StorageName = null
            };

            //Create a new instance of HDPConnection object.
            //Pass as parameters the initial connection URL and the cache definition object.
            HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
            {
                //Define the content type we would expect.
                ContentType = HDPContentType.TEXT,
                //We want to allow autoredirects
                AutoRedirect = true,
                //Do not perform more than 10 autoredirects
                MaxAutoRedirects = 10,
                //The user agent is FireFox 3
                UserAgent = HDPAgents.FIREFOX_3,
                //We do not want to use a proxy
                Proxy = null // If you want to use a proxy: Proxy =
                // new HDPProxy("http://127.0.0.1:999/"
                //   /*This is your proxy address and its port*/,
                // "PROXY_USER_NAME", "PROXY_PASSWORD")
            };
            //Open the connection
            connection.Open();

            //Create a new instance of HDPCommand object.
            //Pass as parameter the HDPConnection object.
            HDPCommand command = new HDPCommand(connection)
            {
                //Activate the memory cache for fast access
                //on same  web resource multiple times
                ActivatePool = true,
                //We will perform a GET action
                CommandType = HDPCommandType.Get,
                //Set the time out period
                CommandTimeout = 60000,
                //Use HtmlAgilityPack (if the value is true then MSHTML would be used)
                UseMsHtml = false
            };

            //Define the query parameters used in the POST action.
            //The actual parameter name used by a browser
            //to authenticate you on Linkedin is without '@' sign.
            //Use a HTTP request analyzer and you will notice the difference.
            //This is how the actual POST body will look like:
            //   csrfToken="ajax:-3801133150663455891"&session_key
            //      ="YOUR_EMAIL@gmail.com"&session_password="YOUR_PASSWORD"
            //       &session_login="Sign+In"&session_login=""&session_rikey=""
            HDPParameterCollection parameters = new HDPParameterCollection();
            HDPParameter pToken = 
              new HDPParameter("@csrfToken", "ajax:-3801133150663455891");
            HDPParameter pSessionKey = 
              new HDPParameter("@session_key", "YOUR_EMAIL@gmail.com");
            HDPParameter pSessionPass = 
              new HDPParameter("@session_password", "YOUR_PASSWORD");
            HDPParameter pSessionLogin = 
              new HDPParameter("@session_login", "Sign+In");
            HDPParameter pSessionLogin_ = new HDPParameter("@session_login", "");
            HDPParameter pSessionRiKey = new HDPParameter("@session_rikey", "");

            parameters.Add(pToken);
            parameters.Add(pSessionKey);
            parameters.Add(pSessionPass);
            parameters.Add(pSessionLogin);
            parameters.Add(pSessionLogin_);
            parameters.Add(pSessionRiKey);

            //If everything went OK, LinkedIn will ask us to redirect
            //(unfortunately autoredirect doesn't work in this case).
            //Get the manual redirect URL value.
            string value = command.ExecuteValue(
                    "//a[@id='manual_redirect_link']/@href", true);
            if (value != null && String.Compare(value, 
                      "http://www.linkedin.com/home") == 0)
            {
                command.Connection.ConnectionURL = value;
                command.CommandType = HDPCommandType.Get;

                //Using the manual redirect URL, check if the opened
                //web page contains the welcome message.
                //If it does contain the message, then we are in.
                string content = 
                  command.ExecuteString("//title[contains(.,'Welcome,')]", true);

                if (content.Length > 0)
                    Console.WriteLine(content);
                else
                    Console.WriteLine("Login failed!");
            }
            
            connection.Close();
        }
    }
}
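
The comments in the example mention that the parameter names reach the server without the '@' sign. A minimal sketch of how such a POST body could be assembled is shown below. The BuildBody helper is hypothetical and not part of HttpData.Client, and percent-encoding the values with Uri.EscapeDataString is an assumption about the wire format rather than something taken from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PostBodySketch
{
    // Hypothetical helper: joins name/value pairs into a form-encoded body,
    // dropping the leading '@' from each parameter name.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> parameter in parameters)
        {
            if (body.Length > 0)
                body.Append('&');

            body.Append(Uri.EscapeDataString(parameter.Key.TrimStart('@')));
            body.Append('=');
            body.Append(Uri.EscapeDataString(parameter.Value));
        }
        return body.ToString();
    }

    public static void Main()
    {
        // A list is used instead of a dictionary because the same
        // name (session_login) can appear more than once.
        List<KeyValuePair<string, string>> parameters = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("@session_key", "YOUR_EMAIL@gmail.com"),
            new KeyValuePair<string, string>("@session_password", "YOUR_PASSWORD"),
            new KeyValuePair<string, string>("@session_login", "")
        };

        Console.WriteLine(BuildBody(parameters));
    }
}
```

Comparing the output of a helper like this with what an HTTP request analyzer shows for a real browser login is a quick way to check that the parameter names and values match.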

In your sample project, add the following app.config content if you are going to use MSHTML.

XML
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="LogFilePath" value="..\Log\My-Log.txt"/>
    <add key="HtmlTagsPath" value="HtmlTags.txt"/>
    <add key="AttributesTagsPath" value="HtmlAttributes.txt"/>
  </appSettings>
</configuration>

Notes

  • HttpData.Client.Pdf - not all content belongs to me. I do not recall from where I got parts of it.
  • HDPUtils.cs - I am not proud of its content, I find it to be quite dirty so please ignore that for now.

Issues

HtmlAgilityPack - when used, the content it produces sometimes doesn't match the actual HTML DOM structure, especially when it comes to the form element.

MSHTML - when used, it strips all content between the html tag and the body tag (including the html tag). It also validates the input HTML content against a list of valid elements and attributes, so everything that doesn't match is removed. One important thing to note is that JavaScript content is removed by default. You can change this behavior in the HtmlLoader.cs class found in the HttpData.Client.MsHtmlToXm project.

Points of Interest

It should be fairly clear which kinds of applications can make use of this library: anything that needs to retrieve content from web sources and query it, such as web scraping tools.

History

No updates yet, but I am sure there will be some in the future.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Cyprus
I am a senior software engineer with over 8 years experience. Have worked for different international software companies using different technologies and programming languages like: C/C++, lotus script, lotus API, C#, ASP.NET, WCF, MS-SQL, Oracle, Domino Server, JavaScript.
