Web Data Extraction by Crawling using WinHTTP and Document Object Model (DOM) Instantiation

An article on how to extract data from a Web URL using the WinHTTP library, and how to instantiate the DOM of the crawled document.

Introduction

This article deals with two major issues in automatic web data extraction.

  1. How to use the WinHTTP 5 library to do the crawling (reading Web data).
  2. How to take the data returned by WinHTTP and extract/instantiate the DOM out of it.

At the end we discuss how to make a recursive crawler.

Background

The WinHTTP library follows the HTTP 1.0/1.1 model, which is based on a persistent (keep-alive) connection: we first connect to a web server and then issue requests for documents from it. Subsequent requests to the same web server (hostname, in our case) do not involve tearing down and re-establishing the connection. We discuss here how to extract the HTML data given a URL string. The main problem I experienced was that, for crawling, you might be given a long URL which has to be broken up into a hostname (for the connection) and the rest of the URL path (for the request). You might expect the WinHttpCrackUrl function to do this job, but it doesn't quite: it gives you the correct URL path, but not the correct hostname to connect to the server.

For operating on the DOM, the most widely used interfaces are the IHTMLDocument family, but an object of this type is usually instantiated and populated by the browser object (via its get_document method). The issue here is how to populate such an object with the plain-text HTML we get from WinHTTP.

These two steps go a long way toward laying the foundations of a tool that can crawl the web and operate on DOM models of the pages, rather than doing the plain string post-processing that most tools do. A similar feat can be achieved by invoking the Navigate method on Internet Explorer and analyzing the resulting DOM, but it is easy to see how inefficient it would be to load the whole document (including images) and render it in the browser just to get at the DOM.

Why WinHTTP when there is WinInet

A traditional developer would ask why we should use WinHTTP when we have WinInet, which Microsoft promotes for HTTP (as well as FTP and Gopher) client applications. But WinInet poses a great stumbling block to total automation: when it performs authentication and certain other operations, it displays a user interface. WinHTTP, however, handles these operations programmatically.
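
For instance, server authentication can be handled entirely in code. The fragment below is a minimal sketch (not from the original article, and using placeholder credentials) of how a 401 challenge might be answered with WinHttpSetCredentials:

    // A minimal sketch: answer a 401 challenge programmatically.
    // "username"/"password" are placeholders, and hRequest is an
    // already-opened request handle (see the listing below).
    DWORD dwStatus = 0, dwStatusSize = sizeof(dwStatus);
    WinHttpQueryHeaders( hRequest,
                         WINHTTP_QUERY_STATUS_CODE |
                             WINHTTP_QUERY_FLAG_NUMBER,
                         WINHTTP_HEADER_NAME_BY_INDEX,
                         &dwStatus, &dwStatusSize,
                         WINHTTP_NO_HEADER_INDEX);
    if (dwStatus == 401)   // the server demands authentication
    {
        WinHttpSetCredentials( hRequest, WINHTTP_AUTH_TARGET_SERVER,
                               WINHTTP_AUTH_SCHEME_BASIC,
                               L"username", L"password", NULL);
        // ... then resend the request; no dialog is ever shown
    }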

How the Program looks

For example, if the URL to be traversed is http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq, then WinHTTP expects to connect to news.yahoo.com, the hostname of the web server, and then issue a request for /fc?tmpl=fc&cid=34&in=world&cat=iraq

Given below is a complete description of how to take the URL and split it (using WinHttpCrackUrl), how to adjust the result afterwards (because the cracking does not give us exactly what we want), and how to feed this data to the WinHTTP calls. After that, the data extraction comes into the picture: we connect to the server and query the size of the data available with WinHttpQueryDataAvailable. The catch is that we don't get all the data of a web page in one shot, so we initialize a buffer to which we keep appending the chunks returned by WinHttpReadData, and we have the full page when all the data has been read (indicated by the available data size dropping to zero). This is exactly how an equivalent URL-reader class in Java works. Given below is the complete code, with explicit comments at each step.

USES_CONVERSION;    // requires <atlconv.h>; the listing also needs
                    // <windows.h>, <winhttp.h>, <string> and <new>,
                    // and must be linked with winhttp.lib

    // First, split up the URL
    // First, split up the URL
    URL_COMPONENTS urlComp;    // a structure that will receive the
                               // individual components of the URL
    LPCWSTR varURL;            // ***** varURL is the URL to be
                               // traversed (assumed to be assigned
                               // before this code runs)

    // Initialize the URL_COMPONENTS structure.
    ZeroMemory(&urlComp, sizeof(urlComp));
    urlComp.dwStructSize = sizeof(urlComp);

    //MessageBox(NULL,OLE2T(varURL),"the url to be traversed", 1);

    // Set required component lengths to non-zero so that they
    // are cracked.
    urlComp.dwSchemeLength    = -1;
    urlComp.dwHostNameLength  = -1;
    urlComp.dwUrlPathLength   = -1;
    urlComp.dwExtraInfoLength = -1;

    // Split the URL (varURL) into hostname and URL path
    if (!WinHttpCrackUrl( varURL, wcslen(varURL), 0, &urlComp))
    {
        printf("Error %u in WinHttpCrackUrl.\n", GetLastError());
    }
    
    // You can inspect the cracked URL here.
    // Note that WinHttpCrackUrl does not allocate new strings: each
    // lpsz member points into the original URL and is delimited only
    // by the corresponding dw...Length field, not by a null. That is
    // why the raw pointers below appear to "run on", and why we trim
    // the hostname ourselves afterwards.
    // For our example of varURL =
    // http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq
    // MessageBox(NULL,W2T(urlComp.lpszHostName),
    //            "INTERPRETER-> hostname",MB_OK);
    // We get the hostname as
    // "news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    //  MessageBox(NULL,W2T(urlComp.lpszUrlPath),
    //             "INTERPRETER-> urlpath",MB_OK);
    // We get the urlPath as "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    // MessageBox(NULL,W2T(urlComp.lpszExtraInfo),
    //            "INTERPRETER->extrainfo",MB_OK);
    // We get the extrainfo as ""
    // MessageBox(NULL,W2T(urlComp.lpszScheme),
    //            "INTERPRETER->Scheme",MB_OK);
    // We get the scheme as
    // "http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    
    // Compute the correct hostname: keep only the first
    // dwHostNameLength characters of lpszHostName. (The original
    // listing used a custom String class; std::wstring keeps the
    // code self-contained.)
    std::wstring newhostname( urlComp.lpszHostName,
                              urlComp.dwHostNameLength );


    DWORD dwSize        = 0;
    DWORD dwDownloaded  = 0;
    LPSTR pszOutBuffer;
    BOOL  bResults      = FALSE;
    HINTERNET  hSession = NULL,
               hConnect = NULL,
               hRequest = NULL;

    // Use WinHttpOpen to obtain a session handle.
    hSession = WinHttpOpen( L"WinHTTP Example/1.0",
                            WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                            WINHTTP_NO_PROXY_NAME,
                            WINHTTP_NO_PROXY_BYPASS, 0);

    // Specify an HTTP server.
    // In our example, it expects just "news.yahoo.com"
    if (hSession)
        hConnect = WinHttpConnect( hSession, newhostname.c_str(),
                                   INTERNET_DEFAULT_HTTP_PORT, 0);

    // Create an HTTP request handle.
    // In our example, it expects
    // "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    if (hConnect)
        hRequest = WinHttpOpenRequest( hConnect, L"GET",
                                       urlComp.lpszUrlPath,
                                       NULL, WINHTTP_NO_REFERER,
                                       WINHTTP_DEFAULT_ACCEPT_TYPES,
                                       WINHTTP_FLAG_REFRESH);
    // Send a request.
    if (hRequest)
        bResults = WinHttpSendRequest( hRequest,
                                       WINHTTP_NO_ADDITIONAL_HEADERS,
                                       0, WINHTTP_NO_REQUEST_DATA,
                                       0, 0, 0);

    // End the request.
    if (bResults)
        bResults = WinHttpReceiveResponse( hRequest, NULL);

    std::string respage;       // The buffer that will accumulate the
                               // extracted Web page data

    // Keep checking for data until there is nothing left.
    if (bResults)
        do
        {

            // Check for available data.
            dwSize = 0;
            if (!WinHttpQueryDataAvailable( hRequest, &dwSize))
                printf("Error %u in WinHttpQueryDataAvailable.\n",
                        GetLastError());

            // Allocate space for the buffer (nothrow new, so the
            // NULL check below is actually meaningful).
            pszOutBuffer = new (std::nothrow) char[dwSize+1];
            if (!pszOutBuffer)
            {
                printf("Out of memory\n");
                dwSize=0;
            }
            else
            {
                // Read the Data.
                ZeroMemory(pszOutBuffer, dwSize+1);

                if (!WinHttpReadData( hRequest,
                                      (LPVOID)pszOutBuffer,
                                      dwSize, &dwDownloaded))
                    printf("Error %u in WinHttpReadData.\n",
                            GetLastError());
                else
                    respage.append(pszOutBuffer, dwDownloaded);

                // Free the memory allocated to the buffer.
                delete [] pszOutBuffer;
            }

        } while (dwSize>0);
        // MessageBox(NULL,respage,"fetched page from
        // crawler",1);

When we are done with this, we have the HTML page as a string in the respage buffer. The aim now is to get a DOM model of it, so that we can operate on the data programmatically: query the nodes, access particular elements, and so on. The best way to do DOM manipulation is through the Microsoft-provided interfaces IHTMLDocument, IHTMLDocument2, IHTMLDocument3 and IHTMLDocument4. The following code takes the data from that buffer and makes an IHTMLDocument2 out of it. We can then use its various methods (get_body, get_anchors, the elements' get_innerHTML, and so on) to access the DOM, or query it for a related interface such as IHTMLDocument3 and walk the nodes of the DOM tree.

// Declare an IHTMLDocument2 smart pointer. This requires <mshtml.h>
// and <comdef.h>, and COM must already have been initialized
// with CoInitialize.
IHTMLDocument2Ptr myDocument;
HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL,
  CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void **)&myDocument);
HRESULT hresult = S_OK;
VARIANT *param;
SAFEARRAY *tmpArray;

// Create a new one-dimensional array
// for holding the webpage data
tmpArray = SafeArrayCreateVector(VT_VARIANT, 0, 1);
// Convert the buffer into a binary string (BSTR)
_bstr_t bsData( respage.c_str() );
hresult = SafeArrayAccessData(tmpArray, (LPVOID*) &param);
param->vt = VT_BSTR;
param->bstrVal = bsData.copy();   // the array owns this copy of the
                                  // BSTR, so no SysFreeString here
hresult = SafeArrayUnaccessData(tmpArray);
// Inject the fetched HTML into the document structure
hresult = myDocument->write(tmpArray);
hresult = myDocument->close();
if (tmpArray != NULL) {
    // Destroying the array also frees the BSTR copy it holds.
    SafeArrayDestroy(tmpArray);
}
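
Once the document is populated, we can read it back through the DOM. The following fragment is a minimal sketch (not part of the original listing) that grabs the document body and prints its inner HTML:

// A minimal sketch: read the parsed markup back out of the DOM.
IHTMLElement *pBody = NULL;
if (SUCCEEDED(myDocument->get_body(&pBody)) && pBody)
{
    BSTR bstrHtml = NULL;
    if (SUCCEEDED(pBody->get_innerHTML(&bstrHtml)) && bstrHtml)
    {
        wprintf(L"%s\n", bstrHtml);   // the body markup of the page
        SysFreeString(bstrHtml);
    }
    pBody->Release();
}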

Further Enhancements

I have highlighted the basics of crawling here; the complete crawler design is left to the reader's discretion. For a complete crawler, we need to extract the links from a particular web page and then extract data from those links. Traditional tools do string processing to look for anchor (href) tags and extract the hyperlink strings, which is obviously inefficient because all the page data has to be parsed. Querying the DOM for this purpose is much more efficient: we can just look for all the anchor nodes and extract their href attributes. Making a web-site grabber is very easy with the code I have given above. You can use the get_anchors method of IHTMLDocument2 to get the hyperlinks from a page (see the sketch below) and then call the code above recursively, after implementing proper checks for link loops. Such a program can crawl all the hyperlink-accessible pages from a given base URL, down to any number of levels.
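
As a starting point, here is a minimal sketch (not from the original article) of that link-extraction step via get_anchors; queuing the URLs and checking for loops is left out:

// A minimal sketch: enumerate the anchors of the populated
// document and print each hyperlink target.
IHTMLElementCollection *pAnchors = NULL;
if (SUCCEEDED(myDocument->get_anchors(&pAnchors)) && pAnchors)
{
    long count = 0;
    pAnchors->get_length(&count);
    for (long i = 0; i < count; i++)
    {
        VARIANT index;
        index.vt = VT_I4;
        index.lVal = i;

        IDispatch *pDisp = NULL;
        if (SUCCEEDED(pAnchors->item(index, index, &pDisp)) && pDisp)
        {
            IHTMLAnchorElement *pAnchor = NULL;
            if (SUCCEEDED(pDisp->QueryInterface(
                    IID_IHTMLAnchorElement, (void**)&pAnchor)))
            {
                BSTR bstrHref = NULL;
                if (SUCCEEDED(pAnchor->get_href(&bstrHref)) && bstrHref)
                {
                    wprintf(L"%s\n", bstrHref); // candidate URL to crawl
                    SysFreeString(bstrHref);
                }
                pAnchor->Release();
            }
            pDisp->Release();
        }
    }
    pAnchors->Release();
}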
