
Offline Browser using WinInet, URL Moniker and MSHTML APIs

22 Mar 2005
This article describes how to make an offline browser using Visual C++/Win32 APIs.

[Screenshots: output of the sample program; the web site saved on the hard disk; resources saved in appropriate folders]

Introduction

This article demonstrates how to make an offline browser using Visual C++. It uses the following APIs:

  1. WinInet - Downloads the HTML of the web pages.
  2. URL Moniker - Downloads all the resources (e.g., images, style sheets) to the local folder.
  3. MSHTML - Traverses the HTML DOM (Document Object Model) tree to get the list of all the resources that need to be downloaded.

Below is a brief description of the algorithm:

  1. Download the HTML of the web page (e.g., www.google.com) and save it to a specified folder on the hard disk.
  2. Traverse the HTML document and look for the src attribute in every tag; the value of the src attribute is the URL of a resource. If the URL is absolute (e.g., www.google.com/images/logo.gif), it can be downloaded as is, but if it is relative (e.g., images/logo.gif), make it absolute by prepending the host name, i.e., <Host Name>/<path>, for example www.google.com/images/logo.gif. A small illustrative helper for this conversion is shown after the list.
  3. Update the src attribute to reflect any changes in the URL of the resource. Relative URLs remain the same, but absolute URLs are rewritten to relative ones so that they point at the local copies.
  4. Save the original value of the src attribute in a srcdump attribute, just for future reference, so that the original src is still available.
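As a small illustration of step 2, a helper along these lines can turn a relative resource URL into an absolute one. The name MakeAbsolute() is only for illustration; the article's class performs this with its own members, shown in the sections below.

//Illustration only: prepend the scheme and host to a relative resource path.
//MakeAbsolute("http://", "www.google.com", "images/logo.gif") gives
//"http://www.google.com/images/logo.gif".
#include <string>

std::string MakeAbsolute(const std::string& sScheme,
                         const std::string& sHost,
                         const std::string& sRelative)
{
    std::string sPath = sRelative;
    if(!sPath.empty() && sPath[0] == '/')
        sPath = sPath.substr(1);
    return sScheme + sHost + "/" + sPath;
}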

Background

I'd like to explain the reason/scenario behind the development of this code snippet. I was working on a module that records user interactions with web pages, and I needed to save the web page on the local hard drive without using the web browser's Save As option.

I searched a lot for code that would do this for me, but didn't find any helpful material, so I decided to develop it myself. I am uploading it here because it may help others working on related stuff, and to get feedback on any mistakes I made. I didn't use MFC, to keep the code compatible with plain Win32 applications as well as with MFC.

Not to mention, it is my first ever article.

Using the code

Download HTML of the Web Page:

LoadHtml() works in two modes based on the value of the bDownload argument:

  1. If bDownload is true, it first downloads the HTML from the specified URL using the code snippet below and then populates the Hostname and Port fields.
  2. If bDownload is false, it assumes that the HTML has already been loaded using the SetHtml() function, skips the download, and only populates the Hostname and Port fields from the URL.
    //Download Web Page using WININET
    
    //INTERNET_OPEN_TYPE_PRECONFIG uses the system proxy settings; passing
    //INTERNET_OPEN_TYPE_PROXY with a NULL proxy name is not valid
    HINTERNET hNet = InternetOpen("Offline Browser",
                     INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if(hNet == NULL)
        return;
    
    HINTERNET hFile = InternetOpenUrl(hNet, sUrl.c_str(), NULL, 0, 0, 0);
    if(hFile == NULL)
    {
        InternetCloseHandle(hNet);
        return;
    }
    
    //Read the page in chunks until InternetReadFile reports no more data
    while(true)
    {
        const int MAX_BUFFER_SIZE = 65536;
        unsigned long nSize = 0;
        char szBuffer[MAX_BUFFER_SIZE+1];
        BOOL bRet = InternetReadFile(hFile, szBuffer, MAX_BUFFER_SIZE, &nSize);
        if(!bRet || nSize == 0)
            break;
        szBuffer[nSize] = '\0';
        m_sHtml += szBuffer;
    }
    
    //Release the WinInet handles
    InternetCloseHandle(hFile);
    InternetCloseHandle(hNet);

Load HTML into MSHTML Document Interface:

BrowseOffline() assumes that the HTML is already loaded. First, it constructs the HTML DOM tree by loading the HTML into an MSHTML document (IHTMLDocument2) using the following code:

//Load HTML to Html Document

//Pack the HTML string into a one-element SAFEARRAY of VARIANTs,
//which is what IHTMLDocument2::write() expects
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT *param;
hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
//Allocate a BSTR copy of the HTML; SafeArrayDestroy() frees it with the array
param->bstrVal = _bstr_t(m_sHtml.c_str()).copy();
SafeArrayUnaccessData(psa);

//write the buffer into the document
hr = pDoc->write(psa);

//close the document, "applying" the written HTML
hr = pDoc->close();

//Don't forget to free the SAFEARRAY!
SafeArrayDestroy(psa);
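The snippet above assumes that pDoc already points to an MSHTML document. For reference, one way to create such a document with the #import-generated smart pointers is sketched below; where exactly the article's class does this is not shown, so treat the placement and error handling as assumptions.

//Sketch only: create an empty MSHTML document object to write the HTML into.
//Requires the type library import, e.g.  #import <mshtml.tlb>
MSHTML::IHTMLDocument2Ptr pDoc;
HRESULT hr = pDoc.CreateInstance(__uuidof(MSHTML::HTMLDocument));
if(FAILED(hr) || pDoc == NULL)
    return;     //could not create the HTML document object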

Traverse DOM Tree and download all the resources:

Once the DOM tree is constructed, it's time to traverse it and look for the resources that need downloading.

Currently, I only look for the src attribute on the elements; once a src attribute is found, the resource it refers to is downloaded and saved to the local folder.

//Iterate through all the elements in the document

MSHTML::IHTMLElementCollectionPtr pCollection = pDoc->all;
for(long a = 0; a < pCollection->length; a++)
{
    std::string sValue;
    MSHTML::IHTMLElementPtr pElem = pCollection->item(a);
    //If src attribute is found that means we've a resource to download

    if(GetAttribute(pElem, L"src", sValue))
    {
        //If resource URL is relative

        if(!IsAbsolute(sValue))
        {
            ..........
        }
        //If resource URL is absolute

        else
        {
            ..........
        }
    }
}
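The GetAttribute(), SetAttribute() and IsAbsolute() helpers used above are small members of the class and are not listed here. As an illustration, GetAttribute() can be built on IHTMLElement::getAttribute() as sketched below; the body is an assumption, not necessarily the code in the download, and SetAttribute() would be the analogous wrapper around IHTMLElement::setAttribute().

//Sketch only: read an attribute as it appears in the HTML source.
//Passing lFlags = 2 to getAttribute() returns the value exactly as written,
//so a relative src is not resolved into an absolute URL by MSHTML.
bool GetAttribute(MSHTML::IHTMLElementPtr pElem, const wchar_t* szName,
                  std::string& sValue)
{
    if(pElem == NULL)
        return false;
    _variant_t vtValue = pElem->getAttribute(_bstr_t(szName), 2);
    if(vtValue.vt != VT_BSTR || vtValue.bstrVal == NULL)
        return false;
    sValue = (const char*)_bstr_t(vtValue);
    return true;
}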

Download Resource with Relative Path

If the src attribute holds a relative URL, the following actions are taken:

  1. Strip any leading '/' and create the directories needed to hold the resource inside the local folder.
  2. Download the resource and save it to the appropriate folder in the local folder. Because the local directory structure mirrors the web site, the relative src already points at the local copy and is left unchanged.
  3. If the download fails, rewrite the src attribute to the absolute online URL built from the scheme and host name, so that the link still resolves.
  4. Save the value of the original src attribute as srcdump for future reference.
    //If resource URL is relative
    
    if(!IsAbsolute(sValue))
    {
        if(sValue[0] == '/')
            sValue = sValue.substr(1, sValue.length()-1);
        //Create directories needed to hold this resource
        CreateDirectories(sValue, m_sDir);
        //Download the resource
        if(!DownloadResource(sValue, sValue))
        {
            //Unable to download the resource: point src at the absolute
            //online URL instead and keep the original value in srcdump
            std::string sTemp = m_sScheme + m_sHost;
            sTemp += sValue;
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
        //Resource downloaded successfully
        else
        {
            //The relative src already points at the local copy, so only
            //srcdump is added, just to keep the HTML DOM consistent
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }

Download Resource with Absolute Path

If the src attribute holds an absolute URL, the following actions are taken:

  1. Make the URL relative by trimming the host name (a sketch of the TrimHostName() helper, which is not listed in the article, follows the code snippet).
  2. Download the resource and save it to the appropriate folder in the local folder.
  3. Update the src attribute to the relative local path.
  4. Save the value of the original src attribute as srcdump for future reference.
    //If resource URL is absolute
    
    else
    {
        //Make URL relative by trimming the scheme and host name
        std::string sTemp = TrimHostName(sValue);
        //Create directories needed to hold this resource
        CreateDirectories(sTemp, m_sDir);
        //Download the resource
        if(DownloadResource(sTemp, sTemp))
        {
            //Update src to the new relative path and keep the original
            //src value in srcdump just for future reference
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }
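TrimHostName() is a small helper of the class that is not listed in the article. A sketch along the following lines would do the job, assuming m_sScheme and m_sHost hold the parts parsed from the start URL; it is illustrative only and not necessarily the code in the download.

//Sketch only: strip the scheme and host from an absolute URL so that the
//remainder can be used as a path relative to the local folder
std::string COfflineBrowser::TrimHostName(const std::string& sUrl)
{
    std::string sPrefix = m_sScheme + m_sHost;
    std::string sPath = sUrl;
    if(sPath.compare(0, sPrefix.length(), sPrefix) == 0)
        sPath = sPath.substr(sPrefix.length());
    if(!sPath.empty() && sPath[0] == '/')
        sPath = sPath.substr(1);
    return sPath;
}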

Save updated HTML

The original HTML changes because of the updated src values and the added srcdump attributes. The updated HTML is finally saved under the name [GUID].html, where GUID is a Globally Unique Identifier generated using CoCreateGuid(). This just makes sure that it doesn't overwrite any web site already saved in the same folder.

//Get the updated HTML after the amendments we made and save it to the specified directory

USES_CONVERSION;    //required by the ATL OLE2T() conversion macro
MSHTML::IHTMLDocument3Ptr pDoc3 = pDoc;
MSHTML::IHTMLElementPtr pDocElem;
pDoc3->get_documentElement(&pDocElem);
BSTR bstrHtml = NULL;
pDocElem->get_outerHTML(&bstrHtml);
std::string sNewHtml((const char*)OLE2T(bstrHtml));
SysFreeString(bstrHtml);    //free the BSTR returned by get_outerHTML()
SaveHtml(sNewHtml);
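SaveHtml() itself is not listed in the article. A version that writes the HTML to a [GUID].html file inside the target directory, as described above, could be sketched like this; the implementation in the download may differ.

//Sketch only: save the HTML under a GUID-based file name inside m_sDir
//so that an existing saved web site is never overwritten
#include <objbase.h>
#include <cstdlib>
#include <fstream>

void COfflineBrowser::SaveHtml(const std::string& sHtml)
{
    GUID guid;
    if(FAILED(CoCreateGuid(&guid)))
        return;

    //Convert the GUID to its string form, e.g. {1A2B3C4D-....}
    wchar_t wszGuid[64] = {0};
    StringFromGUID2(guid, wszGuid, 64);
    char szGuid[64] = {0};
    wcstombs(szGuid, wszGuid, sizeof(szGuid) - 1);

    std::string sPath = m_sDir + szGuid + ".html";
    std::ofstream file(sPath.c_str(), std::ios::binary);
    file << sHtml;
}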

Download Resources

Once we have the URL of the resource and the local file path to save it to, it is straightforward to download it with URLDownloadToFile():

//Download specified resource to the local file

return (URLDownloadToFile(NULL, sTemp.c_str(), sTemp2.c_str(), 0, NULL) == S_OK);

Directory Structure of the Web Site

I've tried to maintain the same directory structure in the local folder as on the web site. For example, downloading the resource images/logo.gif first creates a folder images inside the directory specified by the user and then downloads logo.gif into that folder. A sketch of a CreateDirectories() helper that does this follows.
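CreateDirectories() is another helper that is not listed in the article. The sketch below creates each missing sub-folder of a relative resource path under the base directory; treat it as an illustration of the idea rather than the exact code in the download.

//Sketch only: create every sub-folder of a relative path such as
//"images/logos/logo.gif" under the base directory, so the file can be
//saved at sDir + "images\\logos\\logo.gif"
#include <string>
#include <windows.h>

void CreateDirectories(const std::string& sRelativePath, const std::string& sDir)
{
    std::string sCurrent = sDir;                  //e.g. "c:\\MyTemp\\"
    std::string::size_type nStart = 0;
    std::string::size_type nSlash;

    //Walk the path one '/'-separated component at a time; the last
    //component is the file name, so it is not created as a folder
    while((nSlash = sRelativePath.find('/', nStart)) != std::string::npos)
    {
        sCurrent += sRelativePath.substr(nStart, nSlash - nStart);
        //CreateDirectoryA simply fails if the folder already exists
        CreateDirectoryA(sCurrent.c_str(), NULL);
        sCurrent += "\\";
        nStart = nSlash + 1;
    }
}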

Sample Usage

COfflineBrowser obj;
char szUrl[1024] = {0};
printf("Enter URL: ");
fgets(szUrl, sizeof(szUrl), stdin);     //gets() is unsafe; fgets() bounds the read
szUrl[strcspn(szUrl, "\r\n")] = '\0';   //strip the trailing newline
obj.SetDir("c:\\MyTemp\\");
obj.LoadHtml(szUrl, true);
obj.BrowseOffline();
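The class relies on COM (the MSHTML smart pointers and URLDownloadToFile()), so the calling thread has to initialize COM first, and the project must link against wininet.lib, urlmon.lib and ole32.lib. A minimal host program around the sample above might look like this; the exact project settings of the download are not shown here, so take the includes as a reasonable assumption.

//Headers and libraries the snippets in this article rely on
//(link against wininet.lib, urlmon.lib and ole32.lib)
#include <windows.h>
#include <wininet.h>    //InternetOpen, InternetOpenUrl, InternetReadFile
#include <urlmon.h>     //URLDownloadToFile
#include <comdef.h>     //_bstr_t, _variant_t
#import <mshtml.tlb>    //MSHTML namespace and smart pointers

int main()
{
    //MSHTML and URL monikers are COM based, so initialize COM first
    CoInitialize(NULL);
    {
        //Scope the object so its smart pointers are released
        //before CoUninitialize() is called
        COfflineBrowser obj;
        char szUrl[] = "http://www.google.com";
        obj.SetDir("c:\\MyTemp\\");
        obj.LoadHtml(szUrl, true);
        obj.BrowseOffline();
    }
    CoUninitialize();
    return 0;
}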

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
