Offline Browser using WinInet, URL Moniker and MSHTML APIs
By Muhammad Sheraz Siddiqi

This article describes how to make an offline browser using Visual C++/Win32 APIs.

VC6, VC7, Win2K, WinXP, MFC, Dev



This article demonstrates how to make an offline browser using Visual C++. It uses the following APIs:
- WinInet, to download the HTML of the page
- MSHTML, to parse the HTML and walk its DOM tree
- URL Moniker, to download the resources the page refers to

Below is a brief description of the algorithm:
- Download the HTML of the page and load it into an MSHTML document.
- Look for the src attribute in every tag; the value of the src attribute is the URL of a resource. If the URL of the resource is absolute, e.g., www.google.com/images/logo.gif, it is used as it is. If the URL is relative, e.g., images/logo.gif, it is made absolute using the host name, i.e., its absolute URL becomes <Host Name>/<path>, e.g., www.google.com/images/logo.gif.
- Download the resource and save it to the local folder.
- Update the src attribute to reflect any change in the URL of the resource. Relative URLs remain the same, but for absolute addresses the src attribute is changed to a relative one.
- Copy the original src attribute's value to srcdump. This is just for future reference, so that the original src is still available.

I'd like to explain the scenario behind the development of this code snippet. I was working on a module that records user interactions with Web pages, and I needed to save a web page to the local hard drive without using the web browser's Save As option.
I searched a lot for code that does this, but didn't find any helpful material, so I decided to develop it myself. I am uploading it here because it may help others working on related problems, and to get feedback on any mistakes I made. I didn't use MFC, so that the code works in plain Win32 applications as well as in MFC ones.
Not to mention, it is my first ever article.
LoadHtml() works in two modes based on the value of the bDownload argument:
- If bDownload is true, it assumes that the HTML has already been loaded using the SetHtml() function; it doesn't execute the following code snippet and just populates the Hostname and Port fields from the URL.
- If bDownload is false, it first downloads the HTML from the specified URL and then populates the Hostname and Port fields.

    //Download Web Page using WININET
    //PRECONFIG uses the system proxy settings; INTERNET_OPEN_TYPE_PROXY
    //expects an explicit proxy name rather than NULL
    HINTERNET hNet = InternetOpen("Offline Browser", INTERNET_OPEN_TYPE_PRECONFIG,
                                  NULL, NULL, 0);
    if(hNet == NULL)
        return;

    HINTERNET hFile = InternetOpenUrl(hNet, sUrl.c_str(), NULL, 0, 0, 0);
    if(hFile == NULL)
    {
        InternetCloseHandle(hNet);
        return;
    }

    const int MAX_BUFFER_SIZE = 65536;
    char szBuffer[MAX_BUFFER_SIZE];
    while(true)
    {
        unsigned long nSize = 0;
        BOOL bRet = InternetReadFile(hFile, szBuffer, MAX_BUFFER_SIZE, &nSize);
        if(!bRet || nSize == 0)
            break;
        //Append exactly nSize bytes so an embedded '\0' doesn't truncate the data
        m_sHtml.append(szBuffer, nSize);
    }

    //Release the WinInet handles
    InternetCloseHandle(hFile);
    InternetCloseHandle(hNet);
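The article does not show how the Hostname and Port fields are filled in. As a rough illustration only, the following portable sketch splits a URL with plain string handling. The names `UrlParts` and `ParseUrl` are hypothetical and not part of the article's class, and the default ports (80, or 443 for https) are assumptions.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the pieces LoadHtml() is described as populating.
struct UrlParts {
    std::string scheme;  // e.g. "http://", empty if the URL has no scheme
    std::string host;    // e.g. "www.google.com"
    int port;            // explicit port, or an assumed default
};

// Split "scheme://host:port/path" into its parts (illustrative only).
UrlParts ParseUrl(const std::string& url) {
    UrlParts parts;
    parts.port = 80;  // assumed default when no port is given

    std::string rest = url;
    std::string::size_type pos = rest.find("://");
    if (pos != std::string::npos) {
        parts.scheme = rest.substr(0, pos + 3);
        rest = rest.substr(pos + 3);
        if (parts.scheme == "https://") parts.port = 443;
    }

    // Everything up to the first '/' is "host" or "host:port".
    std::string authority = rest.substr(0, rest.find('/'));
    pos = authority.find(':');
    if (pos != std::string::npos) {
        parts.port = std::atoi(authority.substr(pos + 1).c_str());
        authority = authority.substr(0, pos);
    }
    parts.host = authority;
    return parts;
}
```

For example, `ParseUrl("http://www.google.com:8080/images/logo.gif")` yields the scheme `http://`, the host `www.google.com` and the port 8080. User names, IPv6 literals and other URL forms are deliberately left out of this sketch.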
BrowseOffline() assumes that the HTML is already loaded. First, it constructs the HTML DOM tree by loading the HTML into an MSHTML DOMDocument interface using the following code:

    //Load HTML to Html Document
    SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    VARIANT *param;
    bstr_t bsData = (LPCTSTR)m_sHtml.c_str();
    hr = SafeArrayAccessData(psa, (LPVOID*)&param);
    param->vt = VT_BSTR;
    //Give the array its own copy: SafeArrayDestroy() frees the BSTR it holds,
    //and bsData frees its own string when it goes out of scope
    param->bstrVal = bsData.copy();
    hr = SafeArrayUnaccessData(psa);

    //write your buffer
    hr = pDoc->write(psa);
    //closes the document, "applying" your code
    hr = pDoc->close();
    //Don't forget to free the SAFEARRAY!
    SafeArrayDestroy(psa);

Once the DOM tree is constructed, it's time to traverse it and look for the resources that need downloading.
Currently, I only look for the src attribute in each element. Once a src attribute is found, the resource it points to is downloaded and saved to the local folder.

    //Iterate through all the elements in the document
    MSHTML::IHTMLElementCollectionPtr pCollection = pDoc->all;
    for(long a=0; a<pCollection->length; a++)
    {
        std::string sValue;
        IHTMLElementPtr pElem = pCollection->item( a );
        //If src attribute is found that means we've a resource to download
        if(GetAttribute(pElem, L"src", sValue))
        {
            //If resource URL is relative
            if(!IsAbsolute(sValue))
            {
                ..........
            }
            //If resource URL is absolute
            else
            {
                ..........
            }
        }
    }

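The loop relies on IsAbsolute(), whose body is not shown in the article. The sketch below is one possible way to write such a test, together with the "make it absolute using the host name" step from the algorithm. Both function names are hypothetical, and the rule used (a scheme such as `http://` or a leading `www.` means absolute) is an assumption based on the article's www.google.com example.

```cpp
#include <string>

// Assumed rule: a URL counts as absolute if it carries a scheme ("http://")
// or starts with "www.", matching the article's www.google.com example.
bool IsAbsoluteSketch(const std::string& url) {
    return url.find("://") != std::string::npos ||
           url.compare(0, 4, "www.") == 0;
}

// Join a host name and a relative path with exactly one '/' between them.
// URLs that are already absolute are returned unchanged.
std::string MakeAbsoluteSketch(const std::string& host,
                               const std::string& relative) {
    if (IsAbsoluteSketch(relative)) return relative;

    std::string result = host;
    // Drop a trailing '/' on the host so the separator is not doubled.
    if (!result.empty() && result[result.size() - 1] == '/')
        result.erase(result.size() - 1);
    // Add the separator unless the relative path already starts with one.
    if (relative.empty() || relative[0] != '/') result += '/';
    return result + relative;
}
```

With these definitions, `MakeAbsoluteSketch("www.google.com", "images/logo.gif")` gives `www.google.com/images/logo.gif`. A production version would also need to handle `../` segments and paths relative to the current page rather than to the host root.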
If the src attribute has a relative URL of the resource, the following actions are taken:
- Create the directories needed to hold the resource and download it.
- Update the src attribute to the relative local path if required.
- Save the original src attribute as srcdump for future reference.

    //If resource URL is relative
    if(!IsAbsolute(sValue))
    {
        if(sValue[0] == '/')
            sValue = sValue.substr(1, sValue.length()-1);

        //Create directories needed to hold this resource
        CreateDirectories(sValue, m_sDir);

        //Download the resource
        if(!DownloadResource(sValue, sValue))
        {
            //Unable to download the resource: point src at the absolute URL
            //and put the original src attribute as srcdump for future references
            std::string sTemp = m_sScheme + m_sHost;
            sTemp += sValue;
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
        else
        {
            //Downloaded: src stays as it is. srcdump is set to the same value;
            //it is of no use here, it just keeps the HTML DOM consistent
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }

If the src attribute has an absolute URL of the resource, the following actions are taken:
- Trim the host name, create the directories needed to hold the resource and download it.
- Update the src attribute to the relative local path.
- Save the original src attribute as srcdump for future reference.

    //If resource URL is absolute
    else
    {
        std::string sTemp;
        //Make URL relative
        sTemp = TrimHostName(sValue);

        //Create directories needed to hold this resource
        CreateDirectories(sTemp, m_sDir);

        //Download the resource
        if(DownloadResource(sTemp, sTemp))
        {
            //Update src to the new src and put the original src attribute as
            //srcdump just for future references
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }

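TrimHostName() is called above but its body is not listed. As a minimal sketch, assuming its job is to drop the scheme and host and keep the path, it could look like the following. The name `TrimHostNameSketch` is hypothetical, and returning `/` for a URL without a path is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for TrimHostName(): drop the scheme and host,
// keeping the path that starts at the first '/' after the host.
std::string TrimHostNameSketch(const std::string& url) {
    // Skip past "scheme://" if the URL has one.
    std::string::size_type start = 0;
    std::string::size_type scheme = url.find("://");
    if (scheme != std::string::npos) start = scheme + 3;

    // The path begins at the first '/' that follows the host name.
    std::string::size_type slash = url.find('/', start);
    if (slash == std::string::npos) return "/";  // host only, no path
    return url.substr(slash);
}
```

For `http://www.google.com/images/logo.gif` this returns `/images/logo.gif`; the leading slash is then removed by the `sTemp[0] == '/'` check in the article's code before src is updated.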
The original HTML changes because the src values are modified and srcdump attributes are added. The updated HTML is finally saved under the name [GUID].html, where GUID is a Globally Unique Identifier generated using CoCreateGuid(). This makes sure that it doesn't overwrite any previously saved web site in the same folder.

    //Get updated HTML out of amendments we made and save it to the described directory
    //(OLE2T needs USES_CONVERSION declared earlier in the same function)
    MSHTML::IHTMLDocument3Ptr pDoc3 = pDoc;
    MSHTML::IHTMLElementPtr pDocElem;
    pDoc3->get_documentElement(&pDocElem);
    BSTR bstrHtml = NULL;
    pDocElem->get_outerHTML(&bstrHtml);
    std::string sNewHtml((const char*)OLE2T(bstrHtml));
    //The caller owns the BSTR returned by get_outerHTML()
    SysFreeString(bstrHtml);
    SaveHtml(sNewHtml);

Once we have the absolute URL of the resource, it is straightforward to download it and save it to the appropriate local folder.

    //Download specified resource
    return URLDownloadToFile(NULL, sTemp.c_str(), sTemp2.c_str(), 0, NULL) == S_OK;

I've tried to maintain the same directory structure in the local folder as on the website. For example, downloading the resource images/logo.gif first creates a folder named images inside the directory specified by the user and then downloads logo.gif into that folder.
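The directory-mirroring step can be sketched in portable C++17 with std::filesystem. This is not the article's CreateDirectories() implementation, which targets VC6/VC7 and is not shown. The name `CreateDirectoriesSketch` and its return value are assumptions made for illustration.

```cpp
#include <filesystem>
#include <string>

// Hypothetical equivalent of CreateDirectories(resource, baseDir): create every
// folder in the resource's path below baseDir and return the final local path.
std::filesystem::path CreateDirectoriesSketch(const std::string& resource,
                                              const std::filesystem::path& baseDir) {
    // Drop a leading '/' so the resource path is treated as relative to baseDir.
    std::string relative = resource;
    if (!relative.empty() && relative[0] == '/') relative.erase(0, 1);

    std::filesystem::path target = baseDir / relative;
    // create_directories() builds all missing parents; existing ones are fine.
    // The file itself is not created here, only the folders that will hold it.
    if (!target.parent_path().empty())
        std::filesystem::create_directories(target.parent_path());
    return target;
}
```

Calling `CreateDirectoriesSketch("images/logo.gif", "c:\\MyTemp")` would create `c:\MyTemp\images` and return the path where logo.gif should be written.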
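The class can be used as follows:

    COfflineBrowser obj;
    char szUrl[1024] = "";
    printf("Enter URL: ");
    //fgets() is bounded, unlike gets(); it keeps the newline, so strip it
    //(strcspn() needs <string.h>)
    if(fgets(szUrl, sizeof(szUrl), stdin) != NULL)
        szUrl[strcspn(szUrl, "\r\n")] = '\0';
    obj.SetDir("c:\\MyTemp\\");
    obj.LoadHtml(szUrl, true);
    obj.BrowseOffline();
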
Last Updated: 22 Mar 2005. Editor: Smitha Vijayan
Copyright 2005 by Muhammad Sheraz Siddiqi. Everything else Copyright © CodeProject, 1999-2009.