


Introduction
This article demonstrates how to build an offline browser using Visual C++. It uses the following APIs:
- WinInet - Downloads the HTML of the web pages.
- URL Moniker - Downloads the resources (e.g., images and style sheets) to a local folder.
- MSHTML - Traverses the HTML DOM (Document Object Model) tree to get the list of all the resources that need to be downloaded.
Below is a brief description of the algorithm:
- Download the HTML of the web page, e.g., www.google.com, and save it to the hard disk in a specified folder.
- Traverse the HTML document and look for the src attribute in every tag; its value is the URL of a resource. If the URL is absolute, e.g., www.google.com/images/logo.gif, it is used as is. If it is relative, e.g., images/logo.gif, make it absolute using the host name, i.e., <Host Name>/<path>, e.g., www.google.com/images/logo.gif.
- Update the src attribute to reflect any change in the URL of the resource. Relative URLs remain the same, but an absolute URL is replaced with a relative one that points to the local copy.
- Save the original value of the src attribute to srcdump, so that it is still available for future reference.
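The URL handling in the second step can be sketched in portable C++. This is only an illustration, not the code used by the class: IsAbsoluteUrl and MakeAbsoluteUrl are hypothetical names, the sketch assumes that an absolute URL carries a scheme such as http:// (unlike the scheme-less examples above), and it does not resolve "../" segments.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: a URL counts as absolute when a "://" scheme
// separator appears before any path, query or fragment character.
bool IsAbsoluteUrl(const std::string& url)
{
    std::string::size_type pos = url.find("://");
    return pos != std::string::npos && pos > 0 &&
           url.find_first_of("/?#") > pos;
}

// Hypothetical helper: builds "<scheme>://<host>/<path>" from a relative
// URL. Assumes 'host' has no trailing slash.
std::string MakeAbsoluteUrl(const std::string& scheme,
                            const std::string& host,
                            const std::string& url)
{
    assert(!host.empty());
    if (IsAbsoluteUrl(url))
        return url;                     // already absolute, use as is
    std::string path = url;
    if (!path.empty() && path[0] == '/')
        path.erase(0, 1);               // avoid a double slash after the host
    return scheme + "://" + host + "/" + path;
}
```

For instance, MakeAbsoluteUrl("http", "www.google.com", "images/logo.gif") yields http://www.google.com/images/logo.gif.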
Background
I'd like to explain the scenario behind the development of this code. I was working on a module that records user interactions with web pages, and I needed to save a web page to the local hard drive without using the web browser's Save As option.
I searched a lot for code that does this, but didn't find any helpful material, so I decided to develop it myself. I am uploading it here because it may help others working on related problems, and to get feedback on any mistakes I made. I avoided MFC so that the code can be used in plain Win32 applications as well as in MFC applications.
Not to mention, it is my first ever article.
Using the code
Download HTML of the Web Page:
LoadHtml() works in two modes based on the value of the bDownload argument:
- If bDownload is true, it first downloads the HTML from the specified URL using the following code snippet, and then populates the Hostname and Port fields.
- If bDownload is false, it assumes that the HTML has already been loaded using the SetHtml() function; it skips the following code snippet and just populates the Hostname and Port fields from the URL.
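// Open a WinInet session using the system's configured proxy settings.
HINTERNET hNet = InternetOpen("Offline Browser",
    INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
if(hNet == NULL)
    return;
HINTERNET hFile = InternetOpenUrl(hNet, sUrl.c_str(), NULL, 0, 0, 0);
if(hFile == NULL)
{
    InternetCloseHandle(hNet);
    return;
}
const int MAX_BUFFER_SIZE = 65536;
char szBuffer[MAX_BUFFER_SIZE];
while(true)
{
    unsigned long nSize = 0;
    BOOL bRet = InternetReadFile(hFile, szBuffer, MAX_BUFFER_SIZE, &nSize);
    // A successful read of zero bytes marks the end of the response.
    if(!bRet || nSize == 0)
        break;
    // Append exactly nSize bytes so that embedded NULs do not truncate the data.
    m_sHtml.append(szBuffer, nSize);
}
// Release both handles once the download is complete.
InternetCloseHandle(hFile);
InternetCloseHandle(hNet);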
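Both modes of LoadHtml() end by populating the Hostname and Port fields from the URL. The article does not show that step, so the following is only a sketch of one way to do it in portable C++. UrlParts and ParseHostAndPort are hypothetical names, the defaults (http, port 80, or 443 for https) are assumptions, and user names or IPv6 literals in the URL are not handled. On Windows, WinInet's InternetCrackUrl() can do the same job.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical container for the pieces extracted from the URL.
struct UrlParts
{
    std::string scheme;
    std::string host;
    int port;
};

// Hypothetical helper: splits "[scheme://]host[:port][/path]" into parts.
UrlParts ParseHostAndPort(const std::string& url)
{
    UrlParts parts;
    parts.scheme = "http";              // assumed default when no scheme is given
    std::string rest = url;
    std::string::size_type pos = rest.find("://");
    if (pos != std::string::npos)
    {
        parts.scheme = rest.substr(0, pos);
        rest.erase(0, pos + 3);
    }
    // The host[:port] part ends at the first '/', '?' or '#'.
    std::string authority = rest.substr(0, rest.find_first_of("/?#"));
    parts.port = (parts.scheme == "https") ? 443 : 80;
    std::string::size_type colon = authority.rfind(':');
    if (colon != std::string::npos)
    {
        parts.port = std::atoi(authority.c_str() + colon + 1);
        authority.erase(colon);
    }
    parts.host = authority;
    return parts;
}
```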
Load HTML into MSHTML Document Interface:
BrowseOffline() assumes that the HTML is already loaded. First, it constructs the HTML DOM tree by loading the HTML into an MSHTML DOMDocument interface using the following code:
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT *param;
bstr_t bsData = (LPCTSTR)m_sHtml.c_str();
hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
// The array takes ownership of this copy; SafeArrayDestroy() frees it.
param->bstrVal = bsData.copy();
hr = SafeArrayUnaccessData(psa);
hr = pDoc->write(psa);
hr = pDoc->close();
SafeArrayDestroy(psa);
Traverse DOM Tree and download all the resources:
Once the DOM tree is constructed, it's time to traverse it and look for the resources that need downloading.
Currently, I only look for the src attribute in all the elements; once a src attribute is found, the resource is downloaded and saved to the local folder.
MSHTML::IHTMLElementCollectionPtr pCollection = pDoc->all;
for(long a=0;a<pCollection->length;a++)
{
    std::string sValue;
    IHTMLElementPtr pElem = pCollection->item( a );
    if(GetAttribute(pElem, L"src", sValue))
    {
        if(!IsAbsolute(sValue))
        {
            ..........
        }
        else
        {
            ..........
        }
    }
}
Download Resource with Relative Path
If the src attribute has a relative URL of the resource, the following actions are taken:
- Construct the absolute URL from the relative URL using the Hostname and Port fields.
- Download the resource and save it to the appropriate folder inside the local folder.
- Update the src attribute to the relative local path if required. If the download fails, point src at the absolute URL instead.
- Save the value of the original src attribute as srcdump for future reference.
if(!IsAbsolute(sValue))
{
    // Keep the untouched value so that srcdump records the original src.
    std::string sOriginal = sValue;
    // Strip a leading slash so that the path is relative to the local folder.
    if(sValue[0] == '/')
        sValue = sValue.substr(1, sValue.length()-1);
    CreateDirectories(sValue, m_sDir);
    if(!DownloadResource(sValue, sValue))
    {
        // Download failed: fall back to the absolute URL.
        std::string sTemp = m_sScheme + m_sHost;
        sTemp += sValue;
        if(sTemp[0] == '/')
            sTemp = sTemp.substr(1, sTemp.length()-1);
        SetAttribute(pElem, L"src", sTemp);
    }
    else
    {
        // Download succeeded: src must match the local relative path.
        SetAttribute(pElem, L"src", sValue);
    }
    SetAttribute(pElem, L"srcdump", sOriginal);
}
Download Resource with Absolute Path
If the src attribute has an absolute URL of the resource, the following actions are taken:
- Download the resource and save it to the appropriate folder inside the local folder.
- Update the src attribute to the relative local path.
- Save the value of the original src attribute as srcdump for future reference.
else
{
    // Drop the host name to get the path relative to the local folder.
    std::string sTemp;
    sTemp = TrimHostName(sValue);
    CreateDirectories(sTemp, m_sDir);
    if(DownloadResource(sTemp, sTemp))
    {
        if(sTemp[0] == '/')
            sTemp = sTemp.substr(1, sTemp.length()-1);
        // Point src at the local copy and keep the original URL in srcdump.
        SetAttribute(pElem, L"src", sTemp);
        SetAttribute(pElem, L"srcdump", sValue);
    }
}
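The article does not list TrimHostName(), so here is a rough guess at what such a helper might look like in portable C++. TrimHostNameSketch is a hypothetical name; it assumes the URL has the form [scheme://]host/path and simply returns everything after the first slash that follows the host.

```cpp
#include <string>

// Hypothetical stand-in for TrimHostName(): drops "[scheme://]host" and
// returns the remaining path without a leading slash.
std::string TrimHostNameSketch(const std::string& url)
{
    std::string::size_type start = 0;
    std::string::size_type scheme = url.find("://");
    if (scheme != std::string::npos)
        start = scheme + 3;             // skip past "scheme://"
    std::string::size_type slash = url.find('/', start);
    if (slash == std::string::npos)
        return std::string();           // the URL has no path component
    return url.substr(slash + 1);
}
```

The result, e.g., images/logo.gif, can serve both as the folder path to create under the target directory and as the new relative src value.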
Save updated HTML
The original HTML changes because the src values are modified and the srcdump attributes are added. The updated HTML is finally saved with the name [GUID].html, where GUID is a Globally Unique Identifier generated using CoCreateGuid(). This just makes sure that it doesn't overwrite any existing web site saved in the same folder.
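MSHTML::IHTMLDocument3Ptr pDoc3 = pDoc;
MSHTML::IHTMLElementPtr pDocElem;
pDoc3->get_documentElement(&pDocElem);
BSTR bstrHtml = NULL;
pDocElem->get_outerHTML(&bstrHtml);
// OLE2A always yields a char* string, even in Unicode builds.
std::string sNewHtml(OLE2A(bstrHtml));
// The caller owns the returned BSTR and must free it.
SysFreeString(bstrHtml);
SaveHtml(sNewHtml);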
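The GUID-based file name can be illustrated with a small portable sketch. GuidFileName is a hypothetical helper, not part of the class: it assumes the 16 GUID bytes (e.g., copied from the structure filled in by CoCreateGuid()) are already in the order in which they should be printed, and it formats them as 32 lowercase hex digits.

```cpp
#include <string>

// Hypothetical helper: formats 16 raw GUID bytes as a 32-character
// lowercase hex file name followed by the given extension.
std::string GuidFileName(const unsigned char (&bytes)[16],
                         const std::string& ext)
{
    static const char digits[] = "0123456789abcdef";
    std::string name;
    for (int i = 0; i < 16; ++i)
    {
        name += digits[bytes[i] >> 4];    // high nibble
        name += digits[bytes[i] & 0x0F];  // low nibble
    }
    return name + ext;
}
```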
Download Resources
Once we have the absolute URL of the resource, it is straightforward to download it and save it to an appropriate local folder.
// sTemp is the absolute URL; sTemp2 is the local file path.
return URLDownloadToFile(NULL, sTemp.c_str(), sTemp2.c_str(), 0, NULL) == S_OK;
Directory Structure of the Web Site
I've tried to maintain the same directory structure in the local folder as on the web site. For example, downloading the resource images/logo.gif first creates a folder named images inside the directory specified by the user and then downloads logo.gif into that folder.
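As a rough illustration of that step, the sketch below lists the folders that would have to exist before a resource can be written. FoldersFor is a hypothetical helper, not the article's CreateDirectories(); it assumes forward slashes in the resource path and a base directory that already ends with a path separator. Each returned folder could then be passed, in order, to a call such as CreateDirectory().

```cpp
#include <string>
#include <vector>

// Hypothetical helper: lists every folder that must exist before the file
// named by 'resource' can be written, outermost folder first.
std::vector<std::string> FoldersFor(const std::string& baseDir,
                                    const std::string& resource)
{
    std::vector<std::string> folders;
    std::string::size_type pos = 0;
    while ((pos = resource.find('/', pos)) != std::string::npos)
    {
        if (pos > 0)                      // ignore a leading slash
            folders.push_back(baseDir + resource.substr(0, pos));
        ++pos;
    }
    return folders;
}
```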
Sample Usage
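COfflineBrowser obj;
char szUrl[1024] = "";
printf("Enter URL: ");
// fgets() bounds the read, unlike gets(); strip the trailing newline.
fgets(szUrl, sizeof(szUrl), stdin);
szUrl[strcspn(szUrl, "\r\n")] = '\0';
// The target directory must already exist.
obj.SetDir("c:\\MyTemp\\");
obj.LoadHtml(szUrl, true);
obj.BrowseOffline();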
Hi everyone!
First of all, thanks a lot. I integrated your code, along with some corrections suggested by others, into my project and it helped me very much.
Here are two comments for your reference:
1. Using IE7 (without a proxy setting). This may not be a problem only for IE7 users. In the function COfflineBrowser::LoadHtml, the 2nd parameter of InternetOpen should be changed from INTERNET_OPEN_TYPE_PROXY to INTERNET_OPEN_TYPE_DIRECT.
2. Character set conversion from wchar* to char* (the problem occurs when using the Unicode character set).
1.) In the function COfflineBrowser::BrowseOffline():
//std::string sNewHtml((const char*)OLE2T(bstrHtml));
std::string sNewHtml((const char*)OLE2A(bstrHtml));
2.) In the function COfflineBrowser::GetAttribute:
//sValue = (LPCTSTR)OLE2T(var.bstrVal);
sValue = (LPSTR)OLE2A(var.bstrVal);
Thanks again, everyone!
Hi,
First of all, thanks a lot. I integrated your code in my project and it helped me very much.
However, my project is multithreaded and I rapidly saw that there were some corrections needed regarding handles and memory leaks. These modifications are located in the functions LoadHtml() and BrowseOffline().
I am not an expert in COM or MSHTML but these modifications are based on code samples found on the web:
1) After I finished using the handles returned by InternetOpen and InternetOpenUrl, I released them using ::InternetCloseHandle(hFile); ::InternetCloseHandle(hNet);
2) Replace "VARIANT *param" with "LPVARIANT param"
3) call SafeArrayUnaccessData(psa) before SafeArrayDestroy(psa);
4) call pDocElem.Release() and pDoc.Release() when finished using them
5) call ::SysFreeString(bstrHtml); when finished using bstrHtml.
Thanks 
Principal2
I find that it works well on computers with IE 6.0 installed, but it can't perform the download on computers with IE 7.0 installed. What's wrong?
camel
Hi, this is an interesting and good project. However, I have found that in void COfflineBrowser::BrowseOffline() there is a loop:
for(long a=0;a<pCollection->length;a++) { std::string sValue; IHTMLElementPtr pElem = pCollection->item( a ); . . . }
The code in this loop seems to run very slowly for web pages with large numbers of items. Is this because of using the IE model? I was thinking of replacing this with an HTML parser, which may be quicker. Has anyone got any recommendations for something to use?
How will it navigate pages that return results from a database after a POST? I have used Teleport Pro, which does store ASP/PHP pages. How can I implement that? For example, when I search for a product such as "Intel Celeron", the query is submitted, and the site retrieves the results from its database and displays them. Can that be done without going to the site?
-adnan
MyBlogs http://weblogs.com.pk/kadnan
Be careful when you download content to your hard drive and browse it locally. IE's security model may consider it a trusted website, since it is loading from your local machine. Check the security zone in the lower right corner of the IE window. I'm curious what it says, let me know.
If the site has a nasty script, you could be allowing it to run at a higher privilege.
Hi. Do you know how I can intercept the downloading of an HTML document's files? I want to map files to another URL and, for example, process a downloaded resource before it has been captured by IE.
thanks
Hi Amiri! Would you please elaborate on your question? It's difficult to understand exactly what you want to do. As I understand it, you want to intercept the downloading of files in Internet Explorer and download those files from some other location. If that is the scenario, then to my knowledge a Browser Helper Object (BHO) is the solution. If that is indeed your problem, you can contact me for any help regarding BHOs.
Thanks!
-Attari-
Hi Attari 
Yes, that's almost the scenario you have described. What exactly I want is a light sniffer for IE. I think it's difficult to rip all the resources from an HTML document (because of the script-based nature of some pages). Here you have tried to capture href elements, but I want more.
I have found a solution for this problem by running a TCP listener on a local port, setting the IE proxy to it, and then processing the GET HTTP://... requests. But there is a problem with that: I don't get any "GET ..." parameter for SSL requests.
I would be thankful if you could describe how BHOs work (and their benefits), and also if you know the solution to the problem above.
Thanks Reza
I think it's a very interesting tool, but sometimes you need another scenario to access some pages, such as cookies or posted form data, so the URL alone is not enough.
Actually, downloading requires logging in: you first have to log in to your GMail account before you can download GMail pages. I think a better tool here would be either a POP client or an IMAP client, depending on the protocol implemented by the email server. As far as cookies are concerned, WinInet has a pretty decent collection of functions for working with cookies and the cache (e.g., InternetGetCookie() and InternetSetCookie()); you can use these to modify this code to make it work with cookies.
-Attari-
Small Bug - Hussain Software Solutions, 1:53 23 Mar '05
Hi, how are you? Great work. I have just downloaded your demo project. I gave http://www.google.com as input, but unfortunately it did not copy the whole file and the images. One more thing I want to know: is it possible to view offline contents through your browser? Thanks & Regards, HSS
If you have faith in the cause and the means and in God, the hot Sun will be cool for you.
Hi!
I'm fine. It's working fine. I fear you missed one thing mentioned in the Sample Usage section: I hard-coded the directory where it saves all the files to keep the sample program simple. You need to create the folder C:\MyTemp to make it work, because the directory of the OfflineBrowser is set to C:\MyTemp.
There is no support for browsing offline contents (browsing web sites from the cache) in it for now.
Waiting for your response!
-Attari-
Yes, I have created the C:\MyTemp directory. It is still not copying all the files and images from the website. I mean, it is copying the HTML page, but not fully with all its images and contents. Thanks, HSS
Hi! There is only one image on Google, i.e., hp0.gif, and it is saved in the appropriate folder as described in the article. If I give it the URL http://www.google.com/, these are the files I get: 1) 95d9adb4cd8d4cc683522e3fbf529a4c.htm 2) images/hp0.gif 3) images/hp1.gif 4) images/hp2.gif 5) images/hp3.gif
Please send me a zipped copy of the MyTemp folder at sheraz_attari@Hotmail.com; I'll look at it to figure out the bug.
Thanks!
-Attari-
Sheraz Siddiqi wrote: to make sample program look easy
I don't know how you could say this is easy. Why wouldn't you either create the directory if it's not there, or save to the same folder the .exe is run from?
Yes, you're right. I missed the check you mentioned, mostly because I wrote all of this in a few hours.
Saving to the same folder as the executable doesn't solve the original problem I wrote this code for; that's why I put a Directory member variable inside the class.
-Attari-