Retrieving the HTML source code

By Geno Carman, 30 Nov 2004
 

Introduction

An app I was writing needed to store the full HTML of a web page. I searched the web and the MSDN library for a way to get the complete HTML from a CHtmlView. I found out how to get the <BODY></BODY> data, but not the <HTML></HTML> data. After a lot of stumbling, I hit on the following very simple technique.

Examples of getting the outer HTML of the <BODY> tag abound. While exploring the IHTMLElement interface (the type that IHTMLDocument2::get_body returns), I noticed the get_parentElement method, and realized that the parent of <BODY> is <HTML>.

This function took care of my problem:

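bool CMyHtmlView::GetDocumentHTML(CString &str)
{
    // GetHtmlDocument() hands back an IDispatch that the caller must Release().
    LPDISPATCH lpDispatch = GetHtmlDocument();
    if (!lpDispatch)
        return false;

    IHTMLDocument2 *lpHtmlDocument = NULL;
    HRESULT hr = lpDispatch->QueryInterface(IID_IHTMLDocument2, (void**)&lpHtmlDocument);
    lpDispatch->Release();
    if (FAILED(hr) || !lpHtmlDocument)
        return false;

    // get_body returns the element covering <BODY> ... </BODY>.
    // I need everything between <HTML> and </HTML>.
    IHTMLElement *lpBodyElm = NULL;
    hr = lpHtmlDocument->get_body(&lpBodyElm);
    lpHtmlDocument->Release();
    if (FAILED(hr) || !lpBodyElm)
        return false;   // the body may be NULL, e.g. before the page has loaded

    // The parent of BODY is HTML.
    IHTMLElement *lpParentElm = NULL;
    hr = lpBodyElm->get_parentElement(&lpParentElm);
    lpBodyElm->Release();
    if (FAILED(hr) || !lpParentElm)
        return false;

    BSTR bstr = NULL;
    hr = lpParentElm->get_outerHTML(&bstr);
    lpParentElm->Release();
    if (FAILED(hr))
        return false;

    str = bstr;             // CString copies the BSTR contents
    ::SysFreeString(bstr);  // get_outerHTML allocates the BSTR; the caller frees it

    return true;
}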
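
The idea the function relies on (serialize the parent of BODY instead of BODY itself) can be sketched without MFC or COM. The block below is only an illustration under stated assumptions: `Element`, `Add`, `OuterHTML`, and `BuildSamplePage` are hypothetical names invented here, not part of MSHTML, and the toy serializer ignores attributes and escaping.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM element; NOT the MSHTML IHTMLElement interface.
struct Element {
    std::string tag;
    std::string text;                                // leading text content, if any
    Element* parent = nullptr;                       // non-owning back pointer
    std::vector<std::unique_ptr<Element>> children;  // owned child elements

    explicit Element(std::string t, std::string txt = "")
        : tag(std::move(t)), text(std::move(txt)) {}

    // Append a child, set its parent pointer, and return the child.
    Element* Add(std::unique_ptr<Element> child) {
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }

    // Serialize this element including its own tags, in the spirit of outerHTML.
    std::string OuterHTML() const {
        std::string out = "<" + tag + ">" + text;
        for (const auto& c : children) out += c->OuterHTML();
        return out + "</" + tag + ">";
    }
};

// Build <HTML><HEAD><TITLE>t</TITLE></HEAD><BODY>hello</BODY></HTML>.
// Returns the root; *body receives a pointer to the BODY element.
std::unique_ptr<Element> BuildSamplePage(Element** body) {
    assert(body != nullptr);
    auto html = std::make_unique<Element>("HTML");
    Element* head = html->Add(std::make_unique<Element>("HEAD"));
    head->Add(std::make_unique<Element>("TITLE", "t"));
    *body = html->Add(std::make_unique<Element>("BODY", "hello"));
    return html;
}
```

Calling `OuterHTML()` on the BODY element yields only the body markup, while calling it on `body->parent` yields the whole page, which mirrors what `get_parentElement` followed by `get_outerHTML` does in the function above.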

Points of Interest

There is bound to be a better way of doing this. If you know it, please share it with me.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Geno Carman
United States


Comments and Discussions

My method (HughJampton, 10 May '05 - 2:24)
This works in VC6:
 
CString htmlstring;
CString buffer;
CInternetSession isess;
CHttpFile* f = (CHttpFile*) isess.OpenURL("http://test.com/test.htm");
while(f->ReadString(buffer))
{
    htmlstring += buffer;
}

 
The full source code for the page is now in a CString.
 
Hugh Jampton
Re: My method (RancidCrabtree, 12 May '05 - 12:29)
Your method is much neater than mine. I appreciate your taking the time to show me.
 
I implemented yours, but discovered a memory leak. It is probably so obvious that you assumed I would do it, but the file needs to be closed, and the file pointer f needs to be destroyed.
 
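bool MyApp::GetDocumentHTML(CString &str)
{
    // From Hugh Jampton
    CString buffer;
    CInternetSession isess;
    CHttpFile* f = (CHttpFile*)isess.OpenURL(m_Browser.GetLocationURL());
    if (!f)
        return false;   // OpenURL returns NULL if the URL could not be parsed

    while(f->ReadString(buffer))
    {
        str += buffer;
    }

    // Close the file and destroy the object to avoid the leak.
    f->Close();
    delete f;
    return true;
}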
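
One detail worth checking in the loop above: as far as I know, the CString overload of ReadString drops the trailing newline from each line, so plain concatenation puts the whole page on a single line. The portable sketch below shows the same effect with std::getline, which also discards the line terminator. `JoinLines` and `JoinLinesExample` are hypothetical names used only for this illustration. In the MFC loop, the equivalent change would be appending "\n" after each `str += buffer;`.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read a stream line by line and rebuild the text.
// std::getline drops the line terminator, so it is added back after each
// line when keepNewlines is true. Note that a final line without a
// terminator also gets one appended in that mode.
std::string JoinLines(std::istream& in, bool keepNewlines) {
    std::string result;
    std::string line;
    while (std::getline(in, line)) {
        result += line;
        if (keepNewlines) result += '\n';
    }
    return result;
}

// Small self-check showing both behaviours on the same input.
void JoinLinesExample() {
    std::istringstream collapsed("<html>\n<body>\n");
    assert(JoinLines(collapsed, false) == "<html><body>");

    std::istringstream preserved("<html>\n<body>\n");
    assert(JoinLines(preserved, true) == "<html>\n<body>\n");
}
```

Whether the collapsed form matters depends on the page: markup usually survives it, but script blocks with line comments and preformatted text generally do not.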

Re: My method (Sam NG, 17 Apr '06 - 23:29)
It is not quite the same.
 
CHtmlView will handle HTTP (302) redirects, JavaScript/VBScript redirects, onload handlers, etc.
 
Unless you are sure the page you want is exactly at that URL (which is usually not the case unless your site is static), using CHtmlView will be closer to WYSIWYG.
One-line way of doing it (jocool, 13 Dec '04 - 3:37)
Instead of using the HTMLView or other tricks, why not simply download the whole HTML file?
 
Add this line to your code:
 
URLDownloadToFile(0, "http://www.ANYTHING.com/index.html", "c:\\test.html", 0, 0);
 
and you will have the HTML file on your HD.
 
Jo
 
JoCool
Re: One-line way of doing it (RancidCrabtree, 13 Dec '04 - 17:47)
That's fine for getting the HTML onto disk, but I wanted access to it in my program.
My article erroneously states that I "needed to store the full HTML of a web page". In fact, my app modifies it and displays the modified copy.
Writing to disk and then reading from it opens the door for a lot of needless problems.
The way I do it (Uwe Keim, 30 Nov '04 - 19:51)
This is the way I do it in a project of mine (for www.zeta-producer.com, to be exact ;-) ).
 
I don't know exactly, but I recall that I copied something from the CHtmlEditCtrl class of MFC.
 
const CString CMyHtmlView::GetDocumentHtml() const
{
	MsHtml::IHTMLDocument2Ptr doc = GetHtmlDocument();
	IPersistStreamInitPtr stream = doc;
  
	// From AFXHTML.H.
	CStreamOnCString sstream;
  
	stream->Save( static_cast<IStream*>(&sstream), false );
  	
	CString result;
	VERIFY(sstream.CopyData( result ));
  
	return result;
}
Maybe this helps you.
 
--
Affordable Windows-based CMS: www.zeta-producer.com
 

Re: The way I do it (RancidCrabtree, 1 Dec '04 - 7:38)
Looks good, and I appreciate the feedback.
I'm still using Visual Studio 6.0, which does not provide the CStreamOnCString class.
Re: The way I do it (Uwe Keim, 1 Dec '04 - 17:03)
On VC 6, I simply copied the CStreamOnCString class from the MFC 7 library into my project to use it :)
Last Updated 1 Dec 2004
Article Copyright 2004 by Geno Carman