Retrieving the HTML source code






Dec 1, 2004

An article on how to retrieve the full source code of a web page.
Introduction
An app I was writing needed to store the full HTML of a web page. I searched the web and the MSDN library for a way to get the complete HTML from a CHtmlView. I found out how to get the <BODY></BODY> data, but not how to get the <HTML></HTML> data. After a lot of stumbling, I hit on the following very simple technique.
Examples of getting the outer HTML of the <BODY> tag abound. While exploring the IHTMLElement interface that IHTMLDocument2::get_body returns, I noticed the get_parentElement method. I realized that the parent of <BODY> is <HTML>.
This function took care of my problem:
bool CMyHtmlView::GetDocumentHTML(CString &str)
{
    // GetHtmlDocument() hands back an AddRef'd IDispatch, or NULL if no document is loaded.
    LPDISPATCH lpDispatch = GetHtmlDocument();
    if (!lpDispatch)
        return false;

    IHTMLDocument2 *lpHtmlDocument = NULL;
    HRESULT hr = lpDispatch->QueryInterface(IID_IHTMLDocument2, (void**)&lpHtmlDocument);
    lpDispatch->Release();
    if (FAILED(hr) || !lpHtmlDocument)
        return false;

    IHTMLElement *lpBodyElm = NULL;
    hr = lpHtmlDocument->get_body(&lpBodyElm);
    lpHtmlDocument->Release();
    if (FAILED(hr) || !lpBodyElm)
        return false;

    // get_body returns everything between <BODY> and </BODY>.
    // I need everything between <HTML> and </HTML>.
    // The parent of BODY is HTML.
    IHTMLElement *lpParentElm = NULL;
    hr = lpBodyElm->get_parentElement(&lpParentElm);
    lpBodyElm->Release();
    if (FAILED(hr) || !lpParentElm)
        return false;

    BSTR bstr = NULL;
    hr = lpParentElm->get_outerHTML(&bstr);
    lpParentElm->Release();
    if (FAILED(hr))
        return false;

    str = bstr;             // CString makes its own copy of the text
    ::SysFreeString(bstr);  // the caller owns the returned BSTR and must free it
    return true;
}
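To see why stepping up one level from <BODY> is enough, here is a minimal sketch built on a toy element tree. It is not MSHTML and does not need Windows to compile. The Element struct, its add and outerHTML members, and the OuterHtmlOfParent and DemoBodyVersusParent helpers are hypothetical names made up for this illustration. They only mimic the idea behind get_parentElement and get_outerHTML.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DOM element. This is NOT the MSHTML IHTMLElement interface.
struct Element {
    std::string tag;                                 // e.g. "HTML", "BODY"
    std::string text;                                // text placed directly inside the element
    Element* parent = nullptr;                       // non-owning link to the parent element
    std::vector<std::unique_ptr<Element>> children;  // owned child elements

    explicit Element(std::string t, std::string txt = "")
        : tag(std::move(t)), text(std::move(txt)) {}

    // Append a child element and return a pointer to it.
    Element* add(std::string childTag, std::string childText = "") {
        children.push_back(std::unique_ptr<Element>(
            new Element(std::move(childTag), std::move(childText))));
        children.back()->parent = this;
        return children.back().get();
    }

    // Serialize this element, its own tags included, plus everything below it.
    std::string outerHTML() const {
        std::string out = "<" + tag + ">" + text;
        for (const auto& child : children)
            out += child->outerHTML();
        return out + "</" + tag + ">";
    }
};

// Same idea as the article's function: start at BODY, step up to its parent,
// and serialize the parent. Returns an empty string if there is no parent.
std::string OuterHtmlOfParent(const Element& body) {
    return body.parent ? body.parent->outerHTML() : std::string();
}

// Builds a tiny page and checks what each serialization contains.
void DemoBodyVersusParent() {
    Element html("HTML");
    html.add("HEAD")->add("TITLE", "Hi");
    Element* body = html.add("BODY", "Hello");

    // Serializing BODY alone loses the HEAD section.
    assert(body->outerHTML() == "<BODY>Hello</BODY>");

    // Serializing BODY's parent brings back the whole tree.
    assert(OuterHtmlOfParent(*body) ==
           "<HTML><HEAD><TITLE>Hi</TITLE></HEAD><BODY>Hello</BODY></HTML>");
}
```

Calling DemoBodyVersusParent() from a test or from main runs both checks. The point of the model is only that an element's outer HTML covers its whole subtree, so the element one level above <BODY> already contains <HEAD> and everything else.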
Points of Interest
There is bound to be a better way of doing this. If you know it, please share it with me.