Loading and parsing HTML using MSHTML. 3rd way.

11 May 2002 CPOL
Explains how to load HTML code from memory and parse it using MS technologies

Introduction

First, let me explain why I called this article "3rd Way". I've already seen articles on CodeGuru explaining how to load and parse an HTML file from memory. You may ask: so why am I writing another guide? Well, below I'll show the advantages and disadvantages that I found in those approaches.

The first one, which is also shown in MSDN, is to load HTML code using the IStream interface. You can read the article about it here. If all you want is to put new code into your document, you should definitely use this one. But if you try to get tags from your document right after you load the HTML, you will get nothing, because the document is still being parsed. You have to create an OnDocumentComplete handler and only then start to look inside your document.

When I realized this, I went looking for another way that would give me the document immediately after submitting the code. And yes, I found it! You can look at the great article by Asher Kobin at CodeGuru. It uses a new interface called IMarkupServices, introduced with MS Internet Explorer 5.0. I picked up this code, made my own version of it and started using it... but suddenly I saw that when I saved my document to disk, the BODY tag had no attributes! I worked on this problem for a whole day, trying to get it working, but... nothing. When you load your HTML code from memory into a document, all attributes of the BODY tag are gone. I still have no idea why this happens and will be glad if someone can tell me.

So I went to MSDN again and found another, third way to load and parse HTML. I was so happy that I decided to write my first article for CodeProject about it, which you are reading now :)

Code

For advanced programmers who don't want to read the whole article, here is a hint: the HTML code is loaded with the write() method of the IHTMLDocument2 interface.

Now I'll explain how to do this from the beginning.

Headers and imports

I'll assume here that you have a standard MFC application (such as a Dialog, SDI or MDI application). First of all you have to initialize COM, since we are going to use MSHTML COM interfaces. This can be done in the InitInstance() function of your application. Remember also to uninitialize COM in your ExitInstance():

BOOL CYourApp::InitInstance()
{
	CoInitialize(NULL);
	...
}
int CYourApp::ExitInstance() 
{
	...
	CoUninitialize();
	return CWinApp::ExitInstance();
}

Now, in the file where you are going to use the MSHTML interfaces, include mshtml.h and comdef.h (for smart pointers) and import mshtml.tlb:

#include <comdef.h>
#include <mshtml.h>
#pragma warning(disable : 4146)	//see Q231931 for explanation
#import <mshtml.tlb> no_auto_exclude

Where do I get a document?

Now let's get a pointer to the IHTMLDocument2 interface. How do you get it? That depends on what you already have :) If you are hosting a WebBrowser control or using CHtmlView in your application, you can call the GetDocument() function and store the return value in your pointer. But I will explain how to get a 'free' document, which is not attached to any control or view. This can be done by a simple call to the CoCreateInstance() function:

MSHTML::IHTMLDocument2Ptr pDoc;
HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, 
                              IID_IHTMLDocument2, (void**)&pDoc);

Check that the call succeeded (hr) and that you have a valid pointer (not NULL), then move on.

Converting your HTML code

I'll assume that you have all the HTML code you want to load in a variable called lpszHTMLCode. This can be a CString or any other buffer, loaded for example from a file on disk. We need to prepare it before passing it to MSHTML. The problem is that the MSHTML function we are going to use takes only a SAFEARRAY as its parameter. So let's convert our string to a SAFEARRAY:

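SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT *param;
bstr_t bsData = (LPCTSTR)lpszHTMLCode;
hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
param->bstrVal = bsData.copy();	//the array owns this copy; SafeArrayDestroy() frees it
hr = SafeArrayUnaccessData(psa);	//unlock the array, a locked array cannot be destroyed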

Last jump

Now we are ready to pass our SAFEARRAY to the write() function. These 2 lines of code will do all the dirty parsing work for you:

hr = pDoc->write(psa);	//write your buffer
hr = pDoc->close();	//and close the document, "applying" your code

//Don't forget to free the SAFEARRAY!
SafeArrayDestroy(psa);

Of course, remember to check every step so that your program never crashes; I skipped the checks to keep the code simple.

Now, after all this work, you have a pointer to the IHTMLDocument2 interface, which gives you a lot of features, such as getting a particular tag, searching, inserting, replacing and deleting tags, just like you do in JavaScript.

And remember, if you are using smart pointers (like I do here) you don't need to call the Release() function; the object will be freed automatically.

"about:blank" bug workaround

Well, since we have no site "attached" to our document interface, all links (href, src) that are relative to the document will start with "about:blank" if you try to use the IHTMLAnchorElement::href property. The way to get the exact link, as it is in the HTML source, is to use the IHTMLElement interface with a nice function called getAttribute. Just remember that the second parameter of this function should be 2; this tells the parser to return the text as is.

Of course, you should work the same way with IMG, LINK and other tags. The example project has also been updated with this fix. You can download it and see how I did it.

References

Asher Kobin's article about parsing with IMarkupServices (CodeGuru)
Load HTML from Stream (MSDN)
MSHTML Reference (MSDN)
IHTMLDocument2 Reference (MSDN)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Philip Patrick
Team Leader, Varonis
Israel
I was born in the small town of Penza, Russia, on October 13th, 1975, so my mother tongue is Russian. I finished school there and studied at university, then I came to Israel, and I have lived there (or here *s*) ever since.
By profession I am a C++ programmer on MS Windows platforms, but my hobby is Web development and ASP programming.

I became interested in computers and programming somewhere around 1990-1991, when my father brought home our first computer, a Sinclair ZX Spectrum (he built it himself). So I learned Basic and joined the Basic programmers' club at my school (my friend and I were the only 2 guys from the whole school there, lol). After I finished school (1992) I decided to continue my studies at university and specialized in Operating Systems and Software Engineering. Although I still like my profession, I always wanted something new, so I learned HTML, JavaScript and ASP, which turned out to be my hobby :)

Comments and Discussions

 
IHTMLDocument2::write documentation
Sam Hobbs, 23-Jul-07 8:20

Re: IHTMLDocument2::write documentation
Philip Patrick, 23-Jul-07 9:39
Doh, it was written 5 years ago. It took me some time to remember why I posted it :-D
Back then there were lots of questions on how to quickly parse HTML text. The sample on MSDN that you linked shows how to write HTML into an HTMLDocument, so not many people found it when searching for parsing options. I simply made the link between "parsing" and "document.write". Today it may be an obvious solution, but that wasn't the case 5 years ago.

BTW, the sample on MSDN didn't work because "->close()" is missing after "write()". I'm not sure whether it was a bug in HTMLDocument or a bug in the sample. The

Philip Patrick
Web-site: www.stpworks.com
"Two beer or not two beer?" Shakesbeer

Last Updated 12 May 2002
Article Copyright 2002 by Philip Patrick