Prize winner in Competition "MFC/C++ Feb 2004"
Introduction
After an unsuccessful search for a class library that can read HTML text from either in-memory string buffers or physical disk files, I decided that there was a real need for one. Many parsers are available for XML (eXtensible Markup Language), for instance, the Simple API for XML (SAX), which lets you parse XML simply by handling the events that the reader generates as it parses specific symbols from the given XML document.
Inspired by the SAX parser for XML, I decided to develop an HTML Reader C++ Class Library from scratch that offers a simple, lightweight, fast and, most importantly, low-overhead solution for processing an HTML document. Like SAX, it is an events-based parser, which raises events as it encounters various elements in the document. The advantage of an events-based parser is that the reader reads a section of an HTML document, generates an event, and then moves on to the next section. Because it never holds the whole document in memory, it uses less memory and is better suited to processing large documents.
Events-Based Parser
An events-based parser uses a callback mechanism to report parsing events. In this library, the callbacks are protected virtual member functions that you override. Events, such as the detection of the opening or closing tag of an element, trigger a call to the corresponding member function of your class. The application implements an event handler and registers it with the reader. It is up to the application to put code in the event handlers that achieves the application's objective. Events-based parsers provide simple, fast, lower-level access to the document being parsed.
Events-based parsers do not create an in-memory representation of the source document. They simply parse the document and notify client applications about various elements they find along the way. What happens next is the responsibility of the client application. Events-based parsers don't cache information and have an enviably small memory footprint.
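To make the callback idea concrete, here is a minimal, self-contained sketch of an events-based scanner. It is not the library's code: `TagEvents`, `scan`, and `EventLog` are hypothetical names, `std::string` stands in for `CString`, and attributes, comments, and entities are ignored to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface: every callback has a do-nothing default,
// so a client overrides only the events it cares about.
struct TagEvents {
    virtual ~TagEvents() = default;
    virtual void StartTag(const std::string& /*name*/) {}
    virtual void EndTag(const std::string& /*name*/) {}
    virtual void Characters(const std::string& /*text*/) {}
};

// Toy scanner: walks the input once and reports each piece as it is found.
// It builds no tree and keeps no history, only the current position.
inline void scan(const std::string& html, TagEvents& events) {
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] == '<') {
            std::size_t close = html.find('>', pos);
            if (close == std::string::npos) {      // unterminated tag: report as text
                events.Characters(html.substr(pos));
                return;
            }
            std::string inner = html.substr(pos + 1, close - pos - 1);
            if (!inner.empty() && inner[0] == '/')
                events.EndTag(inner.substr(1));
            else
                events.StartTag(inner);
            pos = close + 1;
        } else {
            std::size_t next = html.find('<', pos);
            if (next == std::string::npos) next = html.size();
            events.Characters(html.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Example client: records the order in which events arrive.
struct EventLog : TagEvents {
    std::vector<std::string> entries;
    void StartTag(const std::string& name) override { entries.push_back("start:" + name); }
    void EndTag(const std::string& name) override { entries.push_back("end:" + name); }
    void Characters(const std::string& text) override { entries.push_back("text:" + text); }
};
```

Scanning `<b>hi</b>!` with an `EventLog` records four entries in document order: a start tag, the text `hi`, an end tag, and the trailing text. What to do with each event is left entirely to the client.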
Files
To use the HTML Reader Class Library in your MFC application project, you will need to add the following files to your project:
| Header File | Source File | Class |
| LiteHTMLReader.h | LiteHTMLReader.cpp | CLiteHTMLReader |
| LiteHTMLTag.h | - | CLiteHTMLTag |
| LiteHTMLAttributes.h | - | CLiteHTMLAttributes |
| LiteHTMLAttributes.h | LiteHTMLAttributes.cpp | CLiteHTMLElemAttr |
| LiteHTMLEntityResolver.h | LiteHTMLEntityResolver.cpp | CLiteHTMLEntityResolver |
NOTE: LiteHTMLCommon.h must also be included in your project.
Brief Description of Classes
- CLiteHTMLReader is the main class of the library. It works in conjunction with the other CLiteHTML* classes to parse the given HTML document. It contains methods (Read and ReadFile) that start the parsing process and operate on either an in-memory string buffer or a physical disk file. CLiteHTMLReader allows you to trap events that the reader generates as it finds various elements in the document, such as the start of a tag, the end of a tag, an HTML comment, etc. To handle these events, your application must define a class that implements the ILiteHTMLReaderEvents interface declared in the LiteHTMLReader.h file.
- The CLiteHTMLTag class, as its name implies, is related to HTML tags. It deals with parsing and storing tag information from the given string, such as the name of the tag and its attributes/properties. It provides a method named parseFromStr (as do the attribute classes listed in the table above) that is called by the CLiteHTMLReader class' Read and ReadFile methods as the document is being parsed. Typically, CLiteHTMLTag is not used directly by your application. It works in conjunction with the reader, helping to parse HTML tags.
- The CLiteHTMLElemAttr and CLiteHTMLAttributes classes are related: CLiteHTMLAttributes provides a collection-based mechanism to hold an array of CLiteHTMLElemAttr objects that are accessible either by the name of the attribute or by a zero-based index value. As with the CLiteHTMLTag class, these classes are not typically used directly by your application.
- The last is the CLiteHTMLEntityResolver class, which helps resolve entity references. Entity references are numeric or symbolic names for characters that may be included in an HTML document. They are useful for referring to rarely used characters, or those that authoring tools make difficult or impossible to enter. Entity references begin with an ampersand (&) and end with a semicolon (;). Some common examples are &lt;, representing the < sign, and &gt;, representing the > sign.
From the above discussion, one thing is clear: the CLiteHTMLTag, CLiteHTMLAttributes, and CLiteHTMLElemAttr classes each provide a method named parseFromStr, to which CLiteHTMLReader delegates part of the parsing while reading an HTML document.
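As an illustration of what resolving an entity reference involves, the sketch below decodes a handful of named references and decimal numeric references in the ASCII range. It is a stand-alone approximation, not the library's CLiteHTMLEntityResolver: the function names `resolveEntity` and `decodeEntities`, the four-entry table, and the ASCII-only limit are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical helper. Tries to resolve one entity reference starting at
// text[pos] (which must be '&'). On success stores the character in `out`
// and returns the number of input characters consumed; returns 0 otherwise.
inline std::size_t resolveEntity(const std::string& text, std::size_t pos, char& out) {
    static const std::map<std::string, char> named = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"quot", '"'}
    };
    if (pos >= text.size() || text[pos] != '&') return 0;
    std::size_t semi = text.find(';', pos);
    if (semi == std::string::npos || semi == pos + 1) return 0;
    std::string body = text.substr(pos + 1, semi - pos - 1);
    if (body[0] == '#') {
        // Numeric reference such as &#65; (decimal and ASCII only in this sketch).
        if (body.size() < 2) return 0;
        for (std::size_t i = 1; i < body.size(); ++i)
            if (body[i] < '0' || body[i] > '9') return 0;
        long code = std::strtol(body.c_str() + 1, nullptr, 10);
        if (code <= 0 || code > 127) return 0;
        out = static_cast<char>(code);
    } else {
        std::map<std::string, char>::const_iterator it = named.find(body);
        if (it == named.end()) return 0;
        out = it->second;
    }
    return semi - pos + 1;
}

// Replaces every resolvable reference in `text`; anything else is copied as is.
inline std::string decodeEntities(const std::string& text) {
    std::string result;
    for (std::size_t i = 0; i < text.size();) {
        char ch = 0;
        std::size_t used = (text[i] == '&') ? resolveEntity(text, i, ch) : 0;
        if (used > 0) { result += ch; i += used; }
        else          { result += text[i]; ++i; }
    }
    return result;
}
```

Unknown or malformed references are left untouched rather than dropped, which is one reasonable policy for a lenient HTML reader; the library may behave differently.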
Usage
OK, now let's learn how to use this library in an MFC project:
- The first step is pretty simple. All you have to do is add all of the files (listed in the Files section above) to your project.
- The second step, although optional, is to create a class that implements the ILiteHTMLReaderEvents interface. ILiteHTMLReaderEvents is actually an abstract class that acts as an interface, which must be implemented by every class that needs to handle events raised by the CLiteHTMLReader class. For example,
#include "stdafx.h"
#include "LiteHTMLReader.h"

class CEventHandler : public ILiteHTMLReaderEvents
{
private:
    void BeginParse(DWORD dwAppData, bool &bAbort);
    void StartTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
    void EndTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
    void Characters(const CString &rText, DWORD dwAppData, bool &bAbort);
    void Comment(const CString &rComment, DWORD dwAppData, bool &bAbort);
    void EndParse(DWORD dwAppData, bool bIsAborted);
};
You must have noticed the word "optional" above. The reason is that if you do not provide your own implementation of an event handler, the ILiteHTMLReaderEvents class provides a default implementation that does nothing. To learn more about the ILiteHTMLReaderEvents interface, see the More About Event Handling section of this article.
- The third step is to create an instance of the CLiteHTMLReader class like this:
CLiteHTMLReader theReader;
- Should we now call either the Read or the ReadFile method of the CLiteHTMLReader class?
No! Our event handler implementation will not start receiving notifications until we register it with the reader by calling the setEventHandler method of the CLiteHTMLReader class. So, supposing that the class implementing the ILiteHTMLReaderEvents interface is named CEventHandler, the fourth step is to create an instance of CEventHandler and call setEventHandler, passing it the address of this instance.
CEventHandler theEventHandler;
theReader.setEventHandler(&theEventHandler);
For those wondering whether it is possible to pass a NULL pointer to the setEventHandler method: yes, and at any time you want. You can also change the event handler at any time by calling setEventHandler and passing the address of some other instance.
- The fifth and final step is to call either the Read or the ReadFile method on the CLiteHTMLReader instance created in step 3, passing it the appropriate parameter. To parse an in-memory string buffer, call the Read method and pass the address of the string you want to parse. To parse an HTML document from a disk file, call ReadFile, which is similar to Read but accepts a file handle (HANDLE) instead of a pointer to an array of characters. Take a look at the example:
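// Each literal is wrapped in _T() so the string also builds under UNICODE,
// and the quotes around the attribute value are escaped.
TCHAR strToParse[] = _T("<HTML>")
                     _T("<HEAD>")
                     _T("<TITLE>")
                     _T("<!-- title goes here -->")
                     _T("</TITLE>")
                     _T("</HEAD>")
                     _T("<BODY LEFTMARGIN=\"15px\">This is a sample HTML document.</BODY>")
                     _T("</HTML>");
theReader.Read(strToParse);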
OR
CFile fileToParse;
if (fileToParse.Open(_T("test.html"), CFile::modeRead))
{
    theReader.ReadFile(fileToParse.m_hFile);
    fileToParse.Close();
}
More About Event Handling
The ILiteHTMLReaderEvents class presents an interface that must be implemented by all those classes that want to handle the notifications sent by the CLiteHTMLReader while parsing an HTML document. The order of events handled by the ILiteHTMLReaderEvents handler is determined by the order of information within the document being parsed. It's important to note that the interface includes a series of methods that the CLiteHTMLReader invokes during the parsing operation. The reader passes the appropriate information to the method's parameters. To perform some type of processing for a method, you simply add code to the method in your own ILiteHTMLReaderEvents implementation.
The common parameters received by all of the methods defined in the ILiteHTMLReaderEvents class, except EndParse, are:
dwAppData: A 32-bit application-specific value.
bAbort: Set this parameter to true to make the reader abort immediately after the current event handler returns, or leave it false to let the reader continue parsing the rest of the data in the buffer.
The EndParse method receives a bIsAborted parameter instead of bAbort; it indicates whether parsing ended because it was aborted or because it terminated normally.
Along with the above-specified parameters, all of the methods except BeginParse and EndParse receive some extra information (specific to the event) that the reader retrieves while parsing the HTML document. For instance, when an HTML tag (either opening or closing) is parsed, the StartTag or EndTag method receives a pointer to a CLiteHTMLTag that contains the name of the tag and the attributes (if any) of the tag. Attribute information is retrieved only if the tag parsed is an opening tag, as closing tags cannot contain any attribute/value pairs. If there is no attribute information associated with a CLiteHTMLTag, its attribute pointer (the value returned by getAttributes) is NULL, so the tag passed to EndTag always has a NULL attribute pointer. It is the responsibility of the application (and good programming practice) to check a pointer for NULL before using it.
Similarly, the Comment and Characters methods each receive a reference to a CString containing the extracted text. The Comment method receives an rComment parameter containing the comment text excluding the delimiters, i.e., without <!-- and -->. The Characters method receives an rText parameter that holds either the contents of an element or some text that could not be parsed by the reader.
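The bAbort/bIsAborted protocol described above can be sketched as follows. This is only an illustration of the contract, not the reader's actual loop: `Events`, `dispatch`, `StopAtWord`, and the `AppData` typedef are hypothetical stand-ins, and the "document" is assumed to be already split into text chunks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long AppData;   // stand-in for the DWORD used by the library

// Stand-in for the handler interface; only what the protocol needs.
struct Events {
    virtual ~Events() = default;
    virtual void BeginParse(AppData, bool& /*bAbort*/) {}
    virtual void Characters(const std::string&, AppData, bool& /*bAbort*/) {}
    virtual void EndParse(AppData, bool /*bIsAborted*/) {}
};

// Hypothetical driver loop. After every callback the flag is checked; once
// it is set, no further events are raised except the final EndParse, which
// reports whether the parse was cut short.
inline void dispatch(const std::vector<std::string>& chunks, Events& handler, AppData appData) {
    bool bAbort = false;
    handler.BeginParse(appData, bAbort);
    for (std::size_t i = 0; !bAbort && i < chunks.size(); ++i)
        handler.Characters(chunks[i], appData, bAbort);
    handler.EndParse(appData, bAbort);
}

// Example handler: stops the parse as soon as it sees the word it wants.
struct StopAtWord : Events {
    std::string wanted;
    int seen = 0;
    bool aborted = false;
    explicit StopAtWord(const std::string& w) : wanted(w) {}
    void Characters(const std::string& text, AppData, bool& bAbort) override {
        ++seen;
        if (text == wanted) bAbort = true;   // ask the driver to stop after this event
    }
    void EndParse(AppData, bool bIsAborted) override { aborted = bIsAborted; }
};
```

With the chunks "a", "b", "c" and a handler looking for "b", the handler sees two Characters calls and then an EndParse with bIsAborted set to true. Aborting early is useful when the application only needs the first match in a large document.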
Class View
CLiteHTMLReader Class Members
| Member | Description |
| CLiteHTMLReader() | Constructs a CLiteHTMLReader object. |
| EventMaskEnum setEventMask(DWORD); | Sets a new event mask. |
| EventMaskEnum setEventMask(DWORD, DWORD); | Changes the current event mask by adding and/or removing flags. |
| EventMaskEnum getEventMask(void) const; | Returns the event mask previously set by a call to setEventMask. |
| DWORD setAppData(DWORD); | Sets application-specific data to be passed to event handlers. |
| DWORD getAppData(void) const; | Returns app-specific data previously set by a call to setAppData. |
| ILiteHTMLReaderEvents* setEventHandler(ILiteHTMLReaderEvents*); | Registers an event handler with the reader. |
| ILiteHTMLReaderEvents* getEventHandler(void) const; | Returns the currently associated event handler. |
| UINT Read(LPCTSTR); | Parses an HTML document from the specified string. |
| UINT ReadFile(HANDLE); | Parses an HTML document from a file given its HANDLE. |
CLiteHTMLTag Class Members
| Member | Description |
| CLiteHTMLTag() | Constructs a CLiteHTMLTag object. |
| CLiteHTMLTag(CLiteHTMLTag&, bool) | Constructs a CLiteHTMLTag object from an existing instance. The first parameter is the reference to a source CLiteHTMLTag, and the second parameter determines whether to make a copy or to take ownership of the encapsulated CLiteHTMLAttributes pointer. |
| ~CLiteHTMLTag() | Destroys a CLiteHTMLTag object. |
| CString getTagName(void) const; | Returns the name of the tag. |
| const CLiteHTMLAttributes* getAttributes(void) const; | Returns a pointer to an attribute collection associated with this CLiteHTMLTag. |
| UINT parseFromStr(LPCTSTR, bool&, bool&, bool); | Parses an HTML tag from the string specified by the first parameter. The second and third parameters receive a boolean true/false indicating that the tag parsed is an opening and/or closing tag, respectively. The fourth parameter specifies whether to parse the tag's attributes as well. |
CLiteHTMLAttributes Class Members
| Member | Description |
| CLiteHTMLAttributes() | Constructs a CLiteHTMLAttributes object. |
| CLiteHTMLAttributes(CLiteHTMLAttributes&, bool) | Constructs a CLiteHTMLAttributes object from an existing instance. The first parameter is the reference to a source CLiteHTMLAttributes, and the second parameter determines whether to make a copy or to take ownership of the encapsulated pointer. |
| ~CLiteHTMLAttributes() | Destroys a CLiteHTMLAttributes object. |
| int getCount() const; | Returns the count of CLiteHTMLElemAttr items. |
| int getIndexFromName(LPCTSTR) const; | Looks up the index of an attribute given its name. |
| CLiteHTMLElemAttr operator[](int) const; | Returns a CLiteHTMLElemAttr object given an attribute's index. |
| CLiteHTMLElemAttr getAttribute(int) const; | Returns a CLiteHTMLElemAttr object given an attribute's index. |
| CLiteHTMLElemAttr operator[](LPCTSTR) const; | Returns a CLiteHTMLElemAttr object given an attribute name. |
| CLiteHTMLElemAttr getAttribute(LPCTSTR) const; | Returns a CLiteHTMLElemAttr object given an attribute name. |
| CString getName(int) const; | Returns the name of an attribute given its index. |
| CString getValue(int) const; | Returns the value of an attribute given its index. |
| CString getValueFromName(LPCTSTR) const; | Returns the value of an attribute given its name. |
| CLiteHTMLElemAttr* addAttribute(LPCTSTR, LPCTSTR); | Adds a new CLiteHTMLElemAttr item to the collection. |
| bool removeAttribute(int); | Removes a CLiteHTMLElemAttr item from the collection. |
| bool removeAll(void); | Removes all CLiteHTMLElemAttr items from the collection. |
| UINT parseFromStr(LPCTSTR); | Parses attribute/value pairs from the given string. |
CLiteHTMLElemAttr Class Members
| Member | Description |
| CString getName(void) const; | Returns the name of a CLiteHTMLElemAttr. |
| CString getValue(void) const; | Returns the value of a CLiteHTMLElemAttr. |
| bool isColorValue(void) const; | Determines if the attribute value contains a color reference. |
| bool isNamedColorValue(void) const; | Determines if the attribute value is a named color value. |
| bool isSysColorValue(void) const; | Determines if the attribute value is a named system color value. |
| bool isHexColorValue(void) const; | Determines if the attribute value is a color value in hexadecimal format. |
| bool isPercentValue(void) const; | Checks to see if the attribute contains a percent value. |
| COLORREF getColorValue(void) const; | Returns the color value of the attribute. |
| CString getColorHexValue(void) const; | Returns the RGB value of the attribute in hexadecimal format. |
| unsigned short getPercentValue() const; | Returns a percent value of the attribute. |
| short getLengthValue(LengthUnitsEnum&) const; | Returns a length value of the attribute. |
| operator bool() const; | Converts attribute value to bool. |
| operator BYTE() const; | Converts attribute value to BYTE (unsigned char). |
| operator double() const; | Converts attribute value to double. |
| operator short() const; | Converts attribute value to signed short int. |
| operator LPCTSTR() const; | Returns the value of the attribute. |
| UINT parseFromStr(LPCTSTR); | Parses an attribute/value pair from the given string. |
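To give a feel for the value tests listed for CLiteHTMLElemAttr, here is a rough, stand-alone sketch of what a hexadecimal color check, a conversion to the 0x00BBGGRR layout used by the Win32 COLORREF type, and a percent check might look like. The function names are hypothetical and the rules are simplified assumptions (only the "#RRGGBB" form, only unsigned integer percentages); the library's own implementation may accept more forms.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical free functions; not the library's member functions.

// True for "#RRGGBB": a '#' followed by exactly six hexadecimal digits.
inline bool looksLikeHexColor(const std::string& value) {
    if (value.size() != 7 || value[0] != '#') return false;
    for (std::size_t i = 1; i < value.size(); ++i)
        if (!std::isxdigit(static_cast<unsigned char>(value[i]))) return false;
    return true;
}

// Packs "#RRGGBB" the way the Win32 RGB() macro does: 0x00BBGGRR.
// The caller is expected to check looksLikeHexColor() first.
inline unsigned long hexColorToColorRef(const std::string& value) {
    unsigned long rgb = std::stoul(value.substr(1), nullptr, 16);   // 0xRRGGBB
    unsigned long r = (rgb >> 16) & 0xFF;
    unsigned long g = (rgb >> 8) & 0xFF;
    unsigned long b = rgb & 0xFF;
    return r | (g << 8) | (b << 16);
}

// True for values such as "50%": one or more digits followed by '%'.
inline bool looksLikePercent(const std::string& value) {
    if (value.size() < 2 || value.back() != '%') return false;
    for (std::size_t i = 0; i + 1 < value.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(value[i]))) return false;
    return true;
}
```

Note the byte order: the red component ends up in the lowest byte of the packed value, which is why a string written as RRGGBB cannot simply be parsed as a number and used as a COLORREF.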
License
This code may be used in compiled form in any way you desire (including commercial use). The code may be redistributed unmodified by any means providing it is not sold for profit without the authors written consent, and providing that this notice and the authors name and all copyright notices remains intact. However, this file and the accompanying source code may not be hosted on a website or bulletin board without the authors written permission.
This software is provided "AS IS" without express or implied warranty. The author accepts no liability for any damage/loss of business that this product may cause. Use it at your own risk!
|
Hi,
I have downloaded your lib and added it to my MFC project. I get over 306 errors with code C2679 when I compile the project in Visual Studio 2005. They are all in the file LiteHTMLAttributes.h. The compiler has problems with lines beginning with
_namedColors["something"] =
It says: "error C2679: binary operator '[' : no operator found which takes a right-hand operand of type 'const char [13]' (or there is no acceptable conversion)"
I also get some warnings of type C4244 in the file LiteHTMLEntityResolver.h:
warning C4244: 'return' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 229
warning C4244: 'Argument' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 236
warning C4244: 'return' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 280
Please give me a solution for these warnings/errors and correct the code in the library.
Best regards
|
will not get the right attribute of the a tag because of the "//". This bug occurs in the function parseFromStr in the file LiteHTMLAttributes.h; please check it.
|
First you must apply my previous fixes in order to have a class that works correctly with this code.
This is a nice feature to have when using this class; after a few hours of research I found a solution.
Suppose we need the tag <div class=myclass>, which lies deep inside an HTML page.
1. We have to implement CEventHandler : public ILiteHTMLReaderEvents.
2. We must handle the StartTag and EndTag notifications.
3. We need some variables inside this class to store the desired m_tagname, m_attrib, and m_attrib_value.
4. Inside StartTag we receive a notification for each tag the parser finds; we consult the pTag pointer for the tag name, attribute name, and attribute value; if we find that tag, we set a bool m_bCanStartSearch = 1:

BOOL m_bCanStartSearch;
CString m_szTagStack;

StartTag(...)
{
    if (m_tagname == pTag->getTagName() && attrib == m_attrib && attribval == m_attrib_value) {
        StoreTagData(pTag); // save tag start/end pointers
        m_bCanStartSearch = 1;
    }
    if (m_bCanStartSearch) {
        m_szTagStack = "/" + pTag->getTagName() + m_szTagStack;
    }
}

5. In EndTag, if we have started tracing, we can delete the tags added inside StartTag; we delete the tag from the beginning of the string if the deleted tag matches the last one added.

BOOL m_bTagFound = 0;

EndTag(...)
{
    CString szDeletedTag = "/" + pTag->getTagName();
    if (m_bCanStartSearch) {
        if (m_szTagStack == szDeletedTag) {
            m_bTagFound = 1;
            StoreTagData(pTag);
        }
        if (m_szTagStack.Find(szDeletedTag) == 0)
            m_szTagStack = m_szTagStack.Right(m_szTagStack.GetLength() - szDeletedTag.GetLength());
    }
}

6. We must handle some special situations for tags like <br> and <img>, which get added and never deleted by EndTag because they don't always have the ending character, as in <br/>. For this I have added a small function inside CLiteHTMLTag:

BOOL IsTagInline() { return m_bIsInline; }

where
m_bIsInline = bClosingTag & bOpeningTag;
is filled during tag parsing inside CLiteHTMLTag::parseFromStr.
So add inside StartTag:

if (m_bCanStartSearch) {
    if (pTag->getTagName() == "br" || pTag->getTagName() == "img") {
        if (pTag->IsTagInline())
            m_szTagStack = "/" + pTag->getTagName() + m_szTagStack;
    }
    else
        m_szTagStack = "/" + pTag->getTagName() + m_szTagStack;
}

Perhaps there are more tags to handle in this way, or perhaps somebody will find another method for handling this.
7. The last thing is to handle the start and end position for each tag <tagname> and <tagname/>; for this we must add 2 LPCSTR pointers inside CLiteHTMLTag, which we fill at parsing time: LPCSTR m_pTagStartPos, m_pTagEndPos, where m_pTagStartPos points to "<" and m_pTagEndPos to ">". At the end of CLiteHTMLTag::parseFromStr fill these variables:

m_pTagEndPos = lpszEnd;
m_pTagStartPos = lpszString;

and add 2 functions for easy access to these members, like

GetTagStart() { return m_pTagStartPos; }

Now we store the start/end tag pointers inside

StoreTagData(pTag)
{
    if (m_bTagFound) {
        // store end tag
        m_endTagStart = pTag->GetTagStart();
        m_endTagEnd = pTag->GetTagEnd();
    }
    else {
        // store start tag
        m_startTagStart = pTag->GetTagStart();
        m_startTagEnd = pTag->GetTagEnd();
    }
}

Finally we can define some functions inside our CEventHandler to retrieve the inner/outer HTML:

CString CEventHandler::get_outerHTML()
{
    CString szRet;
    if (!m_bTagFound) {
        return "";
    }
    // outer HTML runs from the start of the opening tag to the end of the closing tag
    szRet = CString(m_startTagStart, m_endTagEnd - m_startTagStart);
    return szRet;
}

Add get_innerHTML() yourself.
Using this code we can retrieve the HTML code for a given tag name, and we don't need the mshtml leak generator anymore.
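The tag-stack idea in this post can be tried out in isolation. The sketch below is a simplified, self-contained variant, not the poster's code: `ElementTracker` is a hypothetical class, it uses a `std::vector` of names instead of a concatenated string, it matches on tag name only (no attribute test), and it assumes tag names arrive in lower case and that `onEnd` is called only for explicit end tags.

```cpp
#include <string>
#include <vector>

// Hypothetical tracker: fed the tag names from StartTag/EndTag callbacks,
// it reports when the end tag that closes the wanted element arrives.
class ElementTracker {
public:
    explicit ElementTracker(const std::string& wanted) : m_wanted(wanted) {}

    // Call from StartTag. `selfClosed` is true for tags written as <br/>.
    void onStart(const std::string& name, bool selfClosed) {
        if (m_found) return;
        if (m_open.empty() && name != m_wanted) return;   // wanted element not seen yet
        if (selfClosed || isVoid(name)) return;           // no matching end tag will follow
        m_open.push_back(name);
    }

    // Call from EndTag. Returns true exactly once: when the wanted element closes.
    // End tags that do not match the innermost open tag are ignored.
    bool onEnd(const std::string& name) {
        if (m_found || m_open.empty() || m_open.back() != name) return false;
        m_open.pop_back();
        m_found = m_open.empty();
        return m_found;
    }

    bool found() const { return m_found; }

private:
    static bool isVoid(const std::string& name) { return name == "br" || name == "img"; }

    std::string m_wanted;
    std::vector<std::string> m_open;   // names of elements opened since the match
    bool m_found = false;
};
```

In a real handler, the positions of the opening tag and of the end tag for which `onEnd` returns true would be recorded, as the post describes, so that the text between them can be copied out afterwards.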
|
There is a problem with this class: it doesn't correctly handle </tagname > or <tagname >, where one or more spaces follow the tag name. To fix this, inside CLiteHTMLTag::parseFromStr, add:

//fix: rem white spaces till the end </tagname > or <tagname >
while (::_istspace(*lpszEnd))
    lpszEnd = ::_tcsinc(lpszEnd);

// is this a closing tag?
if (bClosingTag)

This class will also fail to correctly parse HTML that has <script> inside, because inside scripts we can have the following situation:

document.write("<div>");
document.write("</" + "div>");

This will fool the tokenizer, which won't be able to find the end of the tag; to fix this we need to skip processing for script elements. In CLiteHTMLReader::parseDocument add:

CLiteHTMLTag oTag; // tag information
bool bInsideScript = 0;

and a few lines down

if (!parseComment(strComment))
{
    bIsOpeningTag = false;
    bIsClosingTag = false;
    if (!parseTag(oTag, bIsOpeningTag, bIsClosingTag, bInsideScript))
    {
        ++dwCharDataLen;
        // manually advance buffer position
        // because the last call to UngetChar()
        // moved it back one character
        ch = ReadChar();

        break;
    }
    else
    {
        // WE ENTER IN SCRIPT MODE
        if (bIsOpeningTag && !bInsideScript) {
            if (!oTag.getTagName().CompareNoCase("script"))
                if (!oTag.IsTagInline())
                    bInsideScript = 1;
        }
        if (bIsClosingTag && bInsideScript) {
            if (!oTag.getTagName().CompareNoCase("script"))
                bInsideScript = 0;
        }
    }
}

Also change the definitions, adding the parameter bIsInsideScript, for

CLiteHTMLReader::parseTag(CLiteHTMLTag &rTag, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript)

and

inline UINT CLiteHTMLTag::parseFromStr(LPCTSTR lpszString, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript, bool bParseAttrib /* = true */)

Then add inside CLiteHTMLTag::parseFromStr, just where we added the first modification:

// if it is any tag other than /script
if (bIsInsideScript) {
    if (!bClosingTag)
        return 0U;
    if (strTagName.CompareNoCase("script"))
        return 0U;
}

//fix: rem white spaces till end </tagname > or <tagname >
while (::_istspace(*lpszEnd))
    lpszEnd = ::_tcsinc(lpszEnd);

// is this a closing tag?
if (bClosingTag)
{

oTag.IsTagInline() is defined like this:

BOOL IsTagInline() { return m_bIsInline; }

inline UINT CLiteHTMLTag::parseFromStr(LPCTSTR lpszString, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript, bool bParseAttrib /* = true */)
{
    ....
    m_bIsInline = bClosingTag & bOpeningTag;
    return (nRetVal);
}

With these fixes we can correctly parse HTML files that have scripts inside. This code helped me get rid of the mshtml parser, which is in fact a memory leak generator. I had used it inside one of my programs, eZWeather, and mshtml, working with a complex HTML page, constantly increased the size of the program in memory with each hour, no matter what tricks I used (see my comments here).
modified on Saturday, April 26, 2008 6:43 PM
|
Hi, thanks for making such a good parser. This parser uses the MFC libraries, which makes it platform dependent. Do you have any parser that is platform independent? If yes, could you please send me the code?
Secondly, can I have permission to modify your code for commercial use as I need?
Thanks in advance.
With Regards, Sumit Modi
|
class CEventHandler : public ILiteHTMLReaderEvents
{
private:
    void BeginParse(DWORD dwAppData, bool &bAbort);
    void StartTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
    void EndTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
    void Characters(const CString &rText, DWORD dwAppData, bool &bAbort);
    void Comment(const CString &rComment, DWORD dwAppData, bool &bAbort);
    void EndParse(DWORD dwAppData, bool bIsAborted);
};

int main(int argc, char* argv[])
{
    CEventHandler theEventHandler;
    CLiteHTMLReader theReader;
    theReader.setEventHandler(&theEventHandler);
}

I tried step 4 and got an error. Can you explain how to fix it?
error LNK2001: unresolved external symbol "private: virtual void __thiscall CEventHandler::EndParse(unsigned long,bool)" (?EndParse@CEventHandler@@EAEXK_N@Z)
|
People, you must learn some C++.
Those are only the function declarations:
void EndParse(DWORD dwAppData, bool bIsAborted);
The linker error tells you that you must implement a body for each function, like:
void CEventHandler::EndParse(DWORD dwAppData, bool bIsAborted)
{
    AfxMessageBox(_T("we have finished the parsing!"));
}
|
I have got a version of HTMLReader working under UNICODE. The only thing I'm not sure of is ReadFile (no need to use in the context of my application, so not tested under UNICODE).
The only corrections required in order to use it with Read (string) for a UNICODE string were:
1) wrap string constants in _T() (about 166 or so of them)
2) change TRACE1 to TRACE (TRACE1 seems problematic under UNICODE)
3) fix a character counting flaw in attribute handling (counts a sizeof(TCHAR) where it should be just 1)
If anyone is interested, let me know.
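The third correction is a general pitfall with character pointers rather than anything specific to this library, so it can be shown in isolation. The sketch below is an assumption about the shape of the flaw (the affected library lines are not quoted in the post): advancing a character pointer by `sizeof(Char)` instead of by 1. The function names are made up for the example.

```cpp
#include <cstddef>

// Number of characters between two pointers into the same buffer.
// Pointer subtraction already counts elements, for any character type.
template <typename Char>
std::size_t charsBetween(const Char* begin, const Char* end) {
    return static_cast<std::size_t>(end - begin);
}

// Skips one character. Adding 1 is correct for both char and wchar_t,
// because pointer arithmetic is scaled by the element size.
template <typename Char>
const Char* skipOne(const Char* p) {
    return p + 1;
}

// The flawed pattern, kept only for comparison: in a wide-character build
// this skips sizeof(Char) characters, and it only happens to work when
// sizeof(Char) == 1, which is why the bug stays hidden in ANSI builds.
template <typename Char>
const Char* skipOneBuggy(const Char* p) {
    return p + sizeof(Char);
}
```

For a `char` buffer both versions land on the same character; for a `wchar_t` buffer the flawed version overshoots, which matches the report that the library worked in ANSI builds and miscounted under UNICODE.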
|
Where are the sizeof(TCHAR) entries that need to be changed to 1?
Never mind, I found them. One is in LiteHTMLAttributes.h, two are in LiteHTMLReader.h, and one is in LiteHTMLReader.cpp.
|
Not sure if anyone is still watching this article, but how would I grab the text right after a tag, as in the case of a link?
<a href="http://www.sample.com">The Sample Website</a>
How would I retrieve the text "The Sample Website"?
Thanks a lot.
|
In your example, your event handler will see a sequence of three calls:
1) StartTag - for the <a> tag (with the href attribute),
2) Characters - for the text appearing between the start and end tags, and
3) EndTag - for the </a> ending tag
If you want to collect the link into an application object, you'd have to create/initialize it when you get the StartTag call for the <a ...>, and gather all the text which appears in subsequent Characters calls until you get the EndTag call for the </a>. Bear in mind that if you have other tags occurring between the <a href=...> and </a>, or newlines for that matter, you'll get a sequence of Characters calls, not just one.
e.g.
<a href="http://www.sample.com">
The
Sample
Website
</a>
will give you at least 4 Characters calls because of all the newlines.
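The collecting approach described in this answer can be sketched without MFC as follows. `LinkCollector` is a hypothetical class with `std::string` stand-ins for `CString`; in a real handler its three methods would be called from StartTag, Characters, and EndTag, with the tag name and href value taken from the `CLiteHTMLTag` passed in. Trimming the surrounding whitespace at the end is an extra choice made for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical collector: gathers (href, text) pairs for <a> elements.
class LinkCollector {
public:
    // name: tag name; href: value of the href attribute, empty if absent.
    void onStartTag(const std::string& name, const std::string& href) {
        if (name == "a") { m_inLink = true; m_href = href; m_text.clear(); }
    }

    // May be called many times per link: once per run of text between tags.
    void onCharacters(const std::string& text) {
        if (m_inLink) m_text += text;
    }

    void onEndTag(const std::string& name) {
        if (name == "a" && m_inLink) {
            m_links.push_back(std::make_pair(m_href, trim(m_text)));
            m_inLink = false;
        }
    }

    const std::vector<std::pair<std::string, std::string> >& links() const { return m_links; }

private:
    // Removes leading and trailing whitespace, including newlines.
    static std::string trim(const std::string& s) {
        const char* ws = " \t\r\n";
        std::string::size_type first = s.find_first_not_of(ws);
        if (first == std::string::npos) return std::string();
        std::string::size_type last = s.find_last_not_of(ws);
        return s.substr(first, last - first + 1);
    }

    bool m_inLink = false;
    std::string m_href, m_text;
    std::vector<std::pair<std::string, std::string> > m_links;
};
```

Because the text is appended on every Characters call and only finalized on the closing tag, it does not matter whether the link text arrives in one piece or in several, or whether other tags such as <b> appear inside the link.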
|
Parsing an HTML file (572 KB) to get all the image URLs from it takes me 57 seconds. How can I make it faster? Could you give me a code sample?
Joe Lee
|
dirlee wrote: parse a html file(572KB),get all image URL from this file,take me 57seconds, how can i make it faster,could u give me some code sample?
That does seem like an awfully long time. I have to wonder what you're doing with a 572KB HTML file though!
I suggest you use either the VC++ Profiler or Glowcode (www.glowcode.com) to see where the hotspots are and fix them. With a bit of luck it won't take you long to get a significant improvement.
FYI, the pugXML parser here on CP can parse a 10M XML file in less than a second, using MMF.
Good luck.
Neville Franks, Author of Surfulater www.surfulater.com "Save what you Surf" and ED for Windows www.getsoft.com
|
You have a wonderful library. I am trying to use it, but in debug mode I get a lot of memory leak messages like this one:
strcore.cpp(118) : {387} normal block at 0x00437320, 24 bytes long.
Data: < ligh> 01 00 00 00 0B 00 00 00 0B 00 00 00 6C 69 67 68
I do not have a clue where this is coming from.
-- modified at 16:07 Thursday 15th December, 2005
|
I saw this post and tried running my test project with memory leak detection enabled in the C-runtime.
When I use a call to _CrtDumpMemoryLeaks() at the end of my executable's _tmain method to dump memory leaks, I get the same results--75 or so leaks that look like:
strcore.cpp(118) : {379} normal block at 0x002FCC80, 22 bytes long.
Data: < alic> 01 00 00 00 09 00 00 00 09 00 00 00 61 6C 69 63
However, if I comment out the call to _CrtDumpMemoryLeaks() and instead use the following code (at the beginning of my _tmain method) to dump leaks
_CrtSetDbgFlag( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
no leaks are reported. (See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vsdebug/html/vxcondetectingisolatingmemoryleaks.asp for more info on leak reporting options.) Using this second method to dump leaks may be a little more reliable, because I imagine it causes leak checking to be done later, after the C runtime has had a chance to do more cleanup. My guess is that there were some statics/globals that didn't get cleaned up until after _tmain terminated; when _CrtDumpMemoryLeaks was called at the end of _tmain, it reported these as leaks.
I did a sanity check on the second reporting method and inserted a dummy leak into my code...the leak was reported as expected, so this method seems reliable.
So there may not be a memory leak in HTML Reader after all.