|
|
 Prize winner in Competition
"MFC/C++ Feb 2004"
Comments and Discussions
|
|
 |

|
CLiteHTMLElemAttr::parseFromStr(LPCTSTR lpszString) returns too many characters when compiled in Unicode, because it adds sizeof(TCHAR), a byte count, to a character count. Line 705 of LiteHTMLAttributes.h:
return (UINT)((lpszEnd - lpszString) +
(ch == _T('\'') || ch == _T('\"') ? sizeof(TCHAR) : 0) );
should be changed to:
return (UINT)((lpszEnd - lpszString) +
(ch == _T('\'') || ch == _T('\"') ? 1 : 0) );
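To see why the sizeof(TCHAR) version over-counts, here is a minimal portable sketch. It uses wchar_t in place of TCHAR (as in a Unicode build), and consumedChars is a hypothetical stand-in for the tail of parseFromStr, not the library's code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the end of parseFromStr: returns how many
// characters were consumed, given the start of the input, the position of
// the end of the attribute value, and the quote character that delimited
// the value (0 if it was unquoted). wchar_t plays the role of TCHAR.
unsigned consumedChars(const wchar_t* start, const wchar_t* end, wchar_t quote)
{
    assert(start != nullptr && end >= start);
    // Pointer subtraction already yields a count of characters, not bytes,
    // so the closing quote adds exactly 1, never sizeof(wchar_t).
    const std::ptrdiff_t n = end - start;
    return static_cast<unsigned>(n + ((quote == L'\'' || quote == L'"') ? 1 : 0));
}
```

For the input x='ab' y the closing quote sits at index 5, so consumedChars returns 6 and the caller resumes at the space. Adding sizeof(wchar_t) instead (2 with Visual C++) would return 7 and skip a character.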
|
|
|
|

|
I added some additional entities to CLiteHTMLEntityResolver::CCharEntityRefs::CCharEntityRefs() in the file LiteHTMLEntityResolver.h (see the W3Schools Entity Reference):
(*this)[_T("forall")] = _T('\x2200');
(*this)[_T("part") ] = _T('\x2202');
(*this)[_T("exist") ] = _T('\x2203');
(*this)[_T("empty") ] = _T('\x2205');
(*this)[_T("nabla") ] = _T('\x2207');
(*this)[_T("isin") ] = _T('\x2208');
(*this)[_T("notin") ] = _T('\x2209');
(*this)[_T("ni") ] = _T('\x220b');
(*this)[_T("prod") ] = _T('\x220f');
(*this)[_T("sum") ] = _T('\x2211');
(*this)[_T("minus") ] = _T('\x2212');
(*this)[_T("lowast")] = _T('\x2217');
(*this)[_T("radic") ] = _T('\x221a');
(*this)[_T("prop") ] = _T('\x221d');
(*this)[_T("infin") ] = _T('\x221e');
(*this)[_T("ang") ] = _T('\x2220');
(*this)[_T("and") ] = _T('\x2227');
(*this)[_T("or") ] = _T('\x2228');
(*this)[_T("cap") ] = _T('\x2229');
(*this)[_T("cup") ] = _T('\x222a');
(*this)[_T("int") ] = _T('\x222b');
(*this)[_T("there4")] = _T('\x2234');
(*this)[_T("sim") ] = _T('\x223c');
(*this)[_T("cong") ] = _T('\x2245');
(*this)[_T("asymp") ] = _T('\x2248');
(*this)[_T("ne") ] = _T('\x2260');
(*this)[_T("equiv") ] = _T('\x2261');
(*this)[_T("le") ] = _T('\x2264');
(*this)[_T("ge") ] = _T('\x2265');
(*this)[_T("sub") ] = _T('\x2282');
(*this)[_T("sup") ] = _T('\x2283');
(*this)[_T("nsub") ] = _T('\x2284');
(*this)[_T("sube") ] = _T('\x2286');
(*this)[_T("supe") ] = _T('\x2287');
(*this)[_T("oplus") ] = _T('\x2295');
(*this)[_T("otimes")] = _T('\x2297');
(*this)[_T("perp") ] = _T('\x22a5');
(*this)[_T("sdot") ] = _T('\x22c5');
(*this)[_T("Alpha") ] = _T('\x391');
(*this)[_T("Beta") ] = _T('\x392');
(*this)[_T("Gamma") ] = _T('\x393');
(*this)[_T("Delta") ] = _T('\x394');
(*this)[_T("Epsilon") ] = _T('\x395');
(*this)[_T("Zeta") ] = _T('\x396');
(*this)[_T("Eta") ] = _T('\x397');
(*this)[_T("Theta") ] = _T('\x398');
(*this)[_T("Iota") ] = _T('\x399');
(*this)[_T("Kappa") ] = _T('\x39a');
(*this)[_T("Lambda") ] = _T('\x39b');
(*this)[_T("Mu") ] = _T('\x39c');
(*this)[_T("Nu") ] = _T('\x39d');
(*this)[_T("Xi") ] = _T('\x39e');
(*this)[_T("Omicron") ] = _T('\x39f');
(*this)[_T("Pi") ] = _T('\x3a0');
(*this)[_T("Rho") ] = _T('\x3a1');
(*this)[_T("Sigma") ] = _T('\x3a3');
(*this)[_T("Tau") ] = _T('\x3a4');
(*this)[_T("Upsilon") ] = _T('\x3a5');
(*this)[_T("Phi") ] = _T('\x3a6');
(*this)[_T("Chi") ] = _T('\x3a7');
(*this)[_T("Psi") ] = _T('\x3a8');
(*this)[_T("Omega") ] = _T('\x3a9');
(*this)[_T("alpha") ] = _T('\x3b1');
(*this)[_T("beta") ] = _T('\x3b2');
(*this)[_T("gamma") ] = _T('\x3b3');
(*this)[_T("delta") ] = _T('\x3b4');
(*this)[_T("epsilon") ] = _T('\x3b5');
(*this)[_T("zeta") ] = _T('\x3b6');
(*this)[_T("eta") ] = _T('\x3b7');
(*this)[_T("theta") ] = _T('\x3b8');
(*this)[_T("iota") ] = _T('\x3b9');
(*this)[_T("kappa") ] = _T('\x3ba');
(*this)[_T("lambda") ] = _T('\x3bb');
(*this)[_T("mu") ] = _T('\x3bc');
(*this)[_T("nu") ] = _T('\x3bd');
(*this)[_T("xi") ] = _T('\x3be');
(*this)[_T("omicron") ] = _T('\x3bf');
(*this)[_T("pi") ] = _T('\x3c0');
(*this)[_T("rho") ] = _T('\x3c1');
(*this)[_T("sigmaf") ] = _T('\x3c2');
(*this)[_T("sigma") ] = _T('\x3c3');
(*this)[_T("tau") ] = _T('\x3c4');
(*this)[_T("upsilon") ] = _T('\x3c5');
(*this)[_T("phi") ] = _T('\x3c6');
(*this)[_T("chi") ] = _T('\x3c7');
(*this)[_T("psi") ] = _T('\x3c8');
(*this)[_T("omega") ] = _T('\x3c9');
(*this)[_T("thetasym")] = _T('\x3d1');
(*this)[_T("upsih") ] = _T('\x3d2');
(*this)[_T("piv") ] = _T('\x3d6');
(*this)[_T("OElig") ] = _T('\x152');
(*this)[_T("oelig") ] = _T('\x153');
(*this)[_T("Scaron")] = _T('\x160');
(*this)[_T("scaron")] = _T('\x161');
(*this)[_T("Yuml") ] = _T('\x178');
(*this)[_T("fnof") ] = _T('\x192');
(*this)[_T("circ") ] = _T('\x2c6');
(*this)[_T("tilde") ] = _T('\x2dc');
(*this)[_T("ensp") ] = _T('\x2002');
(*this)[_T("emsp") ] = _T('\x2003');
(*this)[_T("thinsp")] = _T('\x2009');
(*this)[_T("zwnj") ] = _T('\x200c');
(*this)[_T("zwj") ] = _T('\x200d');
(*this)[_T("lrm") ] = _T('\x200e');
(*this)[_T("rlm") ] = _T('\x200f');
(*this)[_T("ndash") ] = _T('\x2013');
(*this)[_T("mdash") ] = _T('\x2014');
(*this)[_T("lsquo") ] = _T('\x2018');
(*this)[_T("rsquo") ] = _T('\x2019');
(*this)[_T("sbquo") ] = _T('\x201a');
(*this)[_T("ldquo") ] = _T('\x201c');
(*this)[_T("rdquo") ] = _T('\x201d');
(*this)[_T("bdquo") ] = _T('\x201e');
(*this)[_T("dagger")] = _T('\x2020');
(*this)[_T("Dagger")] = _T('\x2021');
(*this)[_T("bull") ] = _T('\x2022');
(*this)[_T("hellip")] = _T('\x2026');
(*this)[_T("permil")] = _T('\x2030');
(*this)[_T("prime") ] = _T('\x2032');
(*this)[_T("Prime") ] = _T('\x2033');
(*this)[_T("lsaquo")] = _T('\x2039');
(*this)[_T("rsaquo")] = _T('\x203a');
(*this)[_T("oline") ] = _T('\x203e');
(*this)[_T("euro") ] = _T('\x20ac');
(*this)[_T("trade") ] = _T('\x2122');
(*this)[_T("larr") ] = _T('\x2190');
(*this)[_T("uarr") ] = _T('\x2191');
(*this)[_T("rarr") ] = _T('\x2192');
(*this)[_T("darr") ] = _T('\x2193');
(*this)[_T("harr") ] = _T('\x2194');
(*this)[_T("crarr") ] = _T('\x21b5');
(*this)[_T("lceil") ] = _T('\x2308');
(*this)[_T("rceil") ] = _T('\x2309');
(*this)[_T("lfloor")] = _T('\x230a');
(*this)[_T("rfloor")] = _T('\x230b');
(*this)[_T("loz") ] = _T('\x25ca');
(*this)[_T("spades")] = _T('\x2660');
(*this)[_T("clubs") ] = _T('\x2663');
(*this)[_T("hearts")] = _T('\x2665');
(*this)[_T("diams") ] = _T('\x2666');
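As a rough illustration of how such a table is consulted, here is a minimal MFC-free sketch. makeEntityTable and resolveEntity are hypothetical names, std::map stands in for the library's own map class, and only a handful of the entries above are reproduced. Note that values such as 0x2200 only fit in a character in a Unicode build, which is why the sketch uses wchar_t.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for CCharEntityRefs: entity name -> character.
std::map<std::wstring, wchar_t> makeEntityTable()
{
    std::map<std::wstring, wchar_t> t;
    t[L"forall"] = L'\x2200';
    t[L"alpha"]  = L'\x3b1';
    t[L"euro"]   = L'\x20ac';
    t[L"hearts"] = L'\x2665';
    return t;
}

// Resolves a reference of the form "&name;" at the start of text.
// Returns the number of characters consumed (0 if text does not start with
// a known entity) and stores the substitute character in out.
unsigned resolveEntity(const std::map<std::wstring, wchar_t>& table,
                       const std::wstring& text, wchar_t& out)
{
    if (text.empty() || text[0] != L'&')
        return 0;
    const std::wstring::size_type semi = text.find(L';');
    if (semi == std::wstring::npos)
        return 0;
    assert(semi >= 1);  // the '&' occupies index 0
    const auto it = table.find(text.substr(1, semi - 1));
    if (it == table.end())
        return 0;
    out = it->second;
    return static_cast<unsigned>(semi + 1);
}
```

With this table, resolving the text &euro;10 consumes 6 characters and yields U+20AC, while an unknown name such as &bogus; consumes nothing.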
|
|
|
|

|
Hello Gurmeet
Did you port this library so that it no longer needs MFC, Gurmeet?
Kamal
|
|
|
|

|
Hi Gurmeet
I have a question: I am getting the following errors:
---------
Error 2 error LNK2001: unresolved external symbol "protected: virtual unsigned int __thiscall CLiteHTMLReader::parseDocument(void)" (?parseDocument@CLiteHTMLReader@@MAEIXZ) Test_Proj.obj
Error 3 error LNK2019: unresolved external symbol "public: unsigned int __thiscall CLiteHTMLReader::Read(char const *)" (?Read@CLiteHTMLReader@@QAEIPBD@Z) referenced in function _main Test_Proj.obj
Error 4 error LNK2019: unresolved external symbol "private: static class CLiteHTMLEntityResolver::CCharEntityRefs CLiteHTMLEntityResolver::m_CharEntityRefs" (?m_CharEntityRefs@CLiteHTMLEntityResolver@@0VCCharEntityRefs@1@A) referenced in function "public: static unsigned int __cdecl CLiteHTMLEntityResolver::resolveEntity(char const *,char &)" (?resolveEntity@CLiteHTMLEntityResolver@@SAIPBDAAD@Z) Test_Proj.obj
Error 5 error LNK2019: unresolved external symbol "private: static class CMap<class ATL::CStringT<char,class StrTraitMFC<char,class ATL::ChTraitsCRT<char> > >,char const *,unsigned long,unsigned long> CLiteHTMLElemAttr::_namedColors" (?_namedColors@CLiteHTMLElemAttr@@0V?$CMap@V?$CStringT@DV?$StrTraitMFC@DV?$ChTraitsCRT@D@ATL@@@@@ATL@@PBDKK@@A) referenced in function "private: static void __cdecl CLiteHTMLElemAttr::Init(void)" (?Init@CLiteHTMLElemAttr@@CAXXZ) Test_Proj.obj
Error 6 fatal error LNK1120: 4 unresolved externals C:\Documents and Settings\rohit_sahni\My Documents\Visual Studio 2005\Projects\HTML_Reader\Debug\TestProject.exe
---------
when I tried to build the code.
Can you please help?
Thanks
Ro..
|
|
|
|

|
Hi, I am trying to use HTML Reader, but there is no example, tutorial or test in the source files. How do I use it to parse an HTML page and extract specific data? Can you help me?
thank you
|
|
|
|

|
stdafx.h Error: No such file or directory.
Help me!
|
|
|
|

|
I have the same problem in VS 2008.
|
|
|
|

|
Hi,
I have downloaded your lib and added it to my MFC project. I get over 306 errors with code C2679 when I compile the project in Visual Studio 2005. They are all in the file LiteHTMLAttributes.h. The compiler has problems with lines beginning with
_namedColors["something"] =
It says: "error C2679: binary operator '[' : no operator found which takes a right-hand operand of type 'const char [13]' (or there is no acceptable conversion) "
I get also some warnings of type C4244 in the file LiteHTMLEntityResolver.
warning C4244: 'return' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 229
warning C4244: 'Argument' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 236
warning C4244: 'return' conversion from '__w64 int' to 'UINT', possible loss of data litehtmlentityresolver.h 280
Please give me a solution for these warnings/errors and correct the code in the library.
Best regards
|
|
|
|

|
please use:
_namedColors[_T("something")]
I think therefore I am.
——Rene Descartes
|
|
|
|

|
Cast it like this: _namedColors[(CString)"activeborder"]
good luck
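The C2679 errors most likely appear because the project is built as Unicode, so the map key is a wide string and a narrow literal cannot be used as the index. A minimal portable sketch of the same situation follows; MY_T is a hypothetical stand-in for the _T() macro in a Unicode build, std::map stands in for the library's map, and the colour values are only illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for _T() when UNICODE is defined: "text" -> L"text".
#define MY_T(s) L##s

// Stand-in for _namedColors, keyed by a wide string.
std::map<std::wstring, unsigned long> makeNamedColors()
{
    std::map<std::wstring, unsigned long> colors;
    colors[MY_T("red")]  = 0x0000FFUL;  // illustrative values only
    colors[MY_T("blue")] = 0xFF0000UL;
    // colors["red"] = 0;  // would not compile: a const char[4] cannot be
    //                     // converted to the wide-string key type
    assert(colors.size() == 2);
    return colors;
}
```

Wrapping the literal in the macro keeps the same source compiling in both narrow and wide builds, which is why it is usually preferred over a cast.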
|
|
|
|

|
The parser does not return the correct attribute of an a tag when the value contains "//".
The bug is in the function parseFromStr in the file LiteHTMLAttributes.h.
Please check it.
|
|
|
|

|
First you must apply my previous fixes so that the class works correctly with this code.
This is a nice feature to have when using this class; after a few hours of research I found a solution:
Suppose we need the tag <div class=myclass>, which lies deep inside an HTML page.
1. We have to implement CEventHandler : public ILiteHTMLReaderEvents.
2. We must handle the StartTag and EndTag notifications.
3. We must have some variables inside this class to store the desired m_tagname, m_attrib and m_attrib_value.
4. Inside StartTag we receive a notification for each tag the parser finds; we consult the pTag pointer for the tag name, attribute name and attribute value; if we find that tag, we set a flag m_bCanStartSearch = 1, and from then on every tag is pushed onto the front of a tag stack string:
BOOL m_bCanStartSearch;
CString m_szTagStack;
StartTag(...)
{
    // attrib and attribval are the attribute name and value read from pTag
    if(m_tagname==pTag->getTagName()&&attrib==m_attrib&&attribval==m_attrib_value){
        StoreTagData(pTag); m_bCanStartSearch = 1;
    }
    if(m_bCanStartSearch){
        m_szTagStack=_T("/")+pTag->getTagName()+m_szTagStack;
    }
}
5. In EndTag, once tracing has started, we remove the tags that were added in StartTag; we delete the tag from the beginning of the string if it matches the most recently added one.
BOOL m_bTagFound = 0;
EndTag(...)
{
    CString szDeletedTag = _T("/")+pTag->getTagName();
    if(m_bCanStartSearch){
        if(m_szTagStack==szDeletedTag){
            // only the searched tag is left on the stack: this is its end tag
            m_bTagFound = 1;
            StoreTagData(pTag);
        }
        if(m_szTagStack.Find(szDeletedTag)==0)
            m_szTagStack = m_szTagStack.Right(m_szTagStack.GetLength()-szDeletedTag.GetLength());
    }
}
6. We must handle some special cases for tags like <br> and <img>, which get added but are never removed by EndTag because they are not always written in the self-closing form <br/>. For this I have added a small function inside CLiteHTMLTag:
BOOL IsTagInline(){ return m_bIsInline; }
where
m_bIsInline = bClosingTag&bOpeningTag;
is filled during tag parsing inside CLiteHTMLTag::parseFromStr.
So inside StartTag the push becomes:
if(m_bCanStartSearch){
    if(pTag->getTagName()==_T("br")||pTag->getTagName()==_T("img")){
        if(pTag->IsTagInline())
            m_szTagStack=_T("/")+pTag->getTagName()+m_szTagStack;
    }
    else
        m_szTagStack=_T("/")+pTag->getTagName()+m_szTagStack;
}
Perhaps there are more tags to handle in this way, or perhaps somebody will find another method for handling this.
7. The last thing is to record the start and end position of each tag, <tagname> and </tagname>; for this we must add 2 LPCTSTR pointers inside CLiteHTMLTag, which we fill at parsing time:
LPCTSTR m_pTagStartPos, m_pTagEndPos;
where m_pTagStartPos points to "<" and m_pTagEndPos to ">".
At the end of CLiteHTMLTag::parseFromStr fill these variables:
m_pTagEndPos = lpszEnd;
m_pTagStartPos = lpszString;
and add 2 functions for easy access to these members, like
LPCTSTR GetTagStart(){ return m_pTagStartPos; }
LPCTSTR GetTagEnd(){ return m_pTagEndPos; }
Now we store the start/end tag pointers inside
StoreTagData(pTag)
{
    if(m_bTagFound){
        m_endTagStart = pTag->GetTagStart();
        m_endTagEnd = pTag->GetTagEnd();
    }
    else{
        m_startTagStart = pTag->GetTagStart();
        m_startTagEnd = pTag->GetTagEnd();
    }
}
Finally we can define some functions inside our CEventHandler to retrieve the inner/outer HTML:
CString CEventHandler::get_outerHTML(){
    CString szRet;
    if(!m_bTagFound){
        return _T("");
    }
    // outer HTML runs from the start of the start tag to the end of the end tag
    // (add 1 to the length if m_endTagEnd points at the '>' itself)
    szRet = CString(m_startTagStart, (int)(m_endTagEnd - m_startTagStart));
    return szRet;
}
Add get_innerHTML() yourself.
Using this code we can retrieve the HTML for a given tag name, and we don't need the mshtml leak generator anymore.
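The bookkeeping in steps 4 to 6 can also be sketched without MFC. The version below is only an illustration under stated assumptions: MatchingTagFinder is a hypothetical class, it matches on the tag name alone (no attribute check), it keeps the open tags in a std::vector instead of a CString, and it simply never pushes br and img rather than consulting IsTagInline().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical handler fed with start/end tag events in document order.
// It reports when the end tag that closes the target element is seen.
class MatchingTagFinder
{
public:
    explicit MatchingTagFinder(const std::string& target) : m_target(target)
    {
        assert(!m_target.empty());
    }

    // Call for every start tag. br and img have no end tag of their own,
    // so they are never pushed.
    void startTag(const std::string& name)
    {
        if (m_found)
            return;
        if (!m_tracing && name == m_target)
            m_tracing = true;              // found the element we want
        if (m_tracing && name != "br" && name != "img")
            m_stack.push_back(name);
    }

    // Call for every end tag. Returns true exactly once: when the end tag
    // that closes the target element arrives.
    bool endTag(const std::string& name)
    {
        if (!m_tracing || m_found || m_stack.empty())
            return false;
        if (m_stack.back() != name)
            return false;                  // stray end tag: ignore it
        m_stack.pop_back();
        if (m_stack.empty()) {
            m_found = true;                // stack drained: target is closed
            return true;
        }
        return false;
    }

    bool found() const { return m_found; }

private:
    std::string m_target;
    std::vector<std::string> m_stack;
    bool m_tracing = false;
    bool m_found = false;
};
```

Because nested elements with the same name are pushed too, an inner </div> only pops its own entry; the outer </div> is recognised when the stack becomes empty.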
|
|
|
|

|
Full of errors. Could you give a complete code sample? Thanks!
|
|
|
|

|
There is a problem with this class: it does not correctly handle </tagname > or <tagname >, where one or more spaces follow the tag name. To fix this, inside CLiteHTMLTag::parseFromStr, add:
// fix: skip white space up to the end of </tagname > or <tagname >
while (::_istspace(*lpszEnd))
lpszEnd = ::_tcsinc(lpszEnd);
// is this a closing tag?
if (bClosingTag)
This class will also fail to correctly parse HTML that contains <script>, because inside scripts we can have situations such as: document.write("<div>"); document.write("</" + "div>"); This fools the tokenizer, which is unable to find the end of the tag. To fix it we need to skip processing inside script elements. In CLiteHTMLReader::parseDocument add:
CLiteHTMLTag oTag; // tag information
bool bInsideScript = false;
and a few lines down:
if (!parseComment(strComment))
{
bIsOpeningTag = false;
bIsClosingTag = false;
if (!parseTag(oTag, bIsOpeningTag, bIsClosingTag, bInsideScript))
{
++dwCharDataLen;
// manually advance buffer position
// because the last call to UngetChar()
// moved it back one character
ch = ReadChar();
break;
}
else
{
//WE ENTER IN SCRIPT MODE
if(bIsOpeningTag&&!bInsideScript){
if(!oTag.getTagName().CompareNoCase("script"))
if(!oTag.IsTagInline())
bInsideScript = 1;
}
if(bIsClosingTag&&bInsideScript){
if(!oTag.getTagName().CompareNoCase("script"))
bInsideScript = 0;
}
}
}
Also change the definitions, adding the parameter for the script flag, for
CLiteHTMLReader::parseTag(CLiteHTMLTag &rTag, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript)
and
inline UINT CLiteHTMLTag::parseFromStr(LPCTSTR lpszString, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript, bool bParseAttrib)
Then, inside CLiteHTMLTag::parseFromStr, just where we added the first modification, add:
if(bIsInsideScript){
if (!bClosingTag)
return 0U;
if(strTagName.CompareNoCase("script"))
return 0U;
}
while (::_istspace(*lpszEnd))
lpszEnd = ::_tcsinc(lpszEnd);
if (bClosingTag)
{
oTag.IsTagInline() is defined like this:
BOOL IsTagInline(){ return m_bIsInline; }
inline UINT CLiteHTMLTag::parseFromStr(LPCTSTR lpszString, bool &bIsOpeningTag, bool &bIsClosingTag, bool &bIsInsideScript, bool bParseAttrib)
{
....
m_bIsInline = bClosingTag&bOpeningTag;
return (nRetVal);
}
With these fixes we can correctly parse HTML files that contain scripts.
This code helped me get rid of the mshtml parser, which is in fact a memory leak generator; I had used it inside one of my programs, eZWeather, and when working with a complex HTML page mshtml steadily increased the program's memory footprint with each hour, no matter what tricks I used (see my comments here).
modified on Saturday, April 26, 2008 6:43 PM
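Both fixes can be seen together in a small portable sketch. This is not the library's tokenizer: listTags is a hypothetical, much simplified scanner over std::string that ignores comments and quoted attribute values, and it only demonstrates skipping white space after the tag name and refusing every tag except </script> while inside a script.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Lower-cases an ASCII tag name so comparisons are case-insensitive.
static std::string lower(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Returns the tags it reports, end tags prefixed with '/'.
std::vector<std::string> listTags(const std::string& html)
{
    std::vector<std::string> tags;
    bool insideScript = false;
    std::string::size_type i = 0;
    while ((i = html.find('<', i)) != std::string::npos) {
        std::string::size_type p = i + 1;
        const bool closing = p < html.size() && html[p] == '/';
        if (closing)
            ++p;
        const std::string::size_type nameStart = p;
        while (p < html.size() && std::isalnum(static_cast<unsigned char>(html[p])))
            ++p;
        const std::string name = lower(html.substr(nameStart, p - nameStart));
        // fix 1: skip white space between the tag name and what follows
        while (p < html.size() && std::isspace(static_cast<unsigned char>(html[p])))
            ++p;
        assert(p <= html.size());
        // fix 2: inside a script only </script> counts as a tag
        const bool wanted = !insideScript || (closing && name == "script");
        if (name.empty() || !wanted) {
            ++i;                           // treat the '<' as plain text
            continue;
        }
        const std::string::size_type gt = html.find('>', p);
        if (gt == std::string::npos)
            break;                         // unterminated tag: stop scanning
        tags.push_back(closing ? "/" + name : name);
        if (name == "script")
            insideScript = !closing;       // enter or leave script mode
        i = gt + 1;
    }
    return tags;
}
```

Fed a page whose script writes "<div>" and "</" + "div>", the sketch reports the script element itself but none of the tag-like text inside it.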
|
|
|
|

|
Hi,
Thanks for making such a good parser.
This parser uses the MFC libraries, which makes it platform dependent.
Do you have a parser that is platform independent?
If so, could you please send me the code?
Secondly, may I have permission to modify your code for commercial use as I need?
Thanks in advance.
With Regards,
Sumit Modi
|
|
|
|

|
class CEventHandler : public ILiteHTMLReaderEvents
{
private:
void BeginParse(DWORD dwAppData, bool &bAbort);
void StartTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
void EndTag(CLiteHTMLTag *pTag, DWORD dwAppData, bool &bAbort);
void Characters(const CString &rText, DWORD dwAppData, bool &bAbort);
void Comment(const CString &rComment, DWORD dwAppData, bool &bAbort);
void EndParse(DWORD dwAppData, bool bIsAborted);
};
int main(int argc, char* argv[])
{
CEventHandler theEventHandler;
CLiteHTMLReader theReader;
theReader.setEventHandler(&theEventHandler);
}
I tried step 4 and got an error.
Can you explain how to fix it?
error LNK2001: unresolved external symbol "private: virtual void __thiscall CEventHandler::EndParse(unsigned long,bool)" (?EndParse@CEventHandler@@EAEXK_N@Z)
|
|
|
|
|

|
People, you must learn some C++.
These are only the function declarations:
void EndParse(DWORD dwAppData, bool bIsAborted);
The linker error tells you that you must implement a body for each function, like:
void CEventHandler::EndParse(DWORD dwAppData, bool bIsAborted)
{
    AfxMessageBox(_T("we have finished the parsing!"));
}
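The same point in a self-contained, MFC-free form: every member function that is declared in the handler needs a definition. IReaderEvents and SketchHandler are hypothetical stand-ins with only two of the callbacks, not the library's classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the reader's event interface.
struct IReaderEvents
{
    virtual ~IReaderEvents() {}
    virtual void BeginParse(unsigned long /*appData*/, bool& /*abort*/) {}
    virtual void EndParse(unsigned long /*appData*/, bool /*isAborted*/) {}
};

class SketchHandler : public IReaderEvents
{
public:
    // Declarations only: each one needs a matching body somewhere, or the
    // linker reports an unresolved external symbol.
    void BeginParse(unsigned long appData, bool& abort) override;
    void EndParse(unsigned long appData, bool isAborted) override;

    std::string log;
};

// The definitions the linker is asking for.
void SketchHandler::BeginParse(unsigned long, bool& abort)
{
    abort = false;
    log += "begin;";
}

void SketchHandler::EndParse(unsigned long, bool isAborted)
{
    assert(!log.empty());  // BeginParse is expected to run first
    log += isAborted ? "aborted;" : "done;";
}
```

If a callback is of no interest, either give it an empty body or leave its declaration out of the derived class so the base version is used.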
|
|
|
|

|
Thank you so much... works like a charm.
|
|
|
|

|
I have got a version of HTMLReader working under UNICODE. The only thing I'm not sure of is ReadFile (no need to use in the context of my application, so not tested under UNICODE).
The only corrections required in order to use it with Read (string) for a UNICODE string were:
1) wrap string constants in _T() (about 166 or so of them)
2) change TRACE1 to TRACE (TRACE1 seems problematic under UNICODE)
3) fix a character counting flaw in attribute handling (counts a sizeof(TCHAR) where it should be just 1)
If anyone is interested, let me know.
|
|
|
|

|
Where are the sizeof(TCHAR) entries that need to be changed to 1?
Never mind, I found them.
One is in LiteHTMLAttributes.h, two are in LiteHTMLReader.h, and one is in LiteHTMLReader.cpp.
|
|
|
|

|
Not sure if anyone is still watching this article but... how would I grab the text right after a tag, as in the case of a link? Given <a href="http://www.sample.com">The Sample Website</a>, how would I retrieve the text "The Sample Website"? Thanks a lot.
|
|
|
|

|
In your example, your event handler will see a sequence of three calls:
1) StartTag - for the <a> tag (with the href attribute),
2) Characters - for the text appearing between the start and end tags, and
3) EndTag - for the </a> ending tag.
If you want to collect the link into an application object, you'd have to create/initialize it when you get the StartTag call for the <a ...>, and gather all the text which appears in subsequent Characters calls until you get the EndTag call for the </a>.
Bear in mind that if you have other tags occurring between the <a href=...> and </a>, or newlines for that matter, you'll get a sequence of Characters calls, not just one. e.g.
<a href="http://www.sample.com">
The
Sample
Website
</a>
will give you at least 4 Characters calls because of all the newlines.
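The gathering described above can be sketched as follows. This is an MFC-free illustration, not the library's interface: LinkTextCollector and its method names are hypothetical, and a real handler would receive the same events through StartTag, Characters and EndTag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical handler: collects the text of every <a> element by appending
// each Characters chunk seen between its start and end tags.
class LinkTextCollector
{
public:
    void startTag(const std::string& name)
    {
        if (name == "a") {
            m_inLink = true;
            m_current.clear();
        }
    }

    void characters(const std::string& text)
    {
        if (m_inLink)
            m_current += text;   // may be called several times per link
    }

    void endTag(const std::string& name)
    {
        if (name == "a" && m_inLink) {
            m_links.push_back(m_current);
            m_inLink = false;
        }
    }

    const std::vector<std::string>& links() const { return m_links; }

private:
    bool m_inLink = false;
    std::string m_current;
    std::vector<std::string> m_links;
};

// Usage: the events a reader would raise for
// <a href="...">The <b>Sample</b> Website</a>
void linkTextExample()
{
    LinkTextCollector c;
    c.startTag("a");
    c.characters("The ");
    c.startTag("b");
    c.characters("Sample");
    c.endTag("b");
    c.characters(" Website");
    c.endTag("a");
    assert(c.links().size() == 1);
    assert(c.links()[0] == "The Sample Website");
}
```

Appending in characters() rather than assigning is what makes the collector robust against the text arriving in several pieces.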
|
|
|
|

|
Parsing an HTML file (572 KB) to get all the image URLs takes me 57 seconds.
How can I make it faster? Could you give me a code sample?
Joe Lee
|
|
|
|

|
dirlee wrote: Parsing an HTML file (572 KB) to get all the image URLs takes me 57 seconds.
How can I make it faster? Could you give me a code sample?
That does seem like an awfully long time. I have to wonder what you're doing with a 572KB HTML file though!
I suggest you use either the VC++ Profiler or Glowcode (www.glowcode.com) and see where the hotspots are and fix them. With a bit of luck it won't take you long to get a significant improvement.
FYI The pugXML parser here on CP can parse a 10M XML file in less than a second, using MMF.
Good luck.
Neville Franks, Author of Surfulater www.surfulater.com "Save what you Surf" and ED for Windows www.getsoft.com
|
|
|
|
 |
|
|
|
A lightweight, fast, simple, and low-overhead C++ class library based on push model parsing.
| Type | Article |
| Licence | |
| First Posted | 29 Mar 2004 |
| Views | 198,510 |
| Downloads | 4,544 |
| Bookmarked | 155 times |
|
|