Click here to Skip to main content
Rate this: bad
good
Please Sign up or sign in to vote.
See more: C++ XML HTML Parsing
I am using libxml2 in my VS2010 project to generate tree from HTML, find some nodes, modify it and dump tree back to HTML.
The main logic is:
// create parser
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, XML_CHAR_ENCODING_UTF8);
// set parser options
htmlCtxtUseOptions(parser, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
// parse HTML from pData
htmlParseChunk(parser, pData, dataLen, 0);
// get root node for generated tree
xmlNode* node = xmlDocGetRootElement(parser->myDoc);
// make some changes in tree, e.g. change content for node with name 'title'
// ...
// return back to HTML, result located in 'newHtml'
htmlDocDumpMemory(parser->myDoc, &newHtml, &len);
 
When I use HTML from http://www.youtube.com/watch?v=S77UrnEGs_g[^] as input data I get one excess in output.
 
I have checked above URL on http://validator.w3.org/[^] and get error:
Line 562, Column 31: Unclosed element div.
<div class="content">
My question is: could I configure libxml2 so it would not automatically close unclosed tags?
Posted 1-Apr-13 20:49pm
ant1488684

1 solution

Rate this: bad
good
Please Sign up or sign in to vote.

Solution 1

I found some non-validating parser: http://htmlcxx.sourceforge.net/[^], it solve the problem for me.
  Permalink  

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

  Print Answers RSS
0 OriginalGriff 381
1 Sergey Alexandrovich Kryukov 245
2 Marcin Kozub 225
3 Praneet Nadkar 217
4 /\jmot 189
0 OriginalGriff 8,284
1 Sergey Alexandrovich Kryukov 7,407
2 DamithSL 5,614
3 Maciej Los 4,989
4 Manas Bhardwaj 4,986


Advertise | Privacy | Mobile
Web01 | 2.8.1411023.1 | Last Updated 3 Oct 2013
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service
Layout: fixed | fluid

CodeProject, 503-250 Ferrand Drive Toronto Ontario, M3C 3G8 Canada +1 416-849-8900 x 100