Tags: C++, XML, HTML, Parsing
I am using libxml2 in my VS2010 project to build a tree from HTML, find some nodes, modify them, and dump the tree back to HTML.
The main logic is:
// create a push parser (no initial chunk, no filename)
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, XML_CHAR_ENCODING_UTF8);
// set parser options
htmlCtxtUseOptions(parser, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
// parse HTML from pData; terminate = 1 because this is the only (and last) chunk
htmlParseChunk(parser, pData, dataLen, 1);
// get root node of the generated tree
xmlNode* node = xmlDocGetRootElement(parser->myDoc);
// make some changes in the tree, e.g. change content of the node named 'title'
// ...
// serialize back to HTML; the result is in 'newHtml' (release it with xmlFree)
xmlChar* newHtml = NULL;
int len = 0;
htmlDocDumpMemory(parser->myDoc, &newHtml, &len);

When I use the HTML from http://www.youtube.com/watch?v=S77UrnEGs_g as input data, I get one extra closing tag in the output.

I checked the above URL with http://validator.w3.org/ and got this error:
Line 562, Column 31: Unclosed element div.
<div class="content">
My question is: can I configure libxml2 so that it does not automatically close unclosed tags?
Posted 1-Apr-13 20:49pm
ant1488698
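
A note on why the extra tag appears: as far as I know, a tree-building parser such as libxml2's HTML parser stores each element as a node, and the serializer writes an end tag for every non-void element node, so an unclosed tag in the input does not survive a parse-and-dump round trip. The following is only a minimal sketch of that effect using the standard library, not libxml2 code; `Node`, `element`, `text`, `serialize`, and `demo` are hypothetical names made up for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node: an element with children, or a text node.
struct Node {
    std::string name;  // element name; empty for text nodes
    std::string text;  // text content; used only by text nodes
    std::vector<std::unique_ptr<Node>> children;
};

// Helper: create an element node with the given name.
std::unique_ptr<Node> element(const std::string& name) {
    std::unique_ptr<Node> n(new Node());
    n->name = name;
    return n;
}

// Helper: create a text node.
std::unique_ptr<Node> text(const std::string& content) {
    std::unique_ptr<Node> n(new Node());
    n->text = content;
    return n;
}

// Serialize a subtree. Every element gets both a start tag and an end tag,
// because the tree no longer records whether the input had the end tag.
std::string serialize(const Node& n) {
    if (n.name.empty()) return n.text;
    std::string out = "<" + n.name + ">";
    for (const auto& child : n.children) out += serialize(*child);
    out += "</" + n.name + ">";
    return out;
}

// Builds the tree a lenient parser might produce for the malformed input
// "<body><div>hi</body>" (the div is never closed) and serializes it.
std::string demo() {
    std::unique_ptr<Node> body = element("body");
    std::unique_ptr<Node> div = element("div");
    div->children.push_back(text("hi"));
    body->children.push_back(std::move(div));
    return serialize(*body);
}
```

Calling `demo()` yields balanced markup even though the imagined input was not balanced, which is why an option to "not close tags" does not fit naturally into a tree-based round trip.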


Solution 1

I found a non-validating parser, htmlcxx (http://htmlcxx.sourceforge.net/), which solved the problem for me.
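
For simple edits such as changing the title, another way to avoid tag repair is to skip the tree round trip and patch the original bytes in place. The sketch below does not use htmlcxx or libxml2; `find_ci` and `replace_title` are hypothetical helpers written for this illustration. It assumes a single `<title>` element and does not account for comments, scripts, or other elements whose names start with "title".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive search for `needle` (given in lowercase) in `hay`,
// starting at offset `from`. Returns std::string::npos if not found.
static std::size_t find_ci(const std::string& hay, const std::string& needle,
                           std::size_t from) {
    if (needle.empty() || hay.size() < needle.size()) return std::string::npos;
    for (std::size_t i = from; i + needle.size() <= hay.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() &&
               std::tolower(static_cast<unsigned char>(hay[i + j])) == needle[j]) {
            ++j;
        }
        if (j == needle.size()) return i;
    }
    return std::string::npos;
}

// Replace the text between <title ...> and </title>.
// Everything outside that range is copied unchanged, so unclosed
// tags elsewhere in the document are neither repaired nor altered.
std::string replace_title(const std::string& html, const std::string& new_title) {
    const std::size_t open = find_ci(html, "<title", 0);
    if (open == std::string::npos) return html;   // no title element
    std::size_t start = html.find('>', open);
    if (start == std::string::npos) return html;  // malformed start tag
    ++start;                                      // first byte of the title text
    const std::size_t end = find_ci(html, "</title", start);
    if (end == std::string::npos) return html;    // no end tag; leave input as is
    return html.substr(0, start) + new_title + html.substr(end);
}
```

For example, `replace_title(page, "New title")` returns a copy of `page` with only the title text changed. The trade-off is that this handles only edits that can be located by simple scanning; anything structural still needs a real parser.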

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 3 Oct 2013