HTML5 is the next major revision of the HTML standard. If all goes well, it should become the dominant markup in the near future, ousting both HTML4 and XHTML1 from their cozy spots. A lot of people say HTML5 is the next big thing. In one sense, yes; in another, no. HTML5 isn’t a brand-new markup language. It’s a specification that adds some features to, and removes some from, the existing HTML4 specification. It’s the next big thing in that it’s going to change the way we mark up our HTML pages: it adds more meaning to elements, making pages more semantic.
Apart from making the web more semantic, HTML5 will also standardize a lot of features across the major browsers. Finally, there are going to be elements that all browsers implement and that will hopefully behave the same way in each of them. No browser will be left out, including IE. Now the IE6 death countdown might even run faster. Check out the IE6 countdown website at: http://www.ie6countdown.com/.
OK, that’s HTML5. What of XHTML1 and HTML4? Do they still exist, and will they keep existing? They still hang around, and will for a while, until all the browsers are standardized and old browsers start to wither away.
All the HTML (and XHTML1) standards have parsers implemented in most of the non-trivial languages frequently used to power web applications. There are XHTML1 and HTML4 parsers in PHP, Ruby, C++, and others. Most of them use the libxml library, written in C, to build and traverse the DOM. It’s made for XML, so the parser is very strict. The documentation and code for libxml live at: http://xmlsoft.org/. So libxml is appropriate for parsing XML, but not for parsing the transitional versions of HTML or XHTML. It’s not even appropriate for HTML5, because HTML5 allows for some laxity on the side of the developer. That’s why there are parsers made specifically for HTML5, and why no XHTML or HTML4 parser can properly parse HTML5. It’s different and COOL!!
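To make that strictness concrete, here is a rough sketch using PHP’s DOMDocument, which sits on top of libxml. The helper name xmlParseErrors and the two sample strings are my own assumptions for illustration; they are not part of any library.

```php
<?php
// Hypothetical helper: returns the libxml error messages produced when
// $markup is parsed as XML. An empty array means it was well-formed.
function xmlParseErrors(string $markup): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadXML($markup);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Restore the error-handling mode the caller had.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Well-formed: every tag is closed.
$strict = "<html><body><p>Hello<br/>world</p></body></html>";
// Fine for a browser, but not well-formed XML: <br> and <p> are never closed.
$loose = "<html><body><p>Hello<br>world</body></html>";

echo count(xmlParseErrors($strict)), " error(s) for the strict markup\n";
echo count(xmlParseErrors($loose)), " error(s) for the loose markup\n";
```

The exact error messages depend on the libxml version installed, so the sketch only counts them.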
HTML5 includes some new tags in its spec: <article>, <aside>, <header>, <section>, to mention a few. All these tags make HTML pages more descriptive, and so make the web more semantic. They will make web bots easier to develop and deploy, because bots can now identify the different parts of a page and know what the data inside each element represents. They’ll know whether a piece of writing is a stand-alone article, whether it’s only tangentially related to its surrounding content, whether it’s a header section, and even how to outline the headers (using hgroup), and so on. For more information on the semantics of the new HTML5 tags and their use, please see Dive into HTML5. I think it’s a really practical and non-trivial guide to the new and emerging HTML5 specification.
These new tags alone could throw the existing HTML4 and XHTML parsers off the edge. But there’s more to complicate the work of an HTML5 parser. HTML5 is so FORGIVING! The <body> tags, which XHTML1 required and which almost every HTML4 page spelled out, are now IMPLIED! That means your HTML5 page need not use these pivotal tags at all. You can have a page that looks like this:
<li>My boy is coming
The above markup is represented in the DOM the same way as this:
<html><head></head><body><li>My boy is coming</li></body></html>
THOSE PIVOTAL ALL-IMPORTANT CAN'T-DO-WITHOUT TAGS ARE NOW IMPLIED.
It seems like a mess, but it's not. Since every page has to have these tags anyway, why not help the author by supplying them as the page is read into the DOM? HTML5 parsers have to handle this situation. Soon, you'll see how HTML5 parsers handle this weird syntax.
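Here is a small sketch of implied tags in action. One loud assumption: it uses DOMDocument::loadHTML(), which relies on libxml's lenient HTML parser rather than a real HTML5 parser, so the tree it builds may differ in detail from what the HTML5 spec prescribes. It only illustrates the idea that the parser supplies the missing wrapper elements.

```php
<?php
// Assumption: DOMDocument::loadHTML() stands in for a real HTML5 parser here.
libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML("<li>My boy is coming");

// The only element we actually wrote.
$li = $dom->getElementsByTagName("li")->item(0);

// Walk up from the <li> to the root, recording each ancestor's tag name.
$ancestors = [];
for ($node = $li->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
    $ancestors[] = $node->nodeName;
}

// Every name printed here belongs to an element the parser added for us.
echo implode(" < ", $ancestors), "\n";
echo $dom->saveHTML($dom->documentElement), "\n";
```

None of the ancestors listed by the loop appear in the input string; the parser created them while building the DOM.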
You might ask: what about HTML5 validation? Validation arguably matters less now, given how forgiving HTML5’s syntax is. What’s there to validate when your page doesn’t even need a root tag?
Some time ago, I was playing with some HTML parsers and comparing how they handle malformed HTML. My tests were written entirely in PHP. I fed some malformed markup to DOMDocument, html5lib, and the PHP Simple HTML DOM parser. The Simple HTML DOM parser is basically PHP’s DOMDocument on steroids: it allows for easy traversal of the DOM. For example, suppose you want to find all the image elements in an HTML page. Using the DOMDocument library in PHP, you would write something like:
$document->getElementsByTagName("img"); // returns a DOMNodeList of the <img> nodes in the DOM representation of your
// just-created HTML document ($document is a DOMDocument)
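For a fuller picture of the DOMDocument lookup, here is a self-contained sketch. The helper name imageSources and the sample markup are assumptions of mine, made up for illustration.

```php
<?php
// Hypothetical helper: returns the src attribute of every <img> in $markup.
function imageSources(string $markup): array
{
    // Silence libxml warnings about sloppy markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a DOMNodeList, which foreach can iterate.
    $sources = [];
    foreach ($document->getElementsByTagName("img") as $img) {
        $sources[] = $img->getAttribute("src");
    }
    return $sources;
}

$page = '<p>Two pictures: <img src="a.png" alt="A"> and <img src="b.png" alt="B"></p>';
print_r(imageSources($page));
```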
Using the Simple HTML DOM parser, the same lookup is a single call to find('img') on the parsed document.
I can’t show the whole rundown and results of the tests I ran on the HTML parsers, only part of it, and not in exact terms either. I had some strings containing well-formed HTML4 Transitional, XHTML Transitional, and HTML5, as well as malformed versions of each.
$first = "<html>
<h1>My First Heading</h1>
<p>My first paragraph.</p>";
$second = "<li>Tell me all about ya!";
$third = "<body>
<p>I was with you</p>";
Then I ran this markup through the HTML5 parser and the DOMDocument library in PHP like this:
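// Test 1: strict XML parsing.
$dom1 = new DOMDocument("1.0", "utf-8");
$dom1->formatOutput = true;
$dom1->loadXML($first); echo $dom1->saveXML(), "\n";
// Test 2: XHTML 1.0 Transitional document. createDocument() is an instance method
// and expects a DOMDocumentType built from the public and system identifiers.
$impl = new DOMImplementation();
$dom3 = $impl->createDocument(null, 'html',
    $impl->createDocumentType('html',
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"));
$dom3->formatOutput = true;
// loadHTML() and saveHTML() belong to the document, not to its root element.
$dom3->loadHTML($first); echo $dom3->saveHTML(), "\n";
// Test 3: HTML 4.01 Transitional document.
$dom2 = $impl->createDocument(null, 'html',
    $impl->createDocumentType('html',
        "-//W3C//DTD HTML 4.01 Transitional//EN",
        "http://www.w3.org/TR/html4/loose.dtd"));
$dom2->formatOutput = true;
$dom2->loadHTML($first); echo $dom2->saveHTML(), "\n";
// Test 4: html5lib's HTML5 parser.
$dom4 = HTML5_Parser::parse($first); echo $dom4->saveHTML(), "\n";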
The tests I actually ran were more numerous and more complex than this. I stripped some details to make them easier to follow. Plus, I ran them on the command line, which is why I use newlines rather than <br> tags to separate the individual tests.
So we love HTML5. It’s forgiving. It’s modern. It might eventually replace Flash. It’s already on our iPhones and other smartphones, and it’s implemented in all recent versions of the major browsers (including IE). We don’t need to validate our pages anymore because we know the browsers’ built-in parsers won’t spew out errors (good or bad thing? You be the judge…). We can start using it right away, even on older browsers (we can use Modernizr to detect whether an HTML5 feature is present in a browser, and html5shiv to make old versions of IE recognize the new elements). There are tools out there to help us handle old browsers! Ain’t that great? We’ve already started our tortuous journey to a more semantic yet forgiving web!