Nowadays everything is centered around the web. We are always either downloading or uploading some data. Our applications are also getting more chatty, since users want to synchronize their data. On the other side the update process is getting more and more direct, by placing volatile parts of our application in the cloud.
This trend has been gaining momentum for more than a decade. Nowadays having a well-designed webpage is the cornerstone of every company. The good thing is that HTML is fairly simple and even people without any programming knowledge at all can create a page. In the simplest case we just insert some text in a file and open it in the browser (probably after renaming the file to *.html).
To make a long story short: Even in our applications we sometimes might need to communicate with a webserver to deliver some HTML. This is all quite fine and solved by the framework. We have powerful classes that handle the whole communication by knowing the required TCP and HTTP actions. However, once we need to do some work on the document we are basically lost. This is where AngleSharp comes into play.
The idea for AngleSharp was born about a year ago (and I will explain in the next paragraphs why AngleSharp goes beyond HtmlAgilityPack or similar solutions). The main reason to use AngleSharp is to have access to the DOM as you would have in the browser. The only difference is that in this case you will use C# (or any other .NET language). There is another difference (by design): the names of the properties and methods have been changed from camel case to pascal case (i.e. the first letter is capitalized).
Therefore, before we go into the details of the implementation, we need to have a look at the long-term goals of AngleSharp. There are several:
- Parsers for HTML, XML, SVG, MathML and CSS
- Create CSS stylesheets / style rules
- Return the DOM of a document
- Run modifications on the DOM
- *Provide the basis for a possible renderer
The core parser is certainly the HTML5 parser. A CSS4 parser is a natural addition, since HTML documents contain stylesheet references and style attributes. Having another XML parser that complies with the current W3C specification is a good addition, since SVG and MathML (which can also occur in HTML documents) can be parsed as XML documents. The only difference lies in the generated document, which has different semantics and uses a different DOM.
The * point is quite interesting. The basic idea behind this is not that a browser written entirely in C# will be built upon AngleSharp (however, that could happen). The motivation here lies in creating a new, cross-platform UI framework, which uses HTML as a description language with CSS for styling. Of course this is quite an ambitious goal, and it will certainly not be solved by this library; however, this library would play an important role in the creation of this framework.
In the next couple of sections we will walk through some of the important steps in creating an HTML5 and a CSS4 parser.
Writing an HTML5 parser is much harder than most people think, since the HTML5 parser has to handle a lot more than just those angle brackets. The main issue is that a lot of edge cases arise with not well-defined documents / document fragments. Also formatting is not as easy as it seems, since some tags have to be treated differently than others.
All in all it would be possible to write such a parser without the official specification; however, one either has to know all edge cases (and manage to bring them into code or on paper) or the parser will simply only work on a fraction of all webpages.
Here the specification helps a lot and gives us the whole range of possible states with every mutation that is possible. The main workflow is quite simple: We start with a Stream, which could come directly from a file on the local machine, from the network or from already given data. This Stream is given to the preprocessor, which will control the flow of reading from the Stream and buffering already read contents.
Finally we are ready to hand in some data to the tokenizer, which transforms the data from the preprocessor to a sequence of useful objects. These temporary objects are then used to construct the DOM. The tree construction might have to switch the state of the tokenizer on several occasions.
The following image shows the general scheme that is used for parsing HTML documents.
In the following sections we will walk through the most important parts of the HTML5 parser implementation.
Having a working stream preprocessor is the basis for any tokenization process. The tokenization process is the basis for the tree construction, as we will see in the next section. What does the tokenization process do exactly? The tokenization process transforms the characters that have been processed by the input stream preprocessor to so-called tokens. Those tokens are objects, which are then used to construct the tree, which will be the DOM. In HTML there are not many different tokens. In fact we just have a few:
- Tag (with the name, an open/close flag, the tag's attributes and a self-closed flag)
- Doctype (with additional properties)
- Character (with the character payload)
- Comment (with the text payload)
The state-machine of the tokenizer is actually quite complicated, since there are many (legacy) rules that have to be respected. Also some of the states cannot be entered from the tokenizer alone. This is also kind of special in HTML as compared to most parsers. Therefore the tokenizer has to be open for changes, which will usually be initiated by the tree constructor.
The most used token is the character token. Since we might need to distinguish between single characters (for instance if we enter a <pre> element an initial line feed character has to be ignored) we have to return single character tokens. The initial tokenizer state is the PCData state. The method is as simple as the following:
HtmlToken Data(Char c)
{
    // ...
    var value = CharacterReference(src.Next);
    if (value == null)
        return HtmlToken.Character(Specification.AMPERSAND);
    // ...
}
There are some states which cannot be reached from the PCData state. For instance the Plaintext or RCData states can never be entered from the tokenizer alone. Additionally the Plaintext state can never be left. The RCData state is entered when the HTML tree construction detects e.g. a <title> or a <textarea> element. On the other side we also have a Rawtext state that could be invoked by e.g. a <noscript> element. We can already see that the number of states and rules is much bigger than we might initially think.
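The state switching described above can be sketched in a few lines. This is only an illustration under stated assumptions: the enum and the helper method are hypothetical names, not AngleSharp's actual types, and only the elements named in the text (plus <plaintext> itself) are covered.

```csharp
using System;

// Hypothetical names for illustration; AngleSharp's own types may look different.
enum TokenizerState { PCData, RCData, Rawtext, Plaintext }

static class TokenizerStateSwitch
{
    // Returns the state the tree constructor would request from the tokenizer
    // after inserting a start tag with the given name.
    public static TokenizerState AfterStartTag(string tagName, TokenizerState current)
    {
        // The Plaintext state can never be left, so any request is ignored.
        if (current == TokenizerState.Plaintext)
            return current;

        switch (tagName.ToLowerInvariant())
        {
            case "title":
            case "textarea":
                return TokenizerState.RCData;
            case "noscript": // only when scripting is enabled, per the HTML specification
                return TokenizerState.Rawtext;
            case "plaintext":
                return TokenizerState.Plaintext;
            default:
                return current;
        }
    }
}
```

The point of the sketch is that the decision lives outside the tokenizer: the tokenizer only exposes its state, and the tree constructor changes it.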
A quite important helper for the tokenizer (and other tokenizers / parsers in the library) is the SourceManager class. This class handles an incoming stream of (character) data. The definition is shown in the following image.
This helper is more or less a Stream handler, since it takes a Stream instance and reads it with a detected encoding. It is also possible to change the encoding during the reading process. In the future this class might change, since until now it is based on the TextReader class to read text with a given Encoding from a Stream. In the future it might be better to handle that with a custom class that supports reading backwards with a different encoding out of the box.
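To make the idea more concrete, here is a minimal sketch of such a helper. It is not the real SourceManager: the class name BufferedSource, its members and the EOF marker are assumptions made for this example. It only shows the two responsibilities mentioned above, reading a Stream with a detected encoding and buffering what has already been read so that we can step back.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for the SourceManager described in the text.
sealed class BufferedSource
{
    public const char EOF = '\uffff';   // assumed end-of-input marker

    readonly TextReader reader;
    readonly StringBuilder buffer = new StringBuilder();
    int position = -1;

    public BufferedSource(Stream stream, Encoding fallback)
    {
        // StreamReader honors a byte order mark and otherwise uses the fallback encoding.
        reader = new StreamReader(stream, fallback, true);
    }

    // The character at the current position (EOF before the first read or past the end).
    public char Current
    {
        get { return position >= 0 && position < buffer.Length ? buffer[position] : EOF; }
    }

    // Advances by one character; the stream is only touched when the buffer is exhausted.
    public char Next()
    {
        if (position + 1 >= buffer.Length)
        {
            int c = reader.Read();
            if (c == -1)
            {
                position = buffer.Length;
                return EOF;
            }
            buffer.Append((char)c);
        }

        position++;
        return buffer[position];
    }

    // Steps back into the already buffered content.
    public void Back()
    {
        if (position >= 0)
            position--;
    }
}
```

Changing the encoding while reading, as the real class allows, is deliberately left out of this sketch.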
Once we have a working stream of tokens we can start constructing the DOM. There are several lists we need to take care of:
- Currently open tags.
- Active formatting elements.
- Special flags.
The first list is quite obvious. Since we will open tags, which will include other tags, we need to memorize what kind of path we've taken along the road. The second one is not so obvious. It could be that currently open elements have some kind of formatting effect on inserted elements. Such elements are considered to be formatting elements. A good example would be the <b> tag (bold). Once it is applied all* contained elements will have bold text. There are some exceptions (*), but this is what makes HTML5 parsing non-trivial.
The third list is actually very non-trivial and impossible to reconstruct without the official specification. There are special cases for some elements in some scenarios. This is why the HTML5 parser distinguishes between <select> and several other sections. This differentiation is also required to determine if certain elements have to be auto-inserted. For instance the following snippet is automatically transformed:
The HTML parser does not recognize the <pre> tag as being a legal tag before the <body> tag. Thus a fallback is initialized, which first inserts the <html> tag and afterwards the <body> tag. Inserting the <body> tag directly within the <html> tag also creates an (empty) <head> element. Finally at the end of the file everything is closed, which implies that our <pre> node is also closed as it should be.
There are hard edge cases, which are quite suitable to test the state of the tree constructor. The following is a good test for finding out if the "Heisenberg algorithm" is working correctly and invoked in case of non-conforming usage of tables and anchor tags. The invocation should take place on inserting another anchor element.
<a href="a">a<table><a href="b">b</table>x
The resulting HTML DOM tree is given by the following snippet (without <body> etc. tags):
Here we see that the character b is taken out of the <table>. The hyperlink therefore has to start before the table and continue afterwards. This results in a duplication of the anchor tag. All in all those transformations are non-trivial.
Tables are responsible for some edge cases. Most of the edge cases are due to text having no cell environment within a table. The following example demonstrates this:
Here we have some text that does not have a <th> parent. The result is the following:
The whole text is moved before the actual <table> element. Additionally, since a <tr> element is defined without an enclosing table section, a <tbody> section is inserted.
Of course there is more than meets the eye. A big part of valid HTML5 parsing goes into error correction and constructing tables. Also the formatting elements have to fulfill some rules. Everyone who is interested in the details should take a look at the code. Even though the code might not be as readable as usual LOB application code, it should still be possible to read it with the appropriate comments and the inserted regions.
A very important point was to integrate unit tests. Due to the complicated parser design most of the work has not been dictated by the paradigm of TDD, however, in some parts tests have been placed before any line of code has been written. All in all it was important to place a wide range of unit tests. The tree constructor of the HTML parser is one of the primary goals of the testing library.
Also the DOM objects have been subject to unit tests. The main objective here was to ensure that these objects are working as expected. This means that errors are only thrown on the defined illegal operations and that integrated binding capabilities are functional. Such errors should never occur during the parsing process, since the tree constructor is expected to never try an illegal operation.
Another testing environment has been set up with the AzureWebState project, which aims to crawl webpages from a database. This makes it easy to spot a severe problem with the parser (like OutOfMemoryException) or potential performance issues.
Reliability tests are not the only kind of tests we are interested in. If we need to wait too long for the parsing result we might be in trouble. Modern web browsers require between 1ms and 100ms for webpages. Hence everything that goes beyond 100ms has to be optimized. Luckily we have some great tools. Visual Studio 2012 provides a great tool for analyzing performance, however, for me in some scenarios PerfView seems to be the best choice (it works across the whole machine and is independent of VS).
A quick look at the memory consumption gives us some indicators that we might want to do something about allocating all those HtmlCharacterToken instances. Here a pool for character tokens could already be very beneficial. However, a first test showed that the impact on performance (in terms of speed of processing) is negligible.
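One simple way to avoid most of these allocations can be sketched as follows. Note that this swaps the pool for an even simpler technique, a cache of shared immutable tokens, and that CharToken and CharTokenCache are hypothetical names; the real HtmlCharacterToken is not shown in this article. The approach only works if the tokens are never modified after creation.

```csharp
using System;

// Hypothetical immutable token type used only for this sketch.
sealed class CharToken
{
    public readonly char Data;

    public CharToken(char data)
    {
        Data = data;
    }
}

static class CharTokenCache
{
    // One shared token per ASCII character, created once.
    static readonly CharToken[] ascii = CreateAscii();

    static CharToken[] CreateAscii()
    {
        var tokens = new CharToken[128];

        for (int i = 0; i < tokens.Length; i++)
            tokens[i] = new CharToken((char)i);

        return tokens;
    }

    // Returns the shared instance for ASCII input and allocates only for everything else.
    public static CharToken Get(char c)
    {
        return c < 128 ? ascii[c] : new CharToken(c);
    }
}
```

Since most markup consists of ASCII characters, such a cache removes the bulk of the allocations, even though (as noted above) the effect on processing speed may be small.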
There are already some CSS parsers out there, some of them written in C#. However, most of them perform just a really simple parsing, without evaluating selectors, or ignore the specific meaning of a certain property or value. Also most of them are way below CSS3 or do not support any @-rule (like namespace, import, ...) at all.
Since HTML is using CSS as its layout / styling language it was quite natural to integrate CSS directly. There are several places where this has been proven to be very useful:
- Selectors are required for methods like QuerySelectorAll.
- Every element can have a style attribute, which has a non-string DOM representation.
- The stylesheet(s) is / are considered by the DOM directly.
- The <style> element has a special meaning for the HTML parser.
At the moment external stylesheets will not be parsed directly. The reason is quite simple: AngleSharp should require the least amount of external references. In the most ideal case AngleSharp should be easy to port (or even to exist) as a portable class library (where the intersection would be between "Metro", "Windows Phone" and "WPF"). This might not be possible at the moment, due to using TaskCompletionSource at certain points, but this is actually the reason why the whole library is not decorated with Task instances or even async keywords all over the place.
The CSS tokenizer is not as complicated as the HTML one. What makes the CSS tokenizer somewhat complex is that it has to handle a lot more types of tokens. In the CSS tokenizer we have:
- String (either in single or double quotes)
- Url (a string in the url() function)
- Hash (mostly for selectors like #abc or similar, usually not for colors)
- AtKeyword (used for @-rules)
- Ident (any identifier, i.e. used in selectors, specifiers, properties or values)
- Function (functions are mostly found in values, sometimes in rules)
- Number (any number like 5 or 5.2 or 7e-3)
- Percentage (a special kind of dimension value, e.g. 10%)
- Dimension (any dimensional number, e.g. 5px, 8em or 290deg)
- Range (range values create a range of unicode values)
- Cdo (a special kind of open comment, i.e. <!--)
- Cdc (a special kind of close comment, i.e. -->)
- Column (personally I've never seen this in CSS: ||)
- Delim (any delimiter like a comma or a single hash)
- IncludeMatch (the include match ~= in an attribute selector)
- DashMatch (the dash match |= in an attribute selector)
- PrefixMatch (the prefix match ^= in an attribute selector)
- SuffixMatch (the suffix match $= in an attribute selector)
- SubstringMatch (the substring match *= in an attribute selector)
- NotMatch (the not match != in an attribute selector)
- RoundBracketOpen and RoundBracketClose
- CurlyBracketOpen and CurlyBracketClose
- SquareBracketOpen and SquareBracketClose
- Colon (colons separate names from values in properties)
- Comma (used to separate various values or selectors)
- Semicolon (mainly used to end a declaration)
- Whitespace (most whitespaces have only separation reasons - meaningful in selectors)
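To illustrate how closely related some of these token types are, the following sketch classifies a numeric piece of text as Number, Percentage or Dimension. It is a simplified example under stated assumptions: the type names are hypothetical, a regular expression is used instead of the character-by-character state machine a real tokenizer would use, and escapes as well as other edge cases are ignored.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical names; the real tokenizer works on single characters.
enum NumericKind { None, Number, Percentage, Dimension }

static class NumericTokens
{
    // A number (optional sign, digits, optional fraction and exponent),
    // optionally followed by a percent sign or a unit.
    static readonly Regex pattern = new Regex(
        @"^(?<num>[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?)(?<unit>%|[a-zA-Z]+)?$");

    public static NumericKind Classify(string text, out string unit)
    {
        var match = pattern.Match(text);
        unit = match.Success ? match.Groups["unit"].Value : "";

        if (!match.Success)
            return NumericKind.None;

        if (unit == "")
            return NumericKind.Number;

        return unit == "%" ? NumericKind.Percentage : NumericKind.Dimension;
    }
}
```

An input like 8em shows why the order of checks matters: the e must not be mistaken for the start of an exponent, which is why the exponent part requires at least one digit.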
The CSS tokenizer is a simple stream based tokenizer, which returns an iterator of tokens. This iterator can then be used. Every method in the CssParser class takes such an iterator. The great advantage of using iterators is that we can basically use any token source. For instance we could use another method to generate a second iterator based on the first one. This method would only iterate over a subset (like the contents of some curly brackets). The great advantage is that both streams advance, but we do not have to deal with complicated token management.
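The idea of a second iterator that is restricted to a block can be sketched with a C# iterator method. The token model below is a hypothetical, heavily reduced one, and the method is an illustration of the principle rather than the library's implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, reduced token model for this sketch.
enum TokType { Ident, CurlyOpen, CurlyClose, Other }

sealed class Tok
{
    public readonly TokType Type;
    public readonly string Text;

    public Tok(TokType type, string text)
    {
        Type = type;
        Text = text;
    }
}

static class BlockIterator
{
    // Expects source.Current to be the opening bracket. Yields everything up to,
    // but not including, the matching closing bracket. The shared source ends up
    // positioned on that closing bracket.
    public static IEnumerable<Tok> LimitToCurrentBlock(IEnumerator<Tok> source)
    {
        int depth = 1;

        while (source.MoveNext())
        {
            var token = source.Current;

            if (token.Type == TokType.CurlyOpen)
                depth++;
            else if (token.Type == TokType.CurlyClose && --depth == 0)
                yield break;

            yield return token;
        }
    }
}
```

Because the inner sequence pulls its tokens from the outer enumerator, the outer one advances automatically. One caveat when trying this with a List<T>: its enumerator is a struct, so it should be stored in a variable of type IEnumerator<Tok> first, otherwise a copy would be advanced.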
Hence appending rules is as easy as the following code snippet:
void AppendRules(IEnumerator<CssToken> source, List<CSSRule> rules)
Here we just ignore some tokens. In the special case of an at-keyword we start a new @-rule, otherwise we assume that a style rule has to be created. Style rules start with a selector as we know. A valid selector places more constraints on the possible input tokens, but in general takes any tokens as input.
Quite often we want to skip any whitespaces to come from the current position to the next position. The following snippet allows us to do that:
static Boolean SkipToNextNonWhitespace(IEnumerator<CssToken> source)
if (source.Current.Type != CssTokenType.Whitespace)
Additionally we also get the information if we reached the end of the token stream.
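A complete, self-contained version of such a helper could look like the following sketch. The token types are hypothetical and reduced to the bare minimum; only the Whitespace check mirrors the fragment shown above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, reduced token model; the real CssToken carries more data.
enum SimpleTokenType { Whitespace, Ident, Colon }

sealed class SimpleToken
{
    public readonly SimpleTokenType Type;

    public SimpleToken(SimpleTokenType type)
    {
        Type = type;
    }
}

static class TokenHelpers
{
    // Advances past any whitespace. Returns true when positioned on a
    // non-whitespace token and false when the end of the stream was reached.
    public static bool SkipToNextNonWhitespace(IEnumerator<SimpleToken> source)
    {
        while (source.MoveNext())
        {
            if (source.Current.Type != SimpleTokenType.Whitespace)
                return true;
        }

        return false;
    }
}
```

The Boolean return value lets the caller combine skipping and the end-of-stream check in a single condition.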
The stylesheet is then created with all the information. Right now special rules like the CSSImportRule are parsed correctly but ignored afterwards. This has to be integrated at some point in the future.
Additionally we only get a very generic (and meaningless) property called CSSProperty. In the future the generic property will only be used for unknown (or obsolete) declarations, while more specialized properties will be used for meaningful declarations like color: #f00 or font-size: 10pt. This will then also influence the parsing of values, which must take the required input type into consideration.
Another point is that CSS functions (besides url()) are not included yet. However, these are quite important, since the attr() functions are getting used more and more these days. Additionally hsla() or others are mandatory.
Once we hit an at-rule we basically need to parse special cases for special rules. The following code snippet describes this:
CSSRule CreateAtRule(IEnumerator<CssToken> source)
{
    var name = ((CssKeywordToken)source.Current).Data;
    // ...
    switch (name)
    {
        case CSSMediaRule.RuleName: return CreateMediaRule(source);
        case CSSPageRule.RuleName: return CreatePageRule(source);
        case CSSImportRule.RuleName: return CreateImportRule(source);
        case CSSFontFaceRule.RuleName: return CreateFontFaceRule(source);
        case CSSCharsetRule.RuleName: return CreateCharsetRule(source);
        case CSSNamespaceRule.RuleName: return CreateNamespaceRule(source);
        case CSSSupportsRule.RuleName: return CreateSupportsRule(source);
        case CSSKeyframesRule.RuleName: return CreateKeyframesRule(source);
        default: return CreateUnknownRule(name, source);
    }
}
Let's see how the parsing for the CSSFontFaceRule is implemented. Here we see that we push the font-face rule to the stack of open rules for the duration of the process. This ensures that every rule gets the right parent rule assigned.
CSSFontFaceRule CreateFontFaceRule(IEnumerator<CssToken> source)
{
    var fontface = new CSSFontFaceRule();
    fontface.ParentStyleSheet = sheet;
    fontface.ParentRule = CurrentRule;
    // ...
    if (source.Current.Type == CssTokenType.CurlyBracketOpen)
    {
        // ...
        var tokens = LimitToCurrentBlock(source);
        // ...
    }
    // ...
}
Additionally we use the LimitToCurrentBlock method to stay within the current curly brackets. Another thing is that we re-use the AppendDeclarations method to append declarations to the given font-face rule. This is no general rule, since e.g. a media rule will contain other rules instead of declarations.
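The stack of open rules mentioned above can be illustrated in isolation. In the following sketch Rule and RuleBuilder are hypothetical types, and the body of a rule is represented by a simple callback; it only demonstrates how a rule stays open while its content is parsed, so that nested rules pick it up as their parent.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal rule type for this sketch.
sealed class Rule
{
    public readonly string Name;
    public Rule Parent;
    public readonly List<Rule> Children = new List<Rule>();

    public Rule(string name)
    {
        Name = name;
    }
}

sealed class RuleBuilder
{
    readonly Stack<Rule> open = new Stack<Rule>();

    // The innermost rule that is currently being parsed, or null at the top level.
    Rule CurrentRule
    {
        get { return open.Count > 0 ? open.Peek() : null; }
    }

    // Creates a rule, links it to the innermost open rule and keeps it open
    // while 'body' runs, so that nested rules see it as their parent.
    public Rule Create(string name, Action body)
    {
        var rule = new Rule(name) { Parent = CurrentRule };

        if (rule.Parent != null)
            rule.Parent.Children.Add(rule);

        open.Push(rule);

        try
        {
            if (body != null)
                body();
        }
        finally
        {
            // Always close the rule, even if parsing the body failed.
            open.Pop();
        }

        return rule;
    }
}
```

For example, creating a media rule whose body creates a font-face rule yields a font-face rule with the media rule as its parent, while a rule created afterwards is a top-level rule again.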
A very important class of tests is represented by CSS selectors. Since these selectors are used on many occasions (in CSS, for querying the document, ...) it was very important to include a set of useful unit tests. Luckily the people who maintain the Sizzle selector engine (which is primarily used in jQuery) solved this problem already.
These tests look like the following three samples:
public void IdSelectorWithElement()
var result = RunQuery("div#myDiv");
public void PseudoSelectorOnlyChild()
public void NthChildNoPrefixWithDigit()
var result = RunQuery(":nth-child(2)");
So we compare known results with the result of our evaluation. Additionally we also care about the order of the results. This means that the tree walker is doing the right thing.
The whole project would be quite useless without returning an object representation of the given HTML source code. Obviously we have two options:
- Defining our own format / objects
- Using the official specification
Due to the project's goal the decision was quite obvious: The created objects should have a public API that is identical / very similar to the official specification. Users of AngleSharp will therefore have several advantages:
- The learning curve is non-existing for people who are familiar with the DOM
- Users who are not familiar with the HTML DOM will also learn something about the HTML DOM
- Other users will probably learn something as well, since everything can be accessed by intellisense
The last point is quite important here. A huge effort of the project went into (beginning to do a little bit of) writing something that represents a suitable documentation of the whole API and functions. Therefore enumerations, properties and methods, along with classes and events are documented. This means that a variety of learning possibilities is available.
Additionally all DOM objects will be decorated with a special kind of attribute, called DOMAttribute or simply DOM. This attribute could help to find out which objects (additionally to the most common types like
The attribute also decorates properties and methods. A special kind of property is an indexer. Most indexers are named
The basic DOM structure is displayed in the next figure.
It was quite difficult to find a truly complete reference. Even though the W3C creates the official standard, it is often in contradiction with itself. The problem is that the current specification is DOM4. If we take a look into any browser we will see that either not all elements from there are available, or that other elements are available in addition. Using DOM3 as a reference point therefore makes more sense.
AngleSharp tries to find the right balance. The library contains most of the new API (even though not everything is implemented right now, e.g. the whole event system or the mutation objects), but also contains everything from DOM3 (or previous versions) that has been implemented and used across all major browsers.
The whole project has to be designed with performance in mind; however, this means that sometimes not very beautiful code can be found. Also everything has been programmed as close as possible to the specification, which has been the primary goal. The first objective was to apply the specification and create something that is working. After this had been achieved some performance optimization was applied. In the end we can see that the whole parser is actually quite fast compared to the ones known from the big browsers.
A big performance issue is the actual startup time. Here the JIT process is not only compiling the MSIL code to machine code, but also performing (necessary) optimizations. If we start some sample runs we can immediately see that the hot paths are not optimized at all. The next screenshot shows a typical run.
- The performance of our CSS tokenizer.
- The performance of our Selector creator.
- The performance of our tree walker.
- The reliability of our CSS Selectors.
- The reliability of our node tree.
Caution: This result should not convince you that C# / our implementation is faster than Opera / any browser, but that the performance is at least in a solid area. It should be noted that browsers are usually much more streamlined and probably faster; however, the performance of AngleSharp is quite acceptable.
In total we can say that the performance is already quite OK, even though no major efforts have been put into performance optimization. For documents of modest sizes we will be certainly far below 100ms and eventually (enough warm-up, document size, CPU speed) come close enough to 1ms.
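Since the warm-up has such a big influence on the numbers, any measurement should separate it from the timed runs. The following helper is a minimal sketch of that idea; the class and method names are made up for this example, and the action to measure (e.g. parsing a document) is passed in by the caller.

```csharp
using System;
using System.Diagnostics;

static class Benchmark
{
    // Runs 'action' a few times untimed, so that the JIT has already compiled
    // the hot path, and then returns the average wall-clock time per run in ms.
    public static double AverageMilliseconds(Action action, int warmup, int runs)
    {
        if (runs <= 0)
            throw new ArgumentOutOfRangeException("runs");

        for (int i = 0; i < warmup; i++)
            action();

        var watch = Stopwatch.StartNew();

        for (int i = 0; i < runs; i++)
            action();

        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / runs;
    }
}
```

Comparing the result for a warm-up count of zero with a result after a few warm-up runs gives a rough impression of how much of the time is startup cost.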
Using the code
The easiest way to get AngleSharp is by using NuGet. The link to the NuGet package is at the end of the article (or just search for AngleSharp in the official feed of the NuGet package manager).
The solution that is available on the GitHub repository also contains a WPF application called Samples. This application looks like the following image:
Every sample uses the HTMLDocument instance in another way. The basic way of getting the document is quite easy:
async Task LoadAsync(String url, CancellationToken cancel)
{
    var http = new HttpClient();
    var uri = Sanitize(url);
    var request = await http.GetAsync(uri);
    var response = await request.Content.ReadAsStreamAsync();
    var document = DocumentBuilder.Html(response);
    // ...
}
At the moment four sample usages are described. The first is a DOM-Browser. The sample creates a WPF treeview that can be navigated. The TreeView control contains all enumerable children and DOM properties of the document. The document is the HTMLDocument instance that has been received from the given URL.
Reading out these properties can be achieved with the following code. Here we assume that element is the current object in the DOM tree (e.g. the root element of a document like the HTMLHtmlElement or attributes like
var type = element.GetType();
var typeName = FindName(type);
var properties = type.GetProperties(BindingFlags.Public | BindingFlags.Instance | BindingFlags.GetProperty)
    .Where(m => m.GetCustomAttributes(typeof(DOMAttribute), false).Length > 0)
    .OrderBy(m => m.Name);

// Add every property that is decorated with the DOM attribute
foreach (var property in properties)
{
    // ...
    children.Add(new TreeNodeViewModel(property.GetValue(element), FindName(property), this));
}

// Add the items of enumerable objects, labeled with their index
if (element is IEnumerable)
{
    var collection = (IEnumerable)element;
    var index = 0;

    foreach (var item in collection)
    {
        children.Add(new TreeNodeViewModel(item, "[" + index.ToString() + "]", this));
        index++;
    }
}
Hovering over an element that does not contain items usually yields its value (e.g. a property that represents an int value would display the current value) as a tooltip. Next to the name of the property the exact DOM type is shown. The following screenshot shows this part of the sample application.
The renderer sample might sound interesting in the beginning, but in fact it just uses the WPF FlowDocument in a very rudimentary way. The output is actually not very readable and far away from the rendering that is done in other solutions (e.g. the HTMLRenderer project on CodePlex).
Nevertheless the sample shows how one could use the DOM to get information about various types of objects and use their information. As a little gimmick <img> tags are rendered as well, putting at least a little bit of color into the renderer. The screenshot has been taken while being on the English version of the Wikipedia homepage.
Much more interesting is the statistics sample. Here we gather data from the given URL. There are four statistics available, which might be more or less interesting:
- The top-8 elements (most used)
- The top-8 class names (most used)
- The top-8 attributes (most used)
- The top-8 words (most used)
The core of the statistics demo is the following snippet:
void Inspect(Element element, Dictionary<String, Int32> elements, Dictionary<String, Int32> classes, Dictionary<String, Int32> attributes)
foreach (var cls in element.ClassList)
foreach (var attr in element.Attributes)
foreach (var child in element.Children)
Inspect(child, elements, classes, attributes);
This snippet is first used on the root element of the document. From this point on it will recursively call the method on its child elements. Later on the dictionaries can be sorted and evaluated using LINQ.
Additionally we perform some statistics on the text content in form of words. Here any word has to be at least 2 letters. For this sample OxyPlot has been used to display the pie charts. Obviously CodeProject likes to use anchor tags (who doesn't?) and a class called t (in my opinion very self-explanatory name!).
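The word statistic and the later evaluation with LINQ can be sketched as follows. This is not the code of the sample application: the class name is made up, and treating words case-insensitively as well as breaking ties alphabetically are assumptions of this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordStatistics
{
    // Counts words of at least two letters (case-insensitive, an assumption of this sketch).
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in Regex.Matches(text, @"\p{L}{2,}"))
        {
            int current;
            counts.TryGetValue(match.Value, out current);
            counts[match.Value] = current + 1;
        }

        return counts;
    }

    // Returns the 'count' most frequent entries, most frequent first.
    public static List<KeyValuePair<string, int>> Top(Dictionary<string, int> counts, int count)
    {
        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
            .Take(count)
            .ToList();
    }
}
```

The same Top method works for the dictionaries of elements, classes and attributes that are filled by the Inspect method, e.g. with a count of 8 for the charts described above.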
The final sample shows the usage of the DOM method querySelectorAll. Following the C# naming convention, we use it here as QuerySelectorAll. The list of elements is filtered as one enters the selector in the TextBox element. The background color of the box indicates the status of the query - a red box tells us that an exception would be thrown due to a syntax error in the query.
The code is quite easy. Basically we take the document instance and call the QuerySelectorAll method with a selector string (like
var elements = document.QuerySelectorAll(query);
foreach (var element in elements)
Result = elements.Length;
Finally we take the list of elements (QuerySelectorAll gives us an HTMLCollection (which is a list of Element instances), while QuerySelector only returns one element or null) and push it to the observable collection of the viewmodel.
Points of Interest
I think that having a well-maintained DOM implementation in C# is definitely something nice to have for the future. I am currently busy doing other things, but this is a kind of project I will definitely pursue for the next couple of years.
This being said I hope that I could gain a little bit of attention and that some folks would be interested in committing some code to the project. It would be really nice to get a nice and clean (and as perfect as possible) HTML parser implementation in C#.
The whole work would not have been possible without outstanding documentation and specification supplied by the W3C. I have to admit that some documents seem to be not very useful or just outdated, while others are perfectly fine and up to date. It's also very important to question some points there, since (mostly only very small) mistakes can be found as well.
This is a list of my personally most used (W3C and WHATWG) documents:
Of course there are several other documents that have been useful (all of them supplied by the W3C or the WHATWG), but the list above is a good starter.
Additionally the following links might be helpful:
- v1.0.0 | Initial Release | 19.06.2013