DetectNesting - Nesting aware language processor - Part I

Kamal Shankar

3.76/5 (8 votes)

Jul 26, 2004

14 min read

31795

387

This is the first introductory part of a series which presents a light weight, nesting aware document processor which understands regular expressions too ! But don't get scared with the opening graph !

DOT Source for the graph - 2 Kb

(Click for a circular layout).

This is about what you will be reading. Dare to ?

The text file using which the above graph was drawn is called TagList.cpp and is packaged with the source in Part 2 of this series.

Optional Prerequisites

Following of this series, specially this article will be much easier for those who know even a little about grammar and language theory (not the Queen's language or Wren and Martin kind, but the Computer Science kind - often taught with Theory of Computation ;) ).

Otherwise, if you can make sense of things like production rules or pushdown automata, then you should be editing this article !

Those interested in a working program, and code they can use without going into theory, first try here.

Introduction

This article demonstrates a simple class capable of parsing a language consisting of usual nested statements within non-ambiguous opening and closing tags.

This means that this class is capable of processing languages which have at least the following characteristics:

non-ambiguous opening and closing tags
Statements within such tags
Multi - level statements in nested form

As an example through out this article, in order to explain what our class is doing, our choice of language will be HTML code as it is generally known and understood, and has grammar common to most modern day languages.

This however, does not mean that this class is restricted to parsing HTML ! No - I will just show how this class parses language, by using HTML as an example.

As obvious now, the language specified is synonymous with most of what we use today, for example :

C/C++ has { and } to identify blocks
XML has <..> and </..> to identify sections
You start your e-mail with a "hi" and end with "bye" (you get the idea..)

However, even now, we have very little interest in developing software which can do something more than plain old text searching, without building MBs of metadata.

This class has been made keeping in mind that the structure (layout) of the text being processed itself is enough information to allow for intelligent analysis. At least, sufficient for our everyday needs.

In the process, you will come across a very powerful processor which offers bucket loads of features. Please read Part 2 of this series for the implementation.

What's in the text ?

Remember what I said at point 1 ? non-ambiguous opening and closing tags. Now when we say tag, we usually associate a few characters long string.

However, in our case, a tag can mean any amount of text - a whole paragraph to the whole of the Gift of the Magi ! What do I care, my class supports it, and it will only consume ~32 bytes in each case !

Anyways, as I said, most of the time, we deal with documents/text which share a layout/structure ("language") as that highlighted at the beginning. So when we go back searching for that information, we really recall a little of that structure or what was "in there" and try to feed that information into our search tool to make it get the stuff for us.

Our common search tools usually do a simple "string match" based on the information fed and turns up many matches besides the correct one. But then, we end up wasting time going through that haystack before we get to retrieve the needle !

The uniqueness of this class is that it inherently understands the document structure and seeks out the target accurately.

That is done by identifying the different tags in the document, building a layout of these tags in memory so as to mimic the document itself without loading the whole of it at all, and representing all other information in the document relative to these tags.

This also gives us the added advantage of being able to put the special meaning conveyed by these tags to use.

No more searching for the needle in the haystack. Period.

This class allows you to study and process the information offered by the structure of the document itself besides what the text reads. We want to leverage more than what characters spell out in the files - we put the structure of the text to use !

Terms used in this article (and the series)

But before we can proceed, let's clear out a few "terms" to be used for the rest of the article.

non-ambiguous opening and closing tags : 'Tags' are anything with which you may want to associate a special meaning for a particular purpose.
For example, HTML 'tags' do not show up on your browser - they have a special meaning - they define the layout.
- "opening" and "closing" tags : "Opening tags" are those which signal the start of a new (sub)sequence. "Closing tags" are those which signal the closure of such. For example, the HTML spec. requires that a valid document open with <html> and close with </html>
- "non-ambiguity" : This means that the above tags are unique enough so that the parser/processor can easily make out an opening tag from a closing one, without a complex state table. For example, the <b> and </b> are non-ambiguous, and a simple string match can identify the former as an opening tag and the latter as an closing tag.
  (As an aside, the " token which is usually used to denote a string is ambiguous, as depending on the current state of the FSM, it may denote an opening tag or a closing tag.
  For example, in "Hello", the " followed by the 'H' is an opening tag, whereas, the " following the 'o' is the closing tag).
statement : Any stuff that's not a tag.
nesting : Enclosing tags within others is called nesting. For example, the pair <Tag1>, </Tag1> is nested within <Tag>, </Tag> in the following segment :
```
<Tag>
 <Tag1>...</Tag1>
</Tag>
```
Thus, we will call <Tag>, </Tag> as the root tag pair and <Tag1>, </Tag1> as the sub tag pair. In a similar vein, <Tag>, </Tag> are at nesting level of 0, whereas, <Tag1>, </Tag1> are at a nesting level of 1.
Our class can handle infinite levels of nesting.

Background

At this juncture, it maybe beneficial to discuss the necessary data structures we will use to represent the document's information.

First of all, we need to represent the tags themselves just as they are laid out in the original document. Thus our data structure should be as much abstract and capable of representing any type of textual infomation.

Due to the nature of multi level nested tags, I thought it needful to represent the tag layout using a tree like data structure.

Why a tree ?
For starters, let's consider we are to process the following snippet :
```
<table>.
<table>..</table>

<table>...</table>
<table>....<table>.....</table></table>
</table>
```
Note that the <table> tag followed by single dot contains all the other tags. So we call it the root tag. Hence, it's nesting level is zero.
Similarly, the <table> containing '.....' is nested within a <table> which is further nested within the root <table> tag. Hence, it's said to be nested at level two.
Now let's try to graphically represent this information using a tree (or a forest really..) hierarchically :

Representing the example HTML using a tree.
As actually used in our program. (The gray cluster is actually the double linked list)
DOT Source

Representation of the example HTML using the conventional tree.
A directed graph has been used.
DOT Source

Both the graphs represent the same information.
The number of dots in the nodes correspond directly to that of the text, whereas the numbers in brackets correspond to the offsets of the opening tags (i.e. the <table> tag) in the text. The gray cluster denotes the tag set which are at the same nesting level. Those of you who know about trees, will like to note that each double linked list represent the siblings under each group.
So, just by looking at the graph, we can immediately make out the layout of the tags. We can at once infer that :
- The root node is the tag which contains a single dot, and starts at offset 0.
- There are 3 tags nested directly under it (and hence those tags are at nesting level 1)
- .. and one of the above three (i.e. the tag at offset 111) further contains a tag, nested within it, which starts at 122.
Similarly, using such a data structure will help writing code which can easily infer this data by itself - just by looking at itself !

Next, we come to the issue of aggregating data which will allow us to gauge the nesting level of each tag, and how to determine which closing tag couples with which opening tag. I mean, we just cannot take any opening tag and couple it with the closing tag which comes next !

This is where a stack like data type comes handy.

In fact, those of you who know about pushdown automata will appreciate how the stack/pushdown store can help us recognize grammar of the form aⁿbⁿ.

Now just reflect a minute - cannot we approximate nested tags of the form (opening tag)ⁿ(closing tag)ⁿ ?

The Stack - a powerful tool for detecting nesting depth
Everybody is introduced to two operations on their very first lecture on the stack - yes - Push and Pop.
However, the stack I will be using comes with a little twist - it actually uses a double linked list to simulate the stack, and on a Pop operation, data is not actually popped off, but the stack header just points to the next proper node, just as it would have for a real stack.
But anyways, the net result is that it acts like a stack :=) What we do is, everytime we come across an opening tag, we Push the tag information down the stack, and everytime we come across a closing tag we complete the tag information of the node currently at the stack head (which was pushed earlier), and Pop it (provided the stack was not empty).
What benefit does this approach give us ? Well, mainly :
- We can easily find out if the tags in the document are properly nested.
  if, after all tags have been parsed, we still do not have an empty stack, means that there was atleast onee opening tag which did not have a corresponding closing tag.
  Such a case will arise for the following :
```
<html><html><html><html>
<body></html></html></html>
```
  As we can see, there are more opening <html> tags than closing </html> tags. Thus, the stack will not be empty after all of the document has been parsed.
  On the other hand, if popping of an empty stack was attempted, it implies that there is atleast one closing tag which does not have a corresponding opening tag.
  Such a case will arise for the following :
```
<html></html><body></html>
```
  As we can see, the last closing </html> tag has no matching opening tag. Thus, when the program comes across this text, it will attempt to pop the empty stack - flagging an error.

For detailed discussions, see Part 2.

Not all tags are born equal

Ah well, not everyone gets the same treatment you know..

Same is applicable here. However, to make complete sense of what I shall say, let's first consider the structure of nodes of the forest.

Each node in the forest holds information about each tag pair. In our implementation, each node is represented using an object of the structure TagInformation

The following is the definition :

typedef struct structTagInformation
{
 DWORD dwOpeningTagPosition;
 DWORD dwOpeningTagLen;
 DWORD dwClosingTagPosition;
 DWORD dwClosingTagLen;
 BOOL bIsGrouped; //Denotes if this tag is grouped/complete/valid 
//for 'dwStartTagPosition' and 'dwEndTagPosition'.
 DWORD dwEnclosedTagNumber; //Denotes the number of tags that 
//this tag itself encloses. This should be 0 for the deepest 
//nested tag set, or if it's the one and only tag set
 DWORD dwNestingDepth; //Denotes the nesting depth of the tag. 
//Root tags have this value as 0. Defaults to '-1'

 CTagList* pctlSubTags; //A valid pointer to a CTagList object
//if the above >0. This leads down the forest.
...
} TagInformation;

A quick picture will do :

Here is a sample instance of structure TagInformation :

Layout of TagInformation. DOT source.

The field names are the struct members.

In the picture above, we can see an instance of this structure holding values. Now let's interpret that:

dwStartTagPosition is 0. This means that this tag occurs at the start of the text.
The dwNestingDepth value is also 0. This means that this tag is also a root tag.
It's dwNestingDepth is 44. This means that this tag encloses 4 tags under it.
The bIsGrouped is true. Hence, we can trust that the information held by this node is complete.
- opening tag starts at 0 (the dwOpeningTagPosition value is 0)
- .. is 7 characters long (dwOpeningTagLen is 7)
- .. it's corresponding closing tag which starts at 152 (the dwClosingTagPosition value is 152)
- .. is 8 characters long (dwClosingTagLen value is 8).

If you have read closely, you will realize that this is exactly the description of the first tag (and the corresponding closing tag). Don't believe me ?

Here's a visual. Now note that the graph matches with the text as outlined below :

What's the deal about tags being different ?

Well, now that we have studied the guts of TagInformation, we must understand that not all tags require such detailed representation. From now on, we will call such tags as "no action" tags.

"No action" tag pairs are those tags within which any (text) processing is skipped, and even if it's done, the results should be ignored.

For example, if you comment out a code using (/*, */) or (#if 0,#endif), and you are searching for bugs in some code, you would really not want to get matches from the commented code (as usually would happen with simple searches), but from the active code !

(Yes, you may also resort to Regular Expressions - that too is fully supported by this processor. However, see Part 3 of this series to understand the difficulties posed with regular expressions to parse nesting information).

But, really - do we need to have as many members as TagInformation has for these "no action" tags ?

Well, I thought not to, and created another specalized data structure called structTagListLite which is a "lite" version of structTagList (the double linked list with TagInformation instances as nodes).

This data structure is defined as :

struct structTagListLite
{
....
 DWORD dwOpeningTagPosition;
 DWORD dwClosingTagPosition;
 DWORD dwOpeningTagLen;
 DWORD dwClosingTagLen;
....
};

You must have noticed the liteness of this definition compared to TagInformation - for one thing, it has no provision for holding information about nested tags.

This is because storing information of "no action" tags indented/nested within a "no action" tag makes no sense, as by definition, we should "ignore" anything that lies between a "no action" tag pair.

To elucidate, let's assume "no action" tags to be comments :

For HTML ("no action" tag pair for the following is [""]):

<!--
This is the first comment
 <!-- This is another one nested within the first -->
And now we close the first comment
-->

Storing information about the second nested comment tag is redundant as anything within the first tag is also a comment be default.

For C/C++ ("no action" tag pair for the following is ["//","\n\r*"]):

//The following code will find out the length of a C string in 'i'
for(int i = 0; szA[i++];);
//for(int i; szA[i++];); //This code is just wrong, 
//so I commented it out and wrote the correct one above

Now, regular search routines will give us both 'for' as matches, though the second 'for' is in a "no action" region. This is not usually what you want !

That's why till now, you have had to just wade through, ignoring such useless matches, and painfully finding out what you wanted. No more searching within searches!

You will surely appreciate this feature of "no action" tags more when you work with more complex projects, where large chunks of code may be within comments !

To support argument against supporting nesting information for "no action" tags, I would like to say that, in some languages like C/C++, nested comments are illegal.

In others, it's redundant.

Thus, by removing provisions for subtags, we increase efficiency dramatically, as well as decrease complexity. Hence a separate, specialized class.

There is no cause for alarm though - I do realize that taking information/text inside "no action" tags might be useful in certain situations. That's why I have included the ability to toggle the processing of data within "no action" tags, as and when required. So how about that !

Now you are ready for Part 2

What are you waiting for ? See Part 2 for the implementation of the ideas developed here. A nice MFC GUI Demo is waiting for you !

Credits

The initial draft was written by me, Subhobroto Sinha.

Thanks to the folks at CodeProject, specially Kamal Shankar (who had submitted articles on text processing before), I figured out that article writing was a lot different than arbitrarily jotting down your thoughts in KWrite ;) Because of him you can enjoy the MFC GUI Demo presented at Part 2 of this series, plus less bugs in the forest traversal routines !

This revision is wildly different (a lot shorter - the original draft had redundant state diagrams and tables all over the place) and more professional looking.

Thanks to those who suggested the original big draft to be broken into two or more parts..

History

2004-08-20 : Second revision.
2004-07-18 : First revision.