(Click for a circular layout).
This is about what you will be reading. Dare to ?
The text file using which the above graph was drawn is called
TagList.cpp and is packaged with the source in Part
2 of this series.
Optional Prerequisites
Following of this series, specially this article
will be much easier for those who know even a little about grammar and language
theory (not the
Queen's language or Wren and Martin kind, but the
Computer Science kind - often taught with Theory of Computation ;) ).
Otherwise, if you can make sense of things like production rules or
pushdown automata, then you should be editing this article !
Those interested in a working program, and code they can use
without going into theory, first try here.
Introduction
This article demonstrates a
simple class capable of
parsing a language consisting of usual nested statements within non-ambiguous
opening and closing tags.
This means that this class is capable of processing languages which have at
least the following characteristics:
- non-ambiguous opening and closing tags
- Statements within such tags
- Multi - level statements in nested form
As an example through out this article, in order to explain what our class is
doing, our choice of language will be HTML code as it is generally known and
understood, and has grammar common to most modern day languages.
This however, does not mean that this class is restricted to parsing
HTML ! No - I will just show how this class parses language, by using
HTML as an example.
As obvious now, the language specified is synonymous with most of what we use
today, for example :
- C/C++ has { and } to identify blocks
- XML has <..> and </..> to identify sections
- You start your e-mail with a "hi" and end with "bye" (you get the idea..)
However, even now, we have very little interest in developing software which
can do something more than plain old text searching, without building MBs of
metadata.
This class has been made keeping in mind that the structure (layout) of
the text being processed itself is enough information to allow for intelligent
analysis. At least, sufficient for our everyday needs.
In the process, you will come across a very powerful processor
which offers bucket loads of features. Please read Part 2 of this series for the implementation.
What's in the text ?
Remember what I said at point 1 ? non-ambiguous opening and closing
tags. Now when we say tag, we usually associate a few characters long
string.
However, in our case, a tag can mean any amount of text - a whole
paragraph to the whole of the Gift of the Magi ! What do I care, my
class supports it, and it will only consume ~32 bytes in each case !
Anyways, as I said, most of the time, we deal with documents/text which share
a layout/structure ("language") as that highlighted at the beginning. So when we
go back searching for that information, we really recall a little of that
structure or what was "in there" and try to feed that information into our
search tool to make it get the stuff for us.
Our common search tools usually do a simple "string match" based on the
information fed and turns up many matches besides the correct one. But then, we
end up wasting time going through that haystack before we get to retrieve the
needle !
The uniqueness of this class is that it inherently understands the
document structure and seeks out the target accurately.
That is done by identifying the different tags in the document,
building a layout of these tags in memory so as to mimic the document itself
without loading the whole of it at all, and representing all other
information in the document relative to these tags.
This also gives us the added advantage of being able to put the special
meaning conveyed by these tags to use.
No more searching for the needle in the haystack. Period.
This class allows you to study and process the information offered by the
structure of the document itself besides what the text reads. We want to
leverage more than what characters spell out in the files - we put the structure
of the text to use !
Terms used in this article (and the series)
But before we can proceed, let's clear out a few "terms" to be used for the
rest of the article.
- non-ambiguous opening and closing tags : 'Tags' are anything with
which you may want to associate a special meaning for a particular
purpose.
For example, HTML 'tags' do not show up on your browser - they have a special
meaning - they define the layout.
- "opening" and "closing" tags : "Opening tags" are those which signal
the start of a new (sub)sequence. "Closing tags" are those which signal the
closure of such. For example, the HTML spec. requires that a valid
document open with <html> and close with </html>
- "non-ambiguity" : This means that the above tags are unique enough so
that the parser/processor can easily make out an opening tag from a closing one,
without a complex state table. For example, the <b> and </b> are
non-ambiguous, and a simple string match can identify the former as an opening
tag and the latter as an closing tag.
(As an aside, the "
token which is usually used to denote a
string is ambiguous, as depending on the current state of the FSM,
it may denote an opening tag or a closing tag.
For example, in "Hello"
, the "
followed by the 'H'
is an opening tag, whereas, the "
following the 'o' is the closing
tag).
- statement : Any stuff that's not a tag.
- nesting : Enclosing tags within others is called
nesting. For example, the pair <Tag1>, </Tag1> is nested
within <Tag>, </Tag> in the following segment :
<Tag>
<Tag1>...</Tag1>
</Tag>
Thus, we will call <Tag>, </Tag> as the root tag
pair and <Tag1>, </Tag1> as the sub tag pair. In a
similar vein, <Tag>, </Tag> are at nesting level of 0,
whereas, <Tag1>, </Tag1> are at a nesting level of 1.
Our class can handle infinite levels of nesting.
Background
At this juncture, it maybe beneficial to discuss the
necessary data structures we will use to represent the document's information.
First of all, we need to represent the tags themselves just as they are
laid out in the original document. Thus our data structure should be as much
abstract and capable of representing any type of textual infomation.
Due to the nature of multi level nested tags, I thought it needful to
represent the tag layout using a tree like data structure.
-
Why a tree ?
For starters, let's consider we are to process the
following snippet :
<table>.
<table>..</table>
<table>...</table>
<table>....<table>.....</table></table>
</table>
Note that the <table>
tag followed by
single dot contains all the other tags. So we call it the root
tag. Hence, it's nesting level is zero.
Similarly, the <table>
containing '.....'
is
nested within a <table>
which is further nested within the
root <table>
tag. Hence, it's said to be nested at level two.
Now let's try to graphically represent this information using a tree (or a
forest really..) hierarchically :
Representing the example HTML using a tree.
As actually used in our program. (The gray cluster is actually the double
linked list)
DOT Source
Representation of the example HTML using the conventional tree.
A directed graph has been used.
DOT Source
Both the graphs represent the same information.
The number of dots in the nodes correspond directly to that of the text,
whereas the numbers in brackets correspond to the offsets of the opening
tags (i.e. the <table>
tag) in the text. The gray cluster
denotes the tag set which are at the same nesting level. Those of you who
know about trees, will like to note that each double linked list represent the
siblings under each group.
So, just by looking at the graph, we can immediately make out the
layout of the tags. We can at once infer that :
- The root node is the tag which contains a single dot, and starts at offset
0.
- There are 3 tags nested directly under it (and hence those tags are
at nesting level 1)
- .. and one of the above three (i.e. the tag at offset 111) further contains
a tag, nested within it, which starts at 122.
Similarly, using such a data structure will help writing code which can
easily infer this data by itself - just by looking at itself !
Next, we come to the issue of aggregating data which will allow us
to gauge the
nesting level of each tag, and how to determine which
closing tag couples with which
opening tag. I mean, we just cannot
take
any opening tag and couple it with the closing tag which comes next
!
This is where a stack like data type comes handy.
In fact, those of you who know about pushdown automata will appreciate
how the stack/pushdown store can help us recognize grammar of the
form anbn.
Now just reflect a minute - cannot we approximate nested tags of the form
(opening tag)n(closing tag)n ?
-
The Stack - a powerful tool for detecting nesting depth
Everybody is
introduced to two operations on their very first lecture on the stack - yes -
Push and Pop.
However, the stack I will be using comes with a little twist -
it actually uses a double linked list to simulate the stack, and
on a Pop operation, data is not actually popped off, but
the stack header just points to the next proper node, just as it would
have for a real stack.
But anyways, the net result is that it acts like a stack :=) What we
do is, everytime we come across an opening tag, we Push the tag
information down the stack, and everytime we come across a closing
tag we complete the tag information of the node currently at the
stack head (which was pushed earlier), and Pop it (provided the
stack was not empty).
What benefit does this approach give us ? Well, mainly :
- We can easily find out if the tags in the document are properly
nested.
if, after all tags have been parsed, we still do not have an empty stack,
means that there was atleast onee opening tag which did not have a
corresponding closing tag.
Such a case will arise for the following :
<html><html><html><html>
<body></html></html></html>
As we can see, there are more opening <html>
tags than
closing </html>
tags. Thus, the stack will not be empty after
all of the document has been parsed.
On the other hand, if popping of an empty stack was attempted, it implies
that there is atleast one closing tag which does not have a corresponding
opening tag.
Such a case will arise for the following :
<html></html><body></html>
As we can see, the last closing </html>
tag has no
matching opening tag. Thus, when the program comes across this text, it will
attempt to pop the empty stack - flagging an error.
For detailed discussions, see Part 2.
Not all tags are born equal
Ah well, not everyone gets the same
treatment you know..
Same is applicable here. However, to make complete sense of what I
shall say, let's first consider the structure of nodes of the forest.
Each node in the forest holds information about each tag pair. In our
implementation, each node is represented using an object of the structure
TagInformation
The following is the definition :
typedef struct structTagInformation
{
DWORD dwOpeningTagPosition;
DWORD dwOpeningTagLen;
DWORD dwClosingTagPosition;
DWORD dwClosingTagLen;
BOOL bIsGrouped;
DWORD dwEnclosedTagNumber;
DWORD dwNestingDepth;
CTagList* pctlSubTags;
...
} TagInformation;
A quick picture will do :
Here is a sample instance of struct
ure
TagInformation
:
Layout of TagInformation. DOT
source.
The field names are the struct
members.
In the picture above, we can see an instance of this structure holding
values. Now let's interpret that:
dwStartTagPosition
is 0
. This means that this tag
occurs at the start of the text.
- The
dwNestingDepth
value is also 0
. This
means that this tag is also a root tag.
- It's
dwNestingDepth
is 44
. This means that this
tag encloses 4 tags under it.
- The
bIsGrouped
is true
. Hence, we can trust that
the information held by this node is complete.
In other words ,this node holds complete details of a tag pair
whose :
- opening tag starts at 0 (the
dwOpeningTagPosition
value is 0)
- .. is 7 characters long (
dwOpeningTagLen
is 7)
- .. it's corresponding closing tag which starts at 152 (the
dwClosingTagPosition
value is 152)
- .. is 8 characters long (
dwClosingTagLen
value is
8).
If you have read closely, you will realize that this is
exactly the description of the first tag (and the corresponding closing
tag). Don't believe me ?
Here's a visual. Now note that the graph matches with the text as outlined
below :
What's the deal about tags being different ?
Well, now that we have
studied the guts of
TagInformation
, we must understand that
not
all tags require such detailed representation. From now on, we will call
such tags as
"no action" tags.
"No action" tag pairs are those tags within which any (text)
processing is skipped, and even if it's done, the results should be ignored.
For example, if you comment out a code using (/*
,
*/
) or (#if 0
,#endif
), and you are
searching for bugs in some code, you would really not want to get matches from
the commented code (as usually would happen with simple searches), but from the
active code !
(Yes, you may also resort to Regular Expressions - that too is fully
supported by this processor. However, see Part 3 of this series to understand
the difficulties posed with regular expressions to parse nesting
information).
But, really - do we need to have as many members as
TagInformation
has for these "no action" tags ?
Well, I thought not to, and created another specalized data structure
called structTagListLite
which is a "lite" version of
structTagList
(the double linked list with
TagInformation
instances as nodes).
This data structure is defined as :
struct structTagListLite
{
....
DWORD dwOpeningTagPosition;
DWORD dwClosingTagPosition;
DWORD dwOpeningTagLen;
DWORD dwClosingTagLen;
....
};
You must have noticed the liteness of this definition compared to
TagInformation
- for one thing, it has no provision for holding
information about nested tags.
This is because storing information of "no action" tags indented/nested
within a "no action" tag makes no sense, as by definition, we should "ignore"
anything that lies between a "no action" tag pair.
To elucidate, let's assume "no action" tags to be comments :
For HTML ("no action" tag pair for the following is
["<!--"
,"-->"
]):
<!--
And now we close the first comment
-->
Storing information about the second nested comment tag is redundant as
anything within the first tag is also a comment be default.
For C/C++ ("no action" tag pair for the following is
["//"
,"\n\r*"
]):
for(int i = 0; szA[i++];);
Now, regular search routines will give us both 'for' as matches, though the
second 'for' is in a "no action" region. This is not usually what you want
!
That's why till now, you have had to just wade through, ignoring such useless
matches, and painfully finding out what you wanted. No more searching within
searches!
You will surely appreciate this feature of "no action" tags more when you
work with more complex projects, where large chunks of code may be within
comments !
To support argument against supporting nesting information for "no
action" tags, I would like to say that, in some languages like C/C++, nested
comments are illegal.
In others, it's redundant.
Thus, by removing provisions for subtags, we increase efficiency
dramatically, as well as decrease complexity. Hence a separate, specialized
class.
There is no cause for alarm though - I do realize that taking
information/text inside "no action" tags might be useful in certain
situations. That's why I have included the ability to toggle the
processing of data within "no action" tags, as and when required. So how
about that !
Now you are ready for Part 2
What are you waiting for ? See
Part 2 for the implementation of the ideas developed
here. A nice MFC GUI Demo is waiting for you !
Credits
The initial draft was written by me, Subhobroto Sinha.
Thanks to the folks at CodeProject, specially Kamal Shankar (who had
submitted articles on text processing before), I figured out that article
writing was a lot different than arbitrarily jotting down your thoughts in
KWrite ;) Because of him you can enjoy the MFC GUI Demo presented at Part 2 of
this series, plus less bugs in the forest traversal routines !
This revision is wildly different (a lot shorter - the original draft had
redundant state diagrams and tables all over the place) and more professional
looking.
Thanks to those who suggested the original big draft to be broken into
two or more parts..
History
- 2004-08-20 : Second revision.
- 2004-07-18 : First revision.
He began programming since standard 8, about age of 14.
Looking back, being from a family of engineers, innocent inquisitiveness (and the destruction that comes along with it) came naturally to him (much to the dismay of others around who actually *owned* the property under inspection..).
Perhaps that only can explain why he plugged his father's HiFi's Line-In into the A.C Power socket in order to hear what 60Hz A.C. Might sound like. That research ended prematurely with fireworks originating from the amplifier, and needless to say, a very disturbed father.
That was at the age of 12. And there are reports - isolated incidents - of exploding Coke cans filled with hydrogen obtained from the electrolysis of salted water using a hand made full wave rectifier. Ah, and there are some other things better left unsaid..
Now that he has attained adulthood and realized the folly of his old ways, he just sits tight and writes code as long as he can keep his eyes open.