
Fast and Compact HTML/XML Scanner/Tokenizer

10 Oct 2007, BSD license, 2 min read
HTML/XML scanner/tokenizer, also known as a pull parser

Introduction

The code presented here implements an HTML and XML scanner (or tokenizer). Suppose you have some XML or HTML text and you only need to find a particular word, tag or attribute in it. For such simple tasks, a full-blown "DOM compiler" or a SAX-like parser is overkill; the markup::scanner described below is enough. Features of markup::scanner include:

  1. It does not allocate any memory at all while scanning.
  2. It is written in pure C++ and does not require the STL or any other toolkit or library.
  3. It is fast. We measured scanning speeds of nearly 40 MB of XML per second (this depends on your hardware, of course).
  4. It is simple.
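
Since the throughput figure depends on hardware, it may be worth measuring it on your own machine. The following is only a rough sketch of how that could be done: `megabytes_per_second` is a hypothetical helper (not part of the scanner), and `scan_all` stands for whatever code consumes the whole input, for example a loop that calls get_token() until the end of the input.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper, not part of markup::scanner.
// Runs scan_all() once and returns the throughput in megabytes per second.
// scan_all is any callable that consumes the whole input.
template <typename ScanFn>
double megabytes_per_second(std::size_t input_bytes, ScanFn scan_all)
{
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    scan_all();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    if (elapsed.count() <= 0.0)
        return 0.0; // too fast to measure with this clock
    return (input_bytes / (1024.0 * 1024.0)) / elapsed.count();
}
```

A single run over a small input is noisy, so a realistic measurement would scan a large document several times and keep the best result.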

How to Use

I think the best way to explain is to show an example. First, we need to declare the input stream for the scanner. Here is an example of a simple string-based stream:

C++
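#include <cstring>  // strlen

// Input stream over a null-terminated string.
struct str_istream: public markup::instream
{
    const char* p;   // current read position
    const char* end; // one past the last character
    str_istream(const char* src): p(src), end(src + strlen(src)) {}
    // Returns the next character, or 0 at the end of the input.
    // The cast keeps bytes above 0x7F from being sign-extended.
    virtual wchar_t get_char() { return p < end? (unsigned char)*p++: 0; }
};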

This is all that we need in order to write the program which will, let's say, print out all of the tokens in the input HTML:

C++
#include <cstdio>  // printf

int main(int argc, char* argv[])
{
    str_istream si("<html><body><p align=right"
        " dir='rtl'> Begin &amp; back </p>" "</body></html>");
    markup::scanner sc(si);
    while(true)
    {
        int t = sc.get_token(); // pull the next token from the stream
        switch(t)
        {
            case markup::scanner::TT_ERROR:
                printf("ERROR\n");
                break;
            case markup::scanner::TT_EOF:
                printf("EOF\n");
                goto FINISH; // leave the loop from inside the switch
            case markup::scanner::TT_TAG_START:
                printf("TAG START:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_TAG_END:
                printf("TAG END:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_ATTR:
                // %S prints the wide-character attribute value
                printf("\tATTR:%s=%S\n", sc.get_attr_name(), sc.get_value());
                break;
            case markup::scanner::TT_WORD:
            case markup::scanner::TT_SPACE:
                printf("{%S}\n", sc.get_value());
                break;
        }
    }
FINISH:
    printf("--------------------------\n");
    return 0;
}

As you can see, the method that does the main job here is markup::scanner::get_token(). It scans the input stream and returns a markup::scanner::token_type value:

C++
enum token_type 
{
    TT_ERROR = -1,
    TT_EOF = 0,

    TT_TAG_START,   // <tag ...
                    //     ^-- happens here
    TT_TAG_END,     // </tag>
                    //       ^-- happens here 
                    // <tag ... />
                    //            ^-- or here 
    TT_ATTR,        // <tag attr="value" >      
                    //                  ^-- happens here   
    TT_WORD,
    TT_SPACE,

    TT_DATA,        // content of the sections below:

    TT_COMMENT_START, TT_COMMENT_END, // after "<!--" and "-->"
    TT_CDATA_START, TT_CDATA_END,     // after "<![CDATA[" and "]]>"
    TT_PI_START, TT_PI_END,           // after "<?" and "?>"
    TT_ENTITY_START, TT_ENTITY_END,   // after "<!ENTITY" and ">"
  
};

Depending on the token type, you can call get_tag_name(), get_attr_name() or get_value() to retrieve the information you need. This is pretty much all you need in order to scan HTML/XML.

In Closing

The scanner does not address input stream encoding, and XML and HTML deal with encoding differently. A general idea for cases where you don't know the input encoding up front: make your input stream smart enough to switch the encoding of the input on the fly. The scanner was initially created as part of the HTMLayout SDK, a lightweight embeddable HTML rendering component.
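
As a rough illustration of an encoding-aware input stream, here is a sketch of a stream that decodes UTF-8 bytes into code points. It is not part of the scanner: `utf8_istream` is a hypothetical name, and the `markup::instream` declared in the snippet is only a stand-in whose single-method shape is inferred from the str_istream example, so that the code compiles on its own. The decoder is deliberately simple: it does not reject overlong sequences or surrogate code points, and it replaces anything it cannot represent with U+FFFD.

```cpp
#include <cstddef>

// Stand-in for the library's markup::instream, declared here only so the
// sketch is self-contained. Remove it and use the real declaration instead.
namespace markup
{
    struct instream
    {
        virtual wchar_t get_char() = 0;
        virtual ~instream() {}
    };
}

// Hypothetical stream that decodes a UTF-8 byte buffer into code points.
struct utf8_istream : public markup::instream
{
    const unsigned char* p;   // current read position
    const unsigned char* end; // one past the last byte

    utf8_istream(const char* src, std::size_t length)
        : p(reinterpret_cast<const unsigned char*>(src)), end(p + length) {}

    virtual wchar_t get_char()
    {
        if (p >= end) return 0;                 // 0 signals end of input
        unsigned int c = *p++;
        int extra = 0;                          // continuation bytes expected
        if (c < 0x80) return (wchar_t)c;        // plain ASCII
        else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
        else return 0xFFFD;                     // invalid lead byte
        while (extra-- > 0)
        {
            // A missing or malformed continuation byte ends the sequence.
            if (p >= end || (*p & 0xC0) != 0x80) return 0xFFFD;
            c = (c << 6) | (*p++ & 0x3F);
        }
        // A 16-bit wchar_t cannot hold code points above U+FFFF.
        if (sizeof(wchar_t) < 4 && c > 0xFFFF) return 0xFFFD;
        return (wchar_t)c;
    }
};
```

A stream that switches encodings on the fly could start with a decoder like this one and replace it once the document declares a different encoding; how that declaration is detected is left open here.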

History

  • 11 May 2006 - Initial version
  • 12 May 2006 - Article moved
  • 09 June 2006 - Bug fixes and a new VS 2005 project
  • 10 October 2007 - Download updated (bug fixes)

License

This article, along with any associated source code and files, is licensed under The BSD License.


Written By
Founder Terra Informatica Software
Canada
Andrew Fedoniouk.

MS in Physics and Applied Mathematics.
Designing software applications and systems since 1991.

W3C HTML5 Working Group, Invited Expert.

Terra Informatica Software, Inc.
http://terrainformatica.com

Comments and Discussions

News: Reused this great library (Member 8249072, 19-Sep-11 3:47)
General: Re: Reused this great library (c-smile, 19-Sep-11 17:46)
Question: unresolved external symbol (supermegapup, 8-Aug-11 6:58)
Answer: Re: unresolved external symbol (c-smile, 8-Aug-11 16:06)
General: Encoding UTF-8 (Joakim O'Nils, 4-Sep-08 21:51)
General: Re: Encoding UTF-8 (c-smile, 11-Sep-08 16:42)

General: Encoding advise (_Stilgar_, 13-Dec-07 6:12)
Hi,

This is the code I'm using in the sample app:

struct ascii_file_istream : public markup::instream
{
  FILE *f;
  unsigned int pos;
  ascii_file_istream(const char* filename) : f(NULL), pos(0) { f = fopen(filename, "rb"); }
  virtual wchar_t get_char() { wchar_t c; pos++; return fread(&c,sizeof(wchar_t),1,f)? c : 0; }
  ~ascii_file_istream() { if (f) fclose(f); }
  bool is_file() { return (!(f==NULL)); }
};

int main(int argc, char* argv[])
{
  ascii_file_istream fi("c:\\testfile.htm");

  if (!fi.is_file())
    return 0;

  markup::scanner sc(fi);
  bool in_text = false;
  while(true)
  {
    int t = sc.get_token();
    switch(t)
    {
      case markup::scanner::TT_EOF:
        printf("EOF\n");
        goto FINISH;
      case markup::scanner::TT_SPACE:
        printf("SPACE\n");
        break;
      case markup::scanner::TT_WORD:
        {
          const markup::wchar* w = sc.get_value();
          printf("WORD: {%S}\n", sc.get_value());
        }
        break;
      // The rest of the cases
      // ...
    }
  }
FINISH:
  printf("--------------------------\n");
  return 0;
}


This code works perfectly well when used to scan English documents. It does not work when I use it to scan documents with non-English words. The only way I could make it work was to set the default charset to Unicode in the project properties and re-save the file as Unicode with Notepad (UTF8 didn't work either, only Unicode); only then did w in the TT_WORD case get the correct value instead of a row of squares. Also note that I'm reading wchar_t from the file, not char as you suggested in the original code you posted a while ago. I never got printf (nor wprintf) to output the correct characters to the screen, even when the word was read all right.

My question is this: how can I read the non-English file correctly when it is saved as UTF8 or ANSI as well? Is the scanner itself aware of those characters, or does it just ignore them, so that once I call sc.get_value() I can convert the result to whatever encoding I need and it will display fine?
Also, can I still read the file with your code:
virtual wchar_t get_char() { char c; pos++; return fread(&c,1,1,f)? c : 0; }
and still read the non-English words correctly?
The best practice would be for the reader to be independent of the file encoding. Is that possible?

Please advise.

Stilgar.

General: Re: Encoding advise (c-smile, 15-Jan-09 11:56)
General: Re: Encoding advise (ohad-oz, 12-Jul-10 14:37)
General: Re: Encoding advise (codcode, 11-Mar-14 11:02)
General: UTF-8 input stream (c-smile, 12-Apr-14 19:56)
General: Thoughts on DOCTYPE ... (Jerry Evans, 4-Nov-07 6:00)
General: Re: Thoughts on DOCTYPE ... (c-smile, 4-Nov-07 9:39)
General: Re: Thoughts on DOCTYPE ... (Jerry Evans, 4-Nov-07 13:54)
General: Re: Thoughts on DOCTYPE ... (c-smile, 4-Nov-07 16:32)
General: Re: Thoughts on DOCTYPE ... (Jerry Evans, 5-Nov-07 1:24)
General: Re: Thoughts on DOCTYPE ... (Jerry Evans, 5-Nov-07 1:36)
General: Re: Thoughts on DOCTYPE ... (c-smile, 5-Nov-07 16:26)
General: Extremely useful. (Jerry Evans, 12-Oct-07 12:21)
General: Re: Extremely useful. (c-smile, 12-Oct-07 12:37)
General: URL of latest source code (c-smile, 7-Oct-07 21:18)
General: nuther bug [modified] (dr3d, 4-Jul-07 20:10)
General: Re: nuther bug [modified] (c-smile, 6-Oct-07 18:19)
Question: Nested tags (Nulleh, 28-Oct-06 18:10)
Answer: Re: Nested tags (c-smile, 28-Oct-06 20:39)
