
Fast and Compact HTML/XML Scanner/Tokenizer

By c-smile

HTML/XML scanner/tokenizer, also known as a pull parser
VC6, VC7, VC7.1, VC8.0, Win2K, WinXP, Win2003, Vista, TabletPC, Embedded, Visual Studio, Dev
Posted: 11 May 2006
Updated: 10 Oct 2007

Introduction

This article presents an implementation of an HTML and XML scanner (or tokenizer). Imagine that you have some XML or HTML text and just need to find a particular word, tag or attribute in it. For such simple tasks, a full-blown "DOM compiler" or SAX-like parser is overkill; the markup::scanner described below is enough. Features of markup::scanner include:

  1. It does not allocate any memory while scanning, at all.
  2. It is written in pure C++ and does not require STL or any other toolkit/library.
  3. It is fast: we measured scanning speeds of nearly 40 MB of XML per second (depending on your hardware, of course).
  4. It is simple.

How to Use

I think the best way to explain is to show an example. First, we need to declare the input stream for the scanner. Here is an example of a simple string-based stream:

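struct str_istream: public markup::instream
{
    const char* p;      // current read position
    const char* end;    // one past the last character
    str_istream(const char* src): p(src), end(src + strlen(src)) {}
    // Returns the next character, or 0 at the end of the input.
    // The unsigned char cast keeps bytes above 0x7F from being sign-extended.
    virtual wchar_t get_char() { return p < end? (unsigned char)*p++: 0; }
};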
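
The same interface can be backed by a file instead of a string. The following is only a sketch: file_istream is a hypothetical name that is not part of the download, and the small markup::instream declaration is a stand-in so that the snippet compiles on its own (in a real project you would include the scanner's header instead). Bytes are passed through unchanged, so it is only appropriate for ASCII or Latin-1 input.

```cpp
#include <cstdio>

// Stand-in for the library's stream interface, declared here only so that
// this sketch is self-contained. It assumes the single virtual get_char()
// shown in the article.
namespace markup {
    struct instream {
        virtual wchar_t get_char() = 0;
        virtual ~instream() {}
    };
}

// Hypothetical FILE*-backed stream (not part of the original download).
struct file_istream : public markup::instream
{
    std::FILE* f;   // not owned: the caller opens and closes the file

    explicit file_istream(std::FILE* file) : f(file) {}

    virtual wchar_t get_char()
    {
        int c = f ? std::fgetc(f) : EOF;
        // str_istream signals the end of input with 0; do the same here.
        return c == EOF ? 0 : static_cast<wchar_t>(c);
    }
};
```

Leaving ownership of the FILE* with the caller keeps the stream trivial to construct and avoids surprises when the same file handle is used elsewhere.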

With the stream in place, we can write a program that, say, prints out all of the tokens in the input HTML:

int main(int argc, char* argv[])
{
    str_istream si("<html><body><p align=right" 
        " dir='rtl'> Begin &amp; back </p>" "</body></html>");
    markup::scanner sc(si);
    while(true)
    {
        int t = sc.get_token();
        switch(t)
        {
            case markup::scanner::TT_ERROR:
                printf("ERROR\n");
                break;
            case markup::scanner::TT_EOF:
                printf("EOF\n");
                goto FINISH;
            case markup::scanner::TT_TAG_START:
                printf("TAG START:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_TAG_END:
                printf("TAG END:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_ATTR:
                // get_value() is a wide string; %ls is the portable specifier
                printf("\tATTR:%s=%ls\n", sc.get_attr_name(), sc.get_value());
                break;
            case markup::scanner::TT_WORD: 
            case markup::scanner::TT_SPACE:
                printf("{%ls}\n", sc.get_value());
                break;
        }
    }
    FINISH:
        printf("--------------------------\n");
        return 0;
}

As you can see, the main method doing the job here is markup::scanner::get_token(). It scans the input stream and returns one of the markup::scanner::token_type values:

enum token_type 
{
    TT_ERROR = -1,
    TT_EOF = 0,

    TT_TAG_START,   // <tag ...
                    //     ^-- happens here
    TT_TAG_END,     // </tag>
                    //       ^-- happens here 
                    // <tag ... />
                    //            ^-- or here 
    TT_ATTR,        // <tag attr="value" >      
                    //                  ^-- happens here   
    TT_WORD,
    TT_SPACE,

    TT_DATA,        // content of following:

    TT_COMMENT_START, TT_COMMENT_END, // after "<!--" and "-->"
    TT_CDATA_START, TT_CDATA_END,     // after "<![CDATA[" and "]]>"
    TT_PI_START, TT_PI_END,           // after "<?" and "?>"
    TT_ENTITY_START, TT_ENTITY_END,   // after "<!ENTITY" and ">"
};
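
When debugging a token loop, it can help to print a readable label for each token. The sketch below is one way to do that. token_name is a hypothetical helper, not part of the scanner, and the enum is copied from the listing above only so that the snippet is self-contained; real code would refer to markup::scanner::token_type instead.

```cpp
// Copy of the token_type values from the article, repeated here only to
// keep the sketch self-contained.
enum token_type {
    TT_ERROR = -1, TT_EOF = 0,
    TT_TAG_START, TT_TAG_END, TT_ATTR, TT_WORD, TT_SPACE, TT_DATA,
    TT_COMMENT_START, TT_COMMENT_END,
    TT_CDATA_START, TT_CDATA_END,
    TT_PI_START, TT_PI_END,
    TT_ENTITY_START, TT_ENTITY_END
};

// Hypothetical helper: maps a token value to a printable label.
// It takes an int because the article's example stores get_token() in an int.
const char* token_name(int t)
{
    switch (t) {
        case TT_ERROR:         return "ERROR";
        case TT_EOF:           return "EOF";
        case TT_TAG_START:     return "TAG_START";
        case TT_TAG_END:       return "TAG_END";
        case TT_ATTR:          return "ATTR";
        case TT_WORD:          return "WORD";
        case TT_SPACE:         return "SPACE";
        case TT_DATA:          return "DATA";
        case TT_COMMENT_START: return "COMMENT_START";
        case TT_COMMENT_END:   return "COMMENT_END";
        case TT_CDATA_START:   return "CDATA_START";
        case TT_CDATA_END:     return "CDATA_END";
        case TT_PI_START:      return "PI_START";
        case TT_PI_END:        return "PI_END";
        case TT_ENTITY_START:  return "ENTITY_START";
        case TT_ENTITY_END:    return "ENTITY_END";
        default:               return "UNKNOWN";
    }
}
```

A call such as printf("%s\n", token_name(sc.get_token())) would then log every token kind, including the ones the example program silently ignores.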

Depending on the token type, use get_tag_name(), get_attr_name() or get_value() to retrieve the corresponding information. This is pretty much all you need in order to scan HTML/XML.
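
As a small illustration of this pattern, the sketch below counts the opening tags with a given name. It is written as a template so that it compiles without the library. count_tags and scripted_scanner are hypothetical names: the first assumes only the get_token(), get_tag_name() and TT_* members used in the example program, and the second is a test double that replays a fixed list of tokens in place of markup::scanner. Stopping at the first TT_ERROR is a choice made for this sketch, not documented scanner behavior.

```cpp
#include <cstring>

// Counts opening tags named `name`. Scanner is any type that offers the
// interface used in the article's example program.
template <class Scanner>
int count_tags(Scanner& sc, const char* name)
{
    int count = 0;
    for (;;) {
        int t = sc.get_token();
        if (t == Scanner::TT_EOF || t == Scanner::TT_ERROR)
            return count;   // stop at end of input or at the first error
        if (t == Scanner::TT_TAG_START &&
            std::strcmp(sc.get_tag_name(), name) == 0)
            ++count;
    }
}

// Test double standing in for markup::scanner: replays a scripted token list.
struct scripted_scanner
{
    enum token_type { TT_ERROR = -1, TT_EOF = 0, TT_TAG_START, TT_TAG_END };
    struct step { int token; const char* tag; };

    const step* cur;    // next scripted token
    const char* tag;    // tag name of the most recently returned token

    explicit scripted_scanner(const step* script) : cur(script), tag("") {}

    int get_token()
    {
        int t = cur->token;
        tag = cur->tag;
        if (t != TT_EOF && t != TT_ERROR)
            ++cur;      // stay on the terminating entry once it is reached
        return t;
    }

    const char* get_tag_name() const { return tag; }
};
```

With the real scanner, the call would look like count_tags(sc, "p") on a markup::scanner constructed as in the example program.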

In Closing

The scanner does not address input stream encoding, which XML and HTML handle differently. A general approach for cases where you don't know the input encoding up front: make your input stream smart enough to switch the encoding of the input on the fly.

The scanner was initially created as part of the HTMLayout SDK, a lightweight embeddable HTML rendering component.
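
To illustrate handling the encoding in the stream layer, here is a minimal sketch of a stream that decodes UTF-8 into wchar_t values. It assumes the input is already known to be UTF-8 and does not attempt any encoding detection or switching. utf8_istream is a hypothetical name, and the markup::instream declaration is a stand-in for the library's header. Malformed sequences are replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstring>

// Stand-in for the library's stream interface, declared here only so that
// this sketch is self-contained.
namespace markup {
    struct instream {
        virtual wchar_t get_char() = 0;
        virtual ~instream() {}
    };
}

// Hypothetical UTF-8 decoding stream over a NUL-terminated string.
struct utf8_istream : public markup::instream
{
    enum { REPLACEMENT = 0xFFFD };   // U+FFFD REPLACEMENT CHARACTER

    const unsigned char* p;     // current read position
    const unsigned char* end;   // one past the last byte

    explicit utf8_istream(const char* src)
        : p(reinterpret_cast<const unsigned char*>(src)),
          end(p + std::strlen(src)) {}

    virtual wchar_t get_char()
    {
        if (p >= end) return 0;                     // 0 marks end of input
        unsigned long c = *p++;
        if (c < 0x80) return static_cast<wchar_t>(c);   // plain ASCII

        int extra;                                  // continuation bytes expected
        if      ((c & 0xE0) == 0xC0) { extra = 1; c &= 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; }
        else return static_cast<wchar_t>(REPLACEMENT);  // invalid lead byte

        while (extra-- > 0) {
            if (p >= end || (*p & 0xC0) != 0x80)    // truncated or broken sequence
                return static_cast<wchar_t>(REPLACEMENT);
            c = (c << 6) | (*p++ & 0x3F);
        }

        // A 16-bit wchar_t (as on Windows) cannot hold code points above U+FFFF.
        if (c > 0xFFFF && sizeof(wchar_t) < 4)
            return static_cast<wchar_t>(REPLACEMENT);
        return static_cast<wchar_t>(c);
    }
};
```

A stream that switches encodings on the fly could follow the same shape, choosing between several such decoders once it has seen the encoding declaration in the input.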

History

  • 11 May 2006 - Initial version
  • 12 May 2006 - Article moved
  • 09 June 2006 - Bug fixes and a new VS 2005 project
  • 10 October 2007 - Download updated (bug fixes)

License

This article, along with any associated source code and files, is licensed under The BSD License.

About the Author

c-smile


Member
Andrew Fedoniouk.

MS in Physics and Applied Mathematics.
Designing software applications and systems since 1991.

W3C HTML5 Working Group, Invited Expert.

Terra Informatica Software, Inc.
http://terrainformatica.com
Occupation: Founder
Company: Terra Informatica Software
Location: Canada

Editor: Genevieve Sovereign
Copyright 2006 by c-smile