
Fast and Compact HTML/XML Scanner/Tokenizer

By c-smile, 10 Oct 2007
 

Introduction

The code presented here is an implementation of an HTML and XML scanner (or tokenizer). Imagine that you have some XML or HTML text and you just need to find a particular word, tag or attribute in it. For such simple tasks, a full-blown "DOM compiler" or SAX-like parser is overkill; the markup::scanner described below is enough. Features of markup::scanner include:

  1. It does not allocate any memory while scanning, at all.
  2. It is written in pure C++ and does not require STL or any other toolkit/library.
  3. It is fast. We measured scanning speeds of nearly 40 MB of XML per second (this depends on your hardware, of course).
  4. It is simple.

How to Use

I think the best way to explain is to show an example. First, we need to declare the input stream for the scanner. Here is an example of a simple string-based stream:

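#include <string.h>      // strlen
#include "xh_scanner.h"  // markup::instream, markup::scanner (header from the article's download; adjust the name if yours differs)

// Input stream over a NUL-terminated string; get_char() returns 0 at the end.
struct str_istream: public markup::instream
{
    const char* p;
    const char* end;
    str_istream(const char* src): p(src), end(src + strlen(src)) {}
    // Cast through unsigned char so that bytes >= 0x80 are not sign-extended.
    virtual wchar_t get_char() { return p < end? (unsigned char)*p++: 0; }
};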
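The same idea carries over to other sources of input. As a minimal sketch, here is a stream that reads from a FILE* through a fixed member buffer, so the stream itself still allocates nothing on the heap. The names demo::instream and demo::file_istream are hypothetical and not part of the article's download; demo::instream only stands in for markup::instream so that the sketch compiles on its own.

```cpp
#include <assert.h>
#include <stdio.h>

namespace demo {

// Stand-in for markup::instream so this sketch compiles by itself.
// With the real scanner, derive from markup::instream instead.
struct instream
{
    virtual wchar_t get_char() = 0;
    virtual ~instream() {}
};

// Hypothetical file-backed stream. Bytes are read in blocks into a fixed
// member buffer; each byte is returned as one character (a Latin-1 view).
struct file_istream: public instream
{
    FILE*         f;
    unsigned char buf[4096];
    size_t        pos, len;

    explicit file_istream(FILE* file): f(file), pos(0), len(0) { assert(file != 0); }

    virtual wchar_t get_char()
    {
        if (pos == len)
        {
            len = fread(buf, 1, sizeof buf, f);  // refill the buffer
            pos = 0;
            if (len == 0) return 0;              // 0 signals end of input, as in str_istream
        }
        return buf[pos++];
    }
};

} // namespace demo
```

The stream does not own the FILE*, so the caller opens and closes it. The rest of the article continues with the string-based str_istream.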

With an input stream such as str_istream in place, we have all we need to write a program that, say, prints all of the tokens in the input HTML:

#include <stdio.h>  // printf

int main(int argc, char* argv[])
{
    str_istream si("<html><body><p align=right"
        " dir='rtl'> Begin &amp; back </p>" "</body></html>");
    markup::scanner sc(si);
    while(true)
    {
        int t = sc.get_token();
        switch(t)
        {
            case markup::scanner::TT_ERROR:
                printf("ERROR\n");
                break;
            case markup::scanner::TT_EOF:
                printf("EOF\n");
                goto FINISH;
            case markup::scanner::TT_TAG_START:
                printf("TAG START:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_TAG_END:
                printf("TAG END:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_ATTR:
                // The value is a wide string: %ls is the portable form of the Microsoft-specific %S.
                printf("\tATTR:%s=%ls\n", sc.get_attr_name(), sc.get_value());
                break;
            case markup::scanner::TT_WORD:
            case markup::scanner::TT_SPACE:
                printf("{%ls}\n", sc.get_value());
                break;
        }
    }
FINISH:
    printf("--------------------------\n");
    return 0;
}

As you can see, the method doing the main job here is markup::scanner::get_token(). It scans the input stream and returns a value of the markup::scanner::token_type enumeration:

enum token_type 
{
    TT_ERROR = -1,
    TT_EOF = 0,

    TT_TAG_START,   // <tag ...
                    //     ^-- happens here
    TT_TAG_END,     // </tag>
                    //       ^-- happens here 
                    // <tag ... />
                    //            ^-- or here 
    TT_ATTR,        // <tag attr="value" >      
                    //                  ^-- happens here   
    TT_WORD,
    TT_SPACE,

    TT_DATA,        // content of the following constructs:

    TT_COMMENT_START, TT_COMMENT_END, // after "<!--" and "-->"
    TT_CDATA_START, TT_CDATA_END,     // after "<![CDATA[" and "]]>"
    TT_PI_START, TT_PI_END,           // after "<?" and "?>"
    TT_ENTITY_START, TT_ENTITY_END,   // after "<!ENTITY" and ">"
};

Depending on the value of the token, you can use get_tag_name(), get_value() or get_attr_name() to retrieve the needed information. This is pretty much all you need in order to scan HTML/XML.
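To illustrate which accessor matters for which token, here is a small sketch of the "find an attribute" task mentioned in the introduction. It is deliberately decoupled from the scanner: demo::attr_finder is a hypothetical helper that is fed one token at a time, and demo::token_type is only a stand-in copy of the first values of markup::scanner::token_type, so the sketch compiles without the scanner sources.

```cpp
#include <assert.h>
#include <string.h>
#include <wchar.h>

namespace demo {

// Stand-in copy of the first values of markup::scanner::token_type.
enum token_type { TT_ERROR = -1, TT_EOF = 0, TT_TAG_START, TT_TAG_END, TT_ATTR };

// Hypothetical helper: remembers the value of the first attribute `attr`
// seen on a tag named `tag`. Names are compared case-sensitively.
struct attr_finder
{
    const char* tag;
    const char* attr;
    bool        in_tag;     // true while the attributes of a matching tag arrive
    bool        found;
    wchar_t     value[64];  // fixed buffer, longer values are truncated

    attr_finder(const char* t, const char* a)
        : tag(t), attr(a), in_tag(false), found(false)
    {
        assert(t != 0 && a != 0);
        value[0] = 0;
    }

    // Feed one token. Returns false once scanning can stop.
    bool feed(int t, const char* tag_name, const char* attr_name, const wchar_t* val)
    {
        switch (t)
        {
            case TT_ERROR:
            case TT_EOF:
                return false;
            case TT_TAG_START:              // tag_name is meaningful here
                in_tag = strcmp(tag_name, tag) == 0;
                break;
            case TT_TAG_END:
                in_tag = false;
                break;
            case TT_ATTR:                   // attr_name and val are meaningful here
                if (in_tag && strcmp(attr_name, attr) == 0)
                {
                    wcsncpy(value, val, 63);
                    value[63] = 0;
                    found = true;
                    return false;
                }
                break;
        }
        return true;                        // all other tokens are ignored
    }
};

} // namespace demo
```

With the real scanner, the loop would call get_token() and pass the accessor that matches the token (get_tag_name() for TT_TAG_START, get_attr_name() and get_value() for TT_ATTR) until feed() returns false.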

In Closing

The scanner does not address input stream encoding, and XML and HTML deal with encoding differently. For cases where you don't know the input encoding up front, the general idea is that your input stream should be smart enough to switch its decoding on the fly. The scanner was originally created as part of the HTMLayout SDK, a lightweight embeddable HTML rendering component.
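One way to read "switch the encoding on the fly" is a stream whose decoding mode can be changed between get_char() calls, for example once the caller has seen an encoding declaration. The sketch below is an assumption about how that could look, not code from the article: demo::switching_istream and its set_encoding() are hypothetical names, demo::instream stands in for markup::instream, and the UTF-8 decoder is simplified (it does not reject overlong forms).

```cpp
#include <assert.h>
#include <stddef.h>

namespace demo {

// Stand-in for markup::instream so this sketch compiles by itself.
struct instream
{
    virtual wchar_t get_char() = 0;
    virtual ~instream() {}
};

// Hypothetical stream over a byte buffer with a switchable decoder.
struct switching_istream: public instream
{
    enum encoding { LATIN1, UTF8 };

    const unsigned char* p;
    const unsigned char* end;
    encoding             enc;

    switching_istream(const char* src, size_t n)
        : p((const unsigned char*)src), end(p + n), enc(LATIN1) { assert(src != 0); }

    // Takes effect from the next get_char() call.
    void set_encoding(encoding e) { enc = e; }

    virtual wchar_t get_char()
    {
        if (p >= end) return 0;                 // 0 signals end of input
        unsigned c = *p++;
        if (enc == LATIN1 || c < 0x80) return (wchar_t)c;

        // UTF-8: the lead byte tells how many continuation bytes follow.
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : (c >= 0xC0) ? 1 : -1;
        if (extra < 0) return L'?';             // stray continuation byte
        unsigned cp = c & (0x3F >> extra);      // payload bits of the lead byte
        while (extra-- > 0)
        {
            if (p >= end || (*p & 0xC0) != 0x80) return L'?';  // truncated or malformed
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        return (wchar_t)cp;  // code points above 0xFFFF do not fit a 16-bit wchar_t
    }
};

} // namespace demo
```

Starting in a single-byte mode is a design choice of this sketch: an XML declaration or an HTML meta tag is plain ASCII in the common encodings, so it can be read before the real encoding is known.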

History

  • 11 May 2006 - Initial version
  • 12 May 2006 - Article moved
  • 09 June 2006 - Bug fixes and a new VS 2005 project
  • 10 October 2007 - Download updated (bug fixes)

License

This article, along with any associated source code and files, is licensed under The BSD License

About the Author

c-smile
Founder Terra Informatica Software
Canada
Andrew Fedoniouk.
 
MS in Physics and Applied Mathematics.
Designing software applications and systems since 1991.
 
W3C HTML5 Working Group, Invited Expert.
 
Terra Informatica Software, Inc.
http://terrainformatica.com

Article Copyright 2006 by c-smile