
Fast and Compact HTML/XML Scanner/Tokenizer

Andrew Fedoniouk, 10 Oct 2007, BSD
HTML/XML scanner/tokenizer, also known as a pull parser


The code presented here is an implementation of an HTML and XML scanner (or tokenizer). Suppose you have some XML or HTML text and you only need to find a particular word, tag or attribute in it. For such simple tasks, a full-blown "DOM compiler" or a SAX-like parser is overkill; the markup::scanner described below is enough. Features of markup::scanner include:

  1. It does not allocate any memory while scanning, at all.
  2. It is written in pure C++ and does not require STL or any other toolkit/library.
  3. It is fast. We measured scanning speeds of nearly 40 MB of XML per second (this depends on your hardware, of course).
  4. It is simple.

How to Use

I think the best way to explain is to show an example. First, we need to declare the input stream for the scanner. Here is an example of a simple string-based stream:

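#include <string.h>  // strlen
struct str_istream: public markup::instream
{
    const char* p;    // current read position
    const char* end;  // one past the last character
    str_istream(const char* src): p(src), end(src + strlen(src)) {}
    // Returns the next character, or 0 at the end of the input.
    // The cast keeps bytes above 0x7F from being sign-extended.
    virtual wchar_t get_char() { return p < end? (unsigned char)*p++: 0; }
};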
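
The same pattern works for other sources. Below is a minimal sketch of a file-backed stream, assuming that markup::instream requires nothing more than the get_char() override used by str_istream. The stand-in declaration of markup::instream and the name file_istream are hypothetical; they are only there so that the sketch compiles on its own, and the stand-in should be replaced by the real header.

```cpp
#include <cstdio>

// Stand-in for the real markup::instream so the sketch compiles on its own.
// Assumption: the interface is the single virtual get_char() that the
// article's str_istream overrides.
namespace markup {
    struct instream {
        virtual wchar_t get_char() = 0;
        virtual ~instream() {}
    };
}

// Hypothetical file-backed stream: one byte per call, 0 at end of file.
// It performs no decoding, so it is only correct for single-byte encodings.
struct file_istream : public markup::instream {
    std::FILE* f;  // not owned; the caller opens and closes the file

    explicit file_istream(std::FILE* file) : f(file) {}

    virtual wchar_t get_char() {
        int c = f ? std::fgetc(f) : EOF;
        return c == EOF ? 0 : (wchar_t)c;
    }
};
```

Returning 0 at end of file mirrors what str_istream does at the end of its string, so the scanner sees the same end-of-input signal from both streams.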

This is all that we need in order to write the program which will, let's say, print out all of the tokens in the input HTML:

#include <stdio.h>  // printf

int main(int argc, char* argv[])
{
    str_istream si("<html><body><p align=right" 
        " dir='rtl'> Begin &amp; back </p>" "</body></html>");
    markup::scanner sc(si);
    while(true)
    {
        int t = sc.get_token();
        switch(t)
        {
            case markup::scanner::TT_ERROR:
            case markup::scanner::TT_EOF:
                goto FINISH;  // stop on error or end of input
            case markup::scanner::TT_TAG_START:
                printf("TAG START:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_TAG_END:
                printf("TAG END:%s\n", sc.get_tag_name());
                break;
            case markup::scanner::TT_ATTR:
                // %ls prints a wide string; the original used the
                // Microsoft-specific %S.
                printf("\tATTR:%s=%ls\n", sc.get_attr_name(), sc.get_value());
                break;
            case markup::scanner::TT_WORD: 
            case markup::scanner::TT_SPACE:
                printf("{%ls}\n", sc.get_value());
                break;
        }
    }
FINISH:
    return 0;
}

As you can see, the main method doing the job here is markup::scanner::get_token(). It scans the input stream and returns a value of the markup::scanner::token_type enumeration:

enum token_type 
{
    TT_ERROR = -1,
    TT_EOF = 0,

    TT_TAG_START,   // <tag ...
                    //     ^-- happens here
    TT_TAG_END,     // </tag>
                    //       ^-- happens here 
                    // <tag ... />
                    //            ^-- or here 
    TT_ATTR,        // <tag attr="value" >      
                    //                  ^-- happens here   
    TT_WORD,        // a word of text, as used in the example above
    TT_SPACE,       // a run of white space, as used in the example above

    TT_DATA,        // content of following:

    TT_COMMENT_START, TT_COMMENT_END, // after "<!--" and "-->"
    TT_CDATA_START, TT_CDATA_END,     // after "<![CDATA[" and "]]>"
    TT_PI_START, TT_PI_END,           // after "<?" and "?>"
    TT_ENTITY_START, TT_ENTITY_END,   // after "<!ENTITY" and ">"
};

Depending on the token returned, you can call get_tag_name(), get_value() or get_attr_name() to retrieve the corresponding information. This is pretty much all you need in order to scan HTML/XML.
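
The pairing of tokens and accessors can be written down as a small lookup. The sketch below only records what the printing example in this article does; the accessor_flags enum and the accessors_for() helper are hypothetical names, the enum is a shortened stand-in for markup::scanner::token_type, and tokens that the example never handles are reported as ACC_NONE simply because the article does not say which accessors apply to them.

```cpp
// Shortened stand-in for markup::scanner::token_type (assumption: only the
// enumerators used by the article's example are listed here).
enum token_type {
    TT_ERROR = -1,
    TT_EOF = 0,
    TT_TAG_START,
    TT_TAG_END,
    TT_ATTR,
    TT_WORD,
    TT_SPACE
};

// Hypothetical bit flags naming the scanner accessors.
enum accessor_flags {
    ACC_NONE      = 0,
    ACC_TAG_NAME  = 1,  // get_tag_name()
    ACC_ATTR_NAME = 2,  // get_attr_name()
    ACC_VALUE     = 4   // get_value()
};

// Returns the accessors that the article's example calls for a given token.
inline int accessors_for(token_type t) {
    switch (t) {
        case TT_TAG_START:
        case TT_TAG_END:
            return ACC_TAG_NAME;
        case TT_ATTR:
            return ACC_ATTR_NAME | ACC_VALUE;
        case TT_WORD:
        case TT_SPACE:
            return ACC_VALUE;
        default:
            return ACC_NONE;  // not covered by the example
    }
}
```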

In Closing

The scanner does not address input stream encoding at all; that is left to the input stream. XML and HTML handle encoding differently. A general idea for cases where you don't know the input encoding up front: your input stream should be smart enough to switch the encoding of the input on the fly. The scanner was originally created as part of the HTMLayout SDK, a lightweight embeddable HTML rendering component.
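
One way to read "switch the encoding on the fly" is a stream whose decoder can be changed between two get_char() calls, for instance after the caller has seen an encoding declaration in the tokens. The following is a minimal sketch under that assumption. The names switchable_istream and set_encoding are hypothetical, the markup::instream declaration is a stand-in for the real header, and only Latin-1 and UTF-8 sequences of up to three bytes are handled.

```cpp
#include <cstddef>

// Stand-in for the real markup::instream (assumption: one virtual get_char()).
namespace markup {
    struct instream {
        virtual wchar_t get_char() = 0;
        virtual ~instream() {}
    };
}

// Hypothetical stream that starts as Latin-1 and can be switched to UTF-8.
struct switchable_istream : public markup::instream {
    enum encoding { LATIN1, UTF8 };
    const unsigned char* p;
    const unsigned char* end;
    encoding enc;

    switchable_istream(const char* src, std::size_t len)
        : p((const unsigned char*)src), end(p + len), enc(LATIN1) {}

    // Takes effect on the next get_char() call.
    void set_encoding(encoding e) { enc = e; }

    // Reads one continuation byte (10xxxxxx); -1 if it is missing or malformed.
    int continuation() {
        if (p < end && (*p & 0xC0) == 0x80) return *p++ & 0x3F;
        return -1;
    }

    virtual wchar_t get_char() {
        if (p >= end) return 0;
        unsigned int b = *p++;
        if (enc == LATIN1 || b < 0x80) return (wchar_t)b;
        if ((b & 0xE0) == 0xC0) {  // 2-byte sequence
            int c1 = continuation();
            return c1 < 0 ? L'?' : (wchar_t)(((b & 0x1F) << 6) | c1);
        }
        if ((b & 0xF0) == 0xE0) {  // 3-byte sequence
            int c1 = continuation();
            int c2 = c1 < 0 ? -1 : continuation();
            return c2 < 0 ? L'?' : (wchar_t)(((b & 0x0F) << 12) | (c1 << 6) | c2);
        }
        return L'?';  // 4-byte sequences and stray bytes are not handled here
    }
};
```

Starting in a single-byte mode is a design choice of this sketch: markup syntax and encoding names are plain ASCII, so the declaration can be read before the stream knows how to decode the rest of the input.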


  • 11 May 2006 - Initial version
  • 12 May 2006 - Article moved
  • 09 June 2006 - Bug fixes and a new VS 2005 project
  • 10 October 2007 - Download updated (bug fixes)


This article, along with any associated source code and files, is licensed under The BSD License


About the Author

Founder Terra Informatica Software
Canada
Andrew Fedoniouk.
MS in Physics and Applied Mathematics.
Designing software applications and systems since 1991.
W3C HTML5 Working Group, Invited Expert.
Terra Informatica Software, Inc.

Comments and Discussions

Re: Encoding advise (c-smile, 15-Jan-09 12:56)
Here is what I use for parsing UTF8 encoded streams:
  class mem_utf8_istream: public markup::instream 
  {
    bytes buf;  // UTF-8 encoded input
    int   pos;  // index of the next unread byte
  public:
    mem_utf8_istream(bytes text) : buf(text), pos(0) { }
    virtual wchar_t get_char() { return getc_utf8(buf, pos); } 
  };
Where getc_utf8 is this:
// Note: bytes, uint and wchar are types from the poster's own code base;
// they are not defined in this post.

// Returns the six payload bits of a UTF-8 continuation byte (10xxxxxx).
inline uint get_next_utf8(unsigned int val)
{
  // Check for the correct bits at the start; anything else is a
  // bad continuation of a multi-byte UTF-8 sequence.
  assert((val & 0xc0) == 0x80);
  // Return the significant bits.
  return (val & 0x3f);
}

// Returns the next byte of buf and advances pos, or 0 at the end of the buffer.
inline unsigned int getb(const bytes& buf, int& pos)
{
  if( uint(pos) >= buf.length )
    return 0;
  return buf[pos++];
}

// ATTN: UCS-2 only!
wchar getc_utf8(const bytes& buf, int& pos)
{
    unsigned int b1 = getb(buf,pos);
    if( b1 == 0 )
      return 0;
    // Determine whether we are dealing
    // with a one-, two-, three-, or four-
    // byte sequence.
    if ((b1 & 0x80) == 0)
    {
      // 1-byte sequence: 000000000xxxxxxx = 0xxxxxxx
      return (wchar)b1;
    }
    else if ((b1 & 0xe0) == 0xc0)
    {
      // 2-byte sequence: 00000yyyyyxxxxxx = 110yyyyy 10xxxxxx
      uint r = (b1 & 0x1f) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r;
    }
    else if ((b1 & 0xf0) == 0xe0)
    {
      // 3-byte sequence: zzzzyyyyyyxxxxxx = 1110zzzz 10yyyyyy 10xxxxxx
      uint r = (b1 & 0x0f) << 12;
           r |= get_next_utf8(getb(buf,pos)) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r;
    }
    else if ((b1 & 0xf8) == 0xf0)
    {
      // 4-byte sequence: 11101110wwwwzzzzyy + 110111yyyyxxxxxx
      //     = 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
      // (uuuuu = wwww + 1)
      // Such characters need a UTF-16 surrogate pair, which a single
      // wchar return value cannot carry, so '?' is returned instead.
      // The original post kept unreachable draft code for building the pair.
      // TODO: produce the pair and test that the surrogate value is legal.
      // Skip the three continuation bytes so that they are not each
      // reported as a separate '?' by the following calls.
      getb(buf,pos); getb(buf,pos); getb(buf,pos);
      return L'?';
    }
    else
    {
      // bad start for UTF-8 multi-byte sequence
      return L'?';
    }
}
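
The 4-byte case that the code above leaves as a TODO can be sketched separately. This is not the poster's code: decode_utf8_4 and to_surrogates are hypothetical helper names, and the sketch only shows the standard arithmetic for assembling a code point from four UTF-8 bytes and splitting it into a UTF-16 surrogate pair. Wiring it into get_char(), which returns one wchar_t per call, would additionally require the stream to remember the low surrogate for the next call.

```cpp
// Assembles the code point of a 4-byte UTF-8 sequence
// 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx from its raw bytes.
// The caller is assumed to have validated the lead and continuation bits.
inline unsigned long decode_utf8_4(unsigned int b1, unsigned int b2,
                                   unsigned int b3, unsigned int b4) {
    return ((unsigned long)(b1 & 0x07) << 18) |
           ((unsigned long)(b2 & 0x3F) << 12) |
           ((unsigned long)(b3 & 0x3F) << 6)  |
            (unsigned long)(b4 & 0x3F);
}

// Splits a code point in the range U+10000..U+10FFFF into a UTF-16
// surrogate pair. Returns false (leaving hi and lo untouched) when the
// code point is outside that range.
inline bool to_surrogates(unsigned long cp,
                          unsigned short& hi, unsigned short& lo) {
    if (cp < 0x10000UL || cp > 0x10FFFFUL) return false;
    cp -= 0x10000UL;                              // 20 bits remain
    hi = (unsigned short)(0xD800 | (cp >> 10));   // upper 10 bits
    lo = (unsigned short)(0xDC00 | (cp & 0x3FF)); // lower 10 bits
    return true;
}
```

For example, the bytes F0 9F 98 80 decode to U+1F600, which splits into the pair D83D DE00.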

Article Copyright 2006 by c-smile