Fast and Compact HTML/XML Scanner/Tokenizer

Just to let people know that we reused much of this great scanner code to implement a small and fast scanner that can be called from Node, the server-side JavaScript environment.

Using it is very easy, as the following snippet shows:

JavaScript

var scanner = new Scanner("<div>content</div>");
var token; // declared explicitly to avoid creating an implicit global
do {
  token = scanner.next();
  console.dir(token);
} while (token[0]);

You can find the source code at: https://github.com/jbaron/htmlscanner

P.S. Tested only on Linux so far, not yet on other platforms. To get it compiling on Windows you also need to get the relevant Node stuff up and running, which could be a challenge.

regards,
JBaron

Great! and thanks for the update, good to know that it can be used with node.js.

By the way, it is used in my TIScript (and so in Sciter) too, wrapped this way:
http://www.terrainformatica.com/tiscript/XMLScanner.whtm

I am using a slightly different variation of the scanner from the article:
http://code.google.com/p/tiscript/source/browse/trunk/tool/tl_markup.h. In particular, it emits WORD and SPACE tokens rather than just TEXT. May or may not be useful in your case.

Error 4 error LNK2001: unresolved external symbol "private: enum markup::scanner::token_type __thiscall markup::scanner::scan_body(void)" (?scan_body@scanner@markup@@AAE?AW4token_type@12@XZ)

Did you include xh_scanner.cpp in your project?

Hi,

Here is an attempt to make this work for UTF-8 encoded text.
I took the checks for the different lengths of UTF-8 multibyte characters from
the Wikipedia article explaining UTF-8 encoding:

struct str_istream: public markup::instream
{
  const char* p;
  const char* end;

  str_istream(const char* src): p(src), end(src + strlen(src)) {}
  virtual wchar_t get_char(); // { return p < end? *p++: 0; }
};

int mbtowc_Utf8( wchar_t *wchar, const char *mbchar, size_t count)
{
   int res = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mbchar, count, wchar, 1);
  
   if (res <= 0)
   {
     res =  -1;
   }

   return res;
}


wchar_t str_istream::get_char()
{
   if (p < end)
   {
      const char* ps = p;
      if ((0xE0 & *p) == 0xC0)
      {
        // Char count == 2
        p++;
        if (p >= end) return 0;
        p++;
        
        wchar_t wch;
        if (mbtowc_Utf8(&wch, ps, 2) == -1)
        {
           return '?';
        }
        
        return wch;

      }
      else if ((0xF0 & *p) == 0xE0)
      {
        // Char count == 3
        p++;
        if (p >= end) return 0;
        p++;
        if (p >= end) return 0;
        p++;
        
        wchar_t wch;
        if (mbtowc_Utf8(&wch, ps, 3) == -1)
        {
           return '?';
        }
        
        return wch;
      }
      else if ((0xF8 & *p) == 0xF0)
      {
        // Char count == 4
        p++;
        if (p >= end) return 0;
        p++;
        if (p >= end) return 0;
        p++;
        if (p >= end) return 0;
        p++;
        
        wchar_t wch;
        if (mbtowc_Utf8(&wch, ps, 4) == -1)
        {
           return '?';
        }
        
        return wch;
      } 
      else
      {
        return *p++;
      }
   }
   else
   {
      return 0;
   }
}

Here is the sample text that I tried it with:

const char* inp = 
  "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
  "<!-- Generator: Adobe Illustrator 9.0, SVG Export Plug-In  -->"
  "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 20000303 Stylable//EN\"   \"http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd\" ["
  "    <!ENTITY st0 \"fill:#E61408;\">"
  "    <!ENTITY st1 \"fill:#1C1585;\">"
  "]>"
  "<svg>text &#x61; <UTF8NODE ucAtr=AB\xef\xbb\xbf\xd8\xa7\xd9\x84\xd9\x87 /></svg>";

Here is what can be used to implement a UTF-8 input stream.

bytes here is a simple struct:

struct bytes 
{ 
  byte* start;
  uint  length;
};

The function wchar getc_utf8(const bytes& buf, int& pos) converts an input sequence of bytes into a single UCS-2 (16-bit) code point, that is, a code point from the Basic Multilingual Plane subset of full Unicode.

inline uint get_next_utf8(unsigned int val)
{
  // Check for the correct bits at the start.
  assert((val & 0xc0) == 0x80);
  //bad continuation of multi-byte UTF-8 sequence
  // Return the significant bits.
  return (val & 0x3f);
}

inline unsigned int getb(const bytes& buf, int& pos) 
{
  if( uint(pos) >= buf.length )
    return 0;
  return buf.start[pos++]; // bytes, as declared above, has no operator[]
}

// ATTN: UCS-2 only!
wchar getc_utf8(const bytes& buf, int& pos)
{
    unsigned int b1;
    bool isSurrogate = false;

    b1 = getb(buf,pos);
    if(!b1)
      return 0;
    isSurrogate = false;

    // Determine whether we are dealing
    // with a one-, two-, three-, or four-
    // byte sequence.
    if ((b1 & 0x80) == 0) 
    {
	    // 1-byte sequence: 000000000xxxxxxx = 0xxxxxxx
	    return (wchar)b1;
    } 
    else if ((b1 & 0xe0) == 0xc0) 
    {
	    // 2-byte sequence: 00000yyyyyxxxxxx = 110yyyyy 10xxxxxx
	    uint r = (b1 & 0x1f) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r; 
    } 
    else if ((b1 & 0xf0) == 0xe0) 
    {
	    // 3-byte sequence: zzzzyyyyyyxxxxxx = 1110zzzz 10yyyyyy 10xxxxxx
      uint r = (b1 & 0x0f) << 12;
           r |= get_next_utf8(getb(buf,pos)) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r;
    } 
    else if ((b1 & 0xf8) == 0xf0) 
    {
	    // 4-byte sequence: 11101110wwwwzzzzyy + 110111yyyyxxxxxx
	    //     = 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
	    // (uuuuu = wwww + 1)
	    isSurrogate = true;
           return L'?';
      /*
	    int b2 = get_next_utf8(pc++);
	    int b3 = get_next_utf8(pc++);
	    int b4 = get_next_utf8(pc++);
	    buf += 
	      (wchar)(0xd800 |
		     ((((b1 & 0x07) << 2) | ((b2 & 0x30) >> 4) - 1) << 6) |
		     ((b2 & 0x0f) << 2) |
		     ((b3 & 0x30) >> 4));
	    buf +=
	      (wchar)(0xdc | ((b3 & 0x0f) << 6) | b4);
				    // TODO: test that surrogate value is legal.
      */
    } 
    else 
    {
      assert(0);
      return L'?';
      //bad start for UTF-8 multi-byte sequence"
    }
}

UCS-2 strings are what Windows used (the so-called LPCWSTR) prior to XP. In Windows XP, LPCWSTR is a UTF-16 string.
Welcome to UTFs! :)
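
For code points above U+FFFF, which the 4-byte branch above replaces with '?', UTF-16 uses a pair of 16-bit units (a surrogate pair). Below is a minimal sketch of that split. The helper name to_utf16 is hypothetical and not part of the scanner; it only illustrates the arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the scanner): encodes one Unicode
// code point as UTF-16. Code points above U+FFFF become a surrogate pair.
std::u16string to_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // largest valid Unicode code point
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit (Basic Multilingual Plane).
        out.push_back(static_cast<char16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 | (v >> 10)));    // high surrogate: top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 | (v & 0x3FF)));  // low surrogate: bottom 10 bits
    }
    return out;
}
```

A stream whose get_char() returns one 16-bit unit at a time would have to hand out the two halves on consecutive calls, which is probably why the original code leaves this case as a TODO.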

Hi,

This is the code I'm using in the sample app:

struct ascii_file_istream : public markup::instream
{
  FILE *f;
  unsigned int pos;
  ascii_file_istream(const char* filename) : pos(0), f(NULL) { f = fopen(filename, "rb"); }
  virtual wchar_t get_char() { wchar_t c; pos++; return fread(&c,sizeof(wchar_t),1,f)? c : 0; }
  ~ascii_file_istream() { if (f) fclose(f); } // fclose(NULL) is undefined
  bool is_file() { return (!(f==NULL)); }
};

int main(int argc, char* argv[])
{
  ascii_file_istream fi("c:\\testfile.htm");

  if (!fi.is_file())
    return 0;

  markup::scanner sc(fi);
  bool in_text = false;
  while(true)
  {
    int t = sc.get_token();
    switch(t)
    {
      case markup::scanner::TT_EOF:
        printf("EOF\n");
        goto FINISH;
      case markup::scanner::TT_SPACE:
        printf("SPACE\n");
        break;
      case markup::scanner::TT_WORD:
        {
          const markup::wchar* w = sc.get_value();
          printf("WORD: {%S}\n", sc.get_value());
        }
        break;
      // The rest of the cases
      // ...
    }
  }
FINISH:
  printf("--------------------------\n");
  return 0;
}

This code works perfectly well when used to scan English documents. It does not work when I use it to scan documents with non-English words. The only way I could make it work was by setting the default charset to Unicode in the project properties and re-saving the file as Unicode with Notepad (UTF-8 didn't work either, only Unicode); only then did w in the TT_WORD case get the correct value and not a set of squares. Also note that I'm reading wchar_t from the file, not char as you suggested in the original code you posted a while ago. I never got printf (or wprintf) to output the correct chars to the screen, even when the word was read all right.

My question is: how can I read a non-English file correctly when it is saved as UTF-8 or ANSI? Is the scanner itself aware of those characters, or does it just ignore them, so that once I call sc.get_value() I can convert the result to whatever encoding I need and it will display fine?
Also, can I still read the file with your code:
virtual wchar_t get_char() { char c; pos++; return fread(&c,1,1,f)? c : 0; }
and still read the non-English words correctly?
The best practice would be for the reader to be independent of the file encoding. Is that possible?

Please advise.

Stilgar.
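
On making the reader independent of the file encoding: one common first step is to look for a byte order mark (BOM) at the start of the file and pick the decoder from it. The sketch below assumes a hypothetical helper named detect_bom; it is not part of the scanner. A file without a BOM (plain ANSI, or UTF-8 saved without one) cannot be told apart this way and comes back as Unknown.

```cpp
#include <cassert>
#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: inspects the first bytes of a buffer for a BOM.
// bom_size receives the number of bytes to skip before decoding.
// Note: a UTF-32 LE BOM (FF FE 00 00) is not distinguished from UTF-16 LE here.
Encoding detect_bom(const unsigned char* data, std::size_t size, std::size_t& bom_size)
{
    assert(data != nullptr || size == 0);
    bom_size = 0;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        bom_size = 3;
        return Encoding::Utf8;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        bom_size = 2;
        return Encoding::Utf16LE;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        bom_size = 2;
        return Encoding::Utf16BE;
    }
    return Encoding::Unknown;
}
```

The input stream could then choose between a byte-wise reader, a UTF-8 decoder, or a 16-bit reader based on the result, and the scanner itself would stay unchanged.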

Here is what I use for parsing UTF8 encoded streams:

class mem_utf8_istream: public markup::instream
{
  bytes buf;
  int   pos;
public:
  mem_istream(bytes text) : buf(text), pos(0) { }
  virtual wchar_t get_char() { return getc_utf8(buf, pos); }
};

Where getc_utf8 is this:

inline uint get_next_utf8(unsigned int val)
{
  // Check for the correct bits at the start.
  assert((val & 0xc0) == 0x80);
  //bad continuation of multi-byte UTF-8 sequence
  // Return the significant bits.
  return (val & 0x3f);
}

inline unsigned int getb(const bytes& buf, int& pos)
{
  if( uint(pos) >= buf.length )
    return 0;
  return buf[pos++];
}

// ATTN: UCS-2 only!
wchar getc_utf8(const bytes& buf, int& pos)
{
    unsigned int b1;
    bool is_surrogate = false;

    b1 = getb(buf,pos);
    if(!b1)
      return 0;

    // Determine whether we are dealing
    // with a one-, two-, three-, or four-
    // byte sequence.
    if ((b1 & 0x80) == 0)
    {
      // 1-byte sequence: 000000000xxxxxxx = 0xxxxxxx
      return (wchar)b1;
    }
    else if ((b1 & 0xe0) == 0xc0)
    {
      // 2-byte sequence: 00000yyyyyxxxxxx = 110yyyyy 10xxxxxx
      uint r = (b1 & 0x1f) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r;
    }
    else if ((b1 & 0xf0) == 0xe0)
    {
      // 3-byte sequence: zzzzyyyyyyxxxxxx = 1110zzzz 10yyyyyy 10xxxxxx
      uint r = (b1 & 0x0f) << 12;
           r |= get_next_utf8(getb(buf,pos)) << 6;
           r |= get_next_utf8(getb(buf,pos));
      return (wchar)r;
    }
    else if ((b1 & 0xf8) == 0xf0)
    {
      // 4-byte sequence: 11101110wwwwzzzzyy + 110111yyyyxxxxxx
      //     = 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
      // (uuuuu = wwww + 1)
      is_surrogate = true;
      return L'?';
      /*
      int b2 = get_next_utf8(pc++);
      int b3 = get_next_utf8(pc++);
      int b4 = get_next_utf8(pc++);
      buf +=
        (wchar)(0xd800 |
         ((((b1 & 0x07) << 2) | ((b2 & 0x30) >> 4) - 1) << 6) |
         ((b2 & 0x0f) << 2) |
         ((b3 & 0x30) >> 4));
      buf +=
        (wchar)(0xdc | ((b3 & 0x0f) << 6) | b4);
            // TODO: test that surrogate value is legal.
      */
    }
    else
    {
      assert(0);
      return L'?';
      //bad start for UTF-8 multi-byte sequence"
    }
}

Hi c-smile,
Great parser, it's exactly what I need.
When I follow your notes regarding UTF-8, problems pop up.

A few questions:
1. The name of the class and the name of the constructor are different. I assume it is a typo and should be mem_utf8_istream(bytes text) : buf(text), pos(0) { }, right?
2. When adding this class, I get a compiler error for "virtual wchar_t get_char()... " which says: 'getc_utf8': identifier not found.
When I add the prototype to the instream struct, I hit a linker error.
Can you please clarify what is missing, or maybe update the code with UTF-8 support?

Thanks!

Did somebody make it work for UTF-8? :^)

Here is the ucode_from_utf8(bytes& buf) function, which reads one Unicode code point from a sequence of bytes. You can use it to implement an input stream that reads a UTF-8 encoded markup stream.

C++

// range of bytes
struct bytes {
  const unsigned char* start;
  size_t               length; 
};

inline uint get_next_utf8(unsigned int val)
{
  // Check for the correct bits at the start.
  assert((val & 0xc0) == 0x80);
  //bad continuation of multi-byte UTF-8 sequence

  // Return the significant bits.
  return (val & 0x3f);
}

// gets one byte from bytes sequence reducing range by 1
inline unsigned int getb(bytes& buf)
{
  if( buf.length == 0)
    return 0;
  unsigned char b = *buf.start;
  ++buf.start; 
  --buf.length;
  return b;
}

// gets one unicode code point from sequence of utf8 bytes.
// reduces bytes range accordingly
uint ucode_from_utf8(bytes& buf)
{
    unsigned int b1;
    
    b1 = getb(buf);
    if(!b1)
      return 0;
    
    // Determine whether we are dealing
    // with a one-, two-, three-, or four-
    // byte sequence.
    if ((b1 & 0x80) == 0)
    {
      // 1-byte sequence: 000000000xxxxxxx = 0xxxxxxx
      return (wchar)b1;
    }
    else if ((b1 & 0xe0) == 0xc0)
    {
      // 2-byte sequence: 00000yyyyyxxxxxx = 110yyyyy 10xxxxxx
      uint r = (b1 & 0x1f) << 6;
           r |= get_next_utf8(getb(buf));
      return (wchar)r;
    }
    else if ((b1 & 0xf0) == 0xe0)
    {
      // 3-byte sequence: zzzzyyyyyyxxxxxx = 1110zzzz 10yyyyyy 10xxxxxx
      uint r = (b1 & 0x0f) << 12;
           r |= get_next_utf8(getb(buf)) << 6;
           r |= get_next_utf8(getb(buf));
      return (wchar)r;
    }
    else if ((b1 & 0xf8) == 0xf0)
    {
      // 4-byte sequence: 11101110wwwwzzzzyy + 110111yyyyxxxxxx
      //     = 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
      // (uuuuu = wwww + 1)
      int b2 = get_next_utf8(getb(buf));
      int b3 = get_next_utf8(getb(buf));
      int b4 = get_next_utf8(getb(buf));
      return ((b1 & 7) << 18) | ((b2 & 0x3f) << 12) |
              ((b3 & 0x3f) << 6) | (b4 & 0x3f);
    }
    else
    {
      assert(0);
      return L'?';
      //bad start for UTF-8 multi-byte sequence
    }
}
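
A minimal sketch of how such a decoder might be wrapped in an input stream. To keep it compilable on its own, it uses a stand-in instream base (assumed to mirror markup::instream with a single virtual get_char()) and a condensed stand-in for ucode_from_utf8() that returns '?' on malformed input instead of asserting. Both stand-ins are assumptions for illustration. Note that the decoder is defined before the class that calls it, which avoids the "identifier not found" error mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for markup::instream from the article (assumption): a single
// virtual get_char() that returns 0 at end of input.
struct instream {
    virtual wchar_t get_char() = 0;
    virtual ~instream() {}
};

// Range of bytes, as in the post above.
struct bytes {
    const unsigned char* start;
    std::size_t          length;
};

// Condensed stand-in for ucode_from_utf8(): decodes one code point and
// advances the range. Returns 0 at end of input, '?' on malformed input.
unsigned int ucode_from_utf8(bytes& buf)
{
    if (buf.length == 0) return 0;
    unsigned int b1 = *buf.start++;
    --buf.length;
    int extra;        // number of continuation bytes expected
    unsigned int cp;  // code point being assembled
    if      ((b1 & 0x80) == 0x00) { extra = 0; cp = b1; }
    else if ((b1 & 0xe0) == 0xc0) { extra = 1; cp = b1 & 0x1f; }
    else if ((b1 & 0xf0) == 0xe0) { extra = 2; cp = b1 & 0x0f; }
    else if ((b1 & 0xf8) == 0xf0) { extra = 3; cp = b1 & 0x07; }
    else return '?';  // bad start of a multi-byte sequence
    while (extra-- > 0) {
        // Truncated input or a byte that is not 10xxxxxx.
        if (buf.length == 0 || (*buf.start & 0xc0) != 0x80) return '?';
        cp = (cp << 6) | (*buf.start++ & 0x3f);
        --buf.length;
    }
    return cp;
}

class mem_utf8_istream : public instream {
    bytes buf;
public:
    explicit mem_utf8_istream(bytes text) : buf(text) {}
    virtual wchar_t get_char() {
        unsigned int cp = ucode_from_utf8(buf);
        // Where wchar_t is 16 bits (e.g. Windows), a code point above U+FFFF
        // does not fit in one unit, so it is replaced here.
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) return L'?';
        return static_cast<wchar_t>(cp);
    }
};
```

With the real headers, the class would derive from markup::instream and be passed to markup::scanner in the same way as the file stream shown earlier in the thread.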

This fragment causes problems:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 9.0, SVG Export Plug-In -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [
<!ENTITY st0 "fill:#E61408;">
<!ENTITY st1 "fill:#1C1585;">
]>

This is parsed as:

TAG START: ?xml
TT_ATTR: version 1.0
TT_ATTR: encoding utf-8
TT_ATTR: ?
TT_DATA: ? Generator: Adobe Illustrator 9.0, SVG Export Plug-In
TAG START: !DOCTYPE
TT_ATTR: svg
TT_ATTR: PUBLIC
TT_ATTR: "-//W3C//DTD
TT_ATTR: SVG
TT_ATTR: 20000303
TT_ATTR: Stylable//EN"
TT_ATTR: "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd"
TT_ATTR: [
TT_ATTR: !ENTITY
TT_ATTR: st0
TT_ATTR: "fill:#E61408;"
TT_ENTITY_START: !ENTITY
TT_DATA: "fill:#E61408;" st1 "fill:#1C1585;"
TT_ENTITY_END: !ENTITY

Any thoughts on the best way to integrate a fix? Perhaps by having a dedicated scan_doctype()?

TIA

Jerry
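
To illustrate what a dedicated DOCTYPE scan has to cope with: the declaration ends at the first '>' that is outside any quoted literal and outside the [ ... ] internal subset. The sketch below is a string-based illustration under that assumption. The helper name find_doctype_end is hypothetical and this is not the scanner's actual implementation; comments inside the internal subset are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given the position just after "<!DOCTYPE", returns the
// index of the '>' that closes the declaration, or std::string::npos if the
// input ends first.
std::size_t find_doctype_end(const std::string& s, std::size_t pos)
{
    int  depth = 0;  // > 0 while inside the [ ... ] internal subset
    char quote = 0;  // active quote character, or 0 when outside a literal
    for (; pos < s.size(); ++pos) {
        char c = s[pos];
        if (quote) {
            if (c == quote) quote = 0;  // closing quote ends the literal
            continue;                   // everything inside a literal is skipped
        }
        if (c == '"' || c == '\'') quote = c;
        else if (c == '[') ++depth;
        else if (c == ']') { if (depth > 0) --depth; }
        else if (c == '>' && depth == 0) return pos;
    }
    return std::string::npos;
}
```

Everything between the start position and the returned index can then be handed to the caller as raw data.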

Thanks for that, I've updated the distribution at:
http://www.terrainformatica.com/org/xh_scanner_demo.zip

The scanning loop now looks like this:

while(true)
  {
    int t = sc.get_token();
    switch(t)
    {
      case markup::scanner::TT_ERROR:
        printf("ERROR\n");
        break;
      case markup::scanner::TT_EOF:
        printf("EOF\n");
        goto FINISH;
      case markup::scanner::TT_TAG_START:
        printf("TAG START:%s\n", sc.get_tag_name());
        break;
      case markup::scanner::TT_TAG_END:
        printf("TAG END:%s\n", sc.get_tag_name());
        break;
      case markup::scanner::TT_ATTR:
        printf("\tATTR:%s=%S\n", sc.get_attr_name(), sc.get_value());
        break;
      case markup::scanner::TT_WORD: 
      case markup::scanner::TT_SPACE:
        printf("{%S}\n", sc.get_value());
        break;
      case markup::scanner::TT_PI_START:
        printf("\tPI");
        break;
      case markup::scanner::TT_PI_END:
        printf("\n");
        break;
      case markup::scanner::TT_DOCTYPE_START:
        printf("\tDOCTYPE");
        break;
      case markup::scanner::TT_DOCTYPE_END:
        printf("\n");
        break;
      case markup::scanner::TT_DATA:
        printf("[%S]", sc.get_value());
        break;
    }
  }

Andrew, that is very helpful but there is still a problem recognising the first entity in the DOCTYPE section. The output below is from parsing the original doctype example.

TT_DOCTYPE_START
TT_DATA: svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [

(Implied)TT_DOCTYPE_END

I see.

I have updated the sources again with the fix.

The scanner does not try to parse the content of DOCTYPE; it just passes the content "as is" to the caller.
Thus the scanner does not do any DTD parsing. That is out of scope for the scanner anyway.
If someone wants to implement such parsing/support, let me know.

Many thanks. For my purposes parsing the DTD is overkill, but it is important to be able to extract entity definitions correctly. This is an excellent tool for the job.

OK, the current method is fine, as the entities in the doctype scope can be parsed separately. One trivial suggestion, for clarity: how about adding a TT_DOCTYPE_DATA enumeration value that is returned by scanner::scan_doctype_decl()? This makes client code a tiny bit simpler and the intent clearer.

Thanks again.
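
A minimal sketch of parsing the entities separately from the raw DOCTYPE text. The helper name extract_entities is hypothetical. It only handles the simple form <!ENTITY name "value">; parameter entities (with '%') and external entities (SYSTEM/PUBLIC) would need extra handling.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: collects <!ENTITY name "value"> pairs from the raw
// DOCTYPE text that the scanner passes through as data.
std::map<std::string, std::string> extract_entities(const std::string& data)
{
    std::map<std::string, std::string> out;
    const std::string key = "<!ENTITY";
    const char* ws = " \t\r\n";
    std::size_t pos = 0;
    while ((pos = data.find(key, pos)) != std::string::npos) {
        pos += key.size();
        // Name: skip whitespace, then read up to the next whitespace.
        std::size_t name_start = data.find_first_not_of(ws, pos);
        if (name_start == std::string::npos) break;
        std::size_t name_end = data.find_first_of(ws, name_start);
        if (name_end == std::string::npos) break;
        // Value: the next literal in single or double quotes.
        std::size_t q1 = data.find_first_of("\"'", name_end);
        if (q1 == std::string::npos) break;
        std::size_t q2 = data.find(data[q1], q1 + 1);
        if (q2 == std::string::npos) break;
        out[data.substr(name_start, name_end - name_start)] =
            data.substr(q1 + 1, q2 - q1 - 1);
        pos = q2 + 1;
    }
    return out;
}
```

For the SVG example from this thread, the resulting map would hold st0 and st1 with their fill values.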

It now returns one or more TT_DATA tokens.

In principle, a typical text->DOM parser should have something like this:

switch( token_stream.get_token() ) 
{
  case  TT_DOCTYPE_START:
    parse_DOCTYPE(token_stream);
    break;
  case  TT_COMMENT_START:
    ...
}

where parse_DOCTYPE() in turn would contain:

while(1)
  switch( token_stream.get_token() ) 
{
  case  TT_ENTITY_DECL_START:
    parse_ENTITY_decl(token_stream);
    break;
  case  TT_ATTR_DECL_START:
    parse_ATTR_decl(token_stream);
    break;
  ...
}

But again, this scanner was designed for cases where "linear" XML/HTML scanning is required, so DOCTYPE and local DTD parsing were out of scope.
Typical use case: an HTML -> plain text converter.
Another example: we use a customized version of the scanner for DOM-less SVG rendering in htmlayout/sciter.
It scans (some subset of) SVG and draws elements as they appear in the source, without building an SVG DOM.

I've commented out the TT_ENTITY_DECL_START handling. It can be enabled if someone decides to build a full parser.
Things like TT_ATTR_DECL_START can be added in the same way as TT_ENTITY_DECL_START.

Many thanks Andrew,

you have enabled me to cross another task off my todo list. Gets my 5.

Can you clarify licensing please?

Thx++

Jerry.

I publish all my stuff under the BSD license.
So this one is under the BSD license too.

http://www.terrainformatica.com/org/xh_scanner_demo.zip

I've spotted what appears to be another problem with your very useful markup scanner.

Sometimes an attribute-value might be a quoted-url that includes a querystring:

The scanner sees the '&' and calls scan_entity(). That function will read up to 31 chars looking for the terminating ';'. When it doesn't find it, it simply appends those 31 chars to the value of the attribute. In the case above, that means the terminating '"' is passed over...

My solution is good enough for my purpose but isn't perfect: I pass a delimiter to scan_entity(), which it also checks for when collecting chars into its buf[]. The for() loop winds up looking like:

...
      for(; i < 31 ; ++i )
      {
        t = get_char();
        if(t == 0) return TT_EOF;
        if (delim == ' ' && is_whitespace(t) || t == delim) {
            push_back(t);
            append_value('&');
            for(int n = 0; n < i-1; ++n)
               append_value(buf[n]);
            return buf[i-1];
        }
        buf[i] = char(t); 
        if(t == ';')
          break;
      }

the 4 calls to scan_entity() become

...
  scan_entity(0);
...
  scan_entity('"');
...
  scan_entity('\'');
...
  scan_entity(' ');

-scott
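
The idea of stopping the entity scan at the attribute delimiter can be illustrated on a plain string. The sketch below is not the scanner's scan_entity(), which works on a character stream; the helper name entity_name_length and the sample URL in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, string-based illustration. pos is the index just after '&'.
// Returns the length of the entity name if a ';' is found within limit
// characters and before delim; returns 0 if the '&' should be treated as a
// literal ampersand.
std::size_t entity_name_length(const std::string& s, std::size_t pos,
                               char delim, std::size_t limit = 31)
{
    for (std::size_t i = 0; i < limit && pos + i < s.size(); ++i) {
        char c = s[pos + i];
        if (c == ';') return i;    // terminated: s.substr(pos, i) is the name
        if (c == delim) return 0;  // reached the end of the attribute value first
        // For unquoted values (delim == ' '), any whitespace ends the value.
        if (delim == ' ' && (c == '\t' || c == '\r' || c == '\n')) return 0;
    }
    return 0;  // no ';' within the limit
}
```

For an attribute such as href="page?x=1&y=2" followed by another attribute containing a ';', the check against the delimiter is what keeps the scan from running past the closing quote.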

Thanks dr3d.

In fact, the fragment at xh_scanner.cpp (126)
needs to be fixed like this:

else  // scan token, allowed in html: e.g. align=center
  do
  {
      if( is_whitespace(c) ) return TT_ATTR;
      /* these two removed in favour of better html support:
      if( c == '/' || c == '>' ) { push_back(c); return TT_ATTR; }
      if( c == '&' ) c = scan_entity();*/
      if( c == '>' ) { push_back(c); return TT_ATTR; }
      append_value(c);
  } while(c = get_char());

I also slightly updated scan_entity().

I've sent the updates to the CodeProject people, so these changes will appear in the source soon.

Nice lightweight parser, thanks for sharing.
I think I may have found a bug: if a nested tag begins and ends, get_tag_name() still returns that nested tag. For example: "

Text bold text

" For each word, the parser says that its tag is (respectively): "p b b b". This is incorrect, as it should output "p b b p".

-Tyson

get_tag_name() returns a meaningful value only for TT_TAG_START and TT_TAG_END tokens.
Otherwise the tokenizer would have to maintain a stack of elements. That requires memory allocations, a thing I was trying to avoid in the scanner for many reasons.
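
If the enclosing element is needed for each word, the stack can live in the calling code instead. A minimal sketch follows, using a hypothetical Token struct in place of the scanner's get_token() / get_tag_name() calls. It assumes well-nested markup; HTML with omitted end tags would need additional rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token shape for illustration only; the real scanner returns
// token codes from get_token() and exposes names via get_tag_name().
enum TokenType { TAG_START, TAG_END, WORD };
struct Token { TokenType type; std::string text; };

// Returns, for every WORD token, the name of the innermost open element.
std::vector<std::string> enclosing_tags(const std::vector<Token>& tokens)
{
    std::vector<std::string> open;    // stack of currently open elements
    std::vector<std::string> result;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const Token& t = tokens[i];
        switch (t.type) {
        case TAG_START:
            open.push_back(t.text);
            break;
        case TAG_END:
            if (!open.empty()) open.pop_back();
            break;
        case WORD:
            result.push_back(open.empty() ? std::string() : open.back());
            break;
        }
    }
    return result;
}
```

This keeps the allocations in the caller, so the scanner itself can stay allocation-free.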
