Hi,
This is the code I'm using in the sample app:
struct ascii_file_istream : public markup::instream
{
    FILE *f;
    unsigned int pos;

    ascii_file_istream(const char* filename) : f(NULL), pos(0)
    {
        f = fopen(filename, "rb");
    }
    virtual wchar_t get_char()
    {
        wchar_t c;
        pos++;
        return fread(&c, sizeof(wchar_t), 1, f) ? c : 0;
    }
    // fclose(NULL) is not allowed, so only close a file that was opened
    ~ascii_file_istream() { if (f) fclose(f); }
    bool is_file() { return (!(f == NULL)); }
};

int main(int argc, char* argv[])
{
    ascii_file_istream fi("c:\\testfile.htm");
    if (!fi.is_file())
        return 0;

    markup::scanner sc(fi);
    bool in_text = false;
    while(true)
    {
        int t = sc.get_token();
        switch(t)
        {
        case markup::scanner::TT_EOF:
            printf("EOF\n");
            goto FINISH;
        case markup::scanner::TT_SPACE:
            printf("SPACE\n");
            break;
        case markup::scanner::TT_WORD:
            {
                const markup::wchar* w = sc.get_value();
                printf("WORD: {%S}\n", sc.get_value());
            }
            break;
        // The rest of the cases
        // ...
        }
    }
FINISH:
    printf("--------------------------\n");
    return 0;
}
This code works perfectly well when scanning English documents, but not with documents that contain non-English words. The only way I could make it work was to set the default charset to Unicode in the project properties and re-save the file as Unicode with Notepad (UTF-8 didn't work either, only Unicode). Only then did w in the TT_WORD case get the correct value rather than a row of squares. Also note that I'm reading wchar_t from the file, not char as you suggested in the original code you posted a while ago. I never got printf (or wprintf) to output the correct characters to the screen, even when the word was read correctly.
My question is: how can I read a non-English file correctly when it is saved as UTF-8 or ANSI? Is the scanner itself aware of those characters, or does it just ignore them, so that once I call sc.get_value() I can convert the result to whatever encoding I need and it will display fine? Also, can I still read the file with your code:
virtual wchar_t get_char() { char c; pos++; return fread(&c,1,1,f)? c : 0; }
and still read the non-English words correctly? The best practice would be for the reader to be independent of the file encoding. Is that possible?
Please advise.
Stilgar.
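One possible way to keep reading the file byte by byte and still get non-English characters is to decode UTF-8 before handing characters to the scanner. The following is only a minimal sketch under stated assumptions, not part of the scanner: utf8_decoder and next() are hypothetical names, the input is held in memory rather than read with fread, and code points are returned as char32_t (where wchar_t is 16 bits, code points above U+FFFF would additionally need surrogate pairs). ANSI files depend on a code page and are not covered here.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of the scanner: decodes UTF-8 bytes into
// Unicode code points, one per call.
struct utf8_decoder {
    std::string bytes;    // raw file contents, assumed to be UTF-8
    std::size_t pos = 0;  // index of the next unread byte

    explicit utf8_decoder(std::string s) : bytes(std::move(s)) {}

    // Returns the next code point, 0 at end of input, or U+FFFD for a
    // malformed or truncated sequence.
    char32_t next() {
        if (pos >= bytes.size()) return 0;
        unsigned char b0 = static_cast<unsigned char>(bytes[pos++]);
        if (b0 < 0x80) return b0;  // plain ASCII

        int extra = 0;     // number of continuation bytes expected
        char32_t cp = 0;   // code point accumulated so far
        if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
        else return 0xFFFD;  // stray continuation byte or invalid lead byte

        for (int i = 0; i < extra; ++i) {
            if (pos >= bytes.size()) return 0xFFFD;  // truncated sequence
            unsigned char b = static_cast<unsigned char>(bytes[pos]);
            if ((b & 0xC0) != 0x80) return 0xFFFD;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
            ++pos;
        }
        return cp;
    }
};
```

A get_char() override could call something like next() instead of returning each byte directly. Files saved as UTF-8 by Notepad typically start with a byte order mark, which this sketch returns as U+FEFF, so the caller would need to skip it.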
|
This fragment causes problems:
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 9.0, SVG Export Plug-In -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [
    <!ENTITY st0 "fill:#E61408;">
    <!ENTITY st1 "fill:#1C1585;">
]>
This is parsed as:
TAG START: ?xml
TT_ATTR: version 1.0
TT_ATTR: encoding utf-8
TT_ATTR: ?
TT_DATA: ? Generator: Adobe Illustrator 9.0, SVG Export Plug-In
TAG START: !DOCTYPE
TT_ATTR: svg
TT_ATTR: PUBLIC
TT_ATTR: "-//W3C//DTD
TT_ATTR: SVG
TT_ATTR: 20000303
TT_ATTR: Stylable//EN"
TT_ATTR: "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd"
TT_ATTR: [
TT_ATTR: !ENTITY
TT_ATTR: st0
TT_ATTR: "fill:#E61408;"
TT_ENTITY_START: !ENTITY
TT_DATA: "fill:#E61408;" st1 "fill:#1C1585;"
TT_ENTITY_END: !ENTITY
Any thoughts on the best way to integrate a fix? Perhaps by having a dedicated scan_doctype()?
TIA
Jerry
|
Thanks for that. I've updated the distribution at: http://www.terrainformatica.com/org/xh_scanner_demo.zip
The scanning loop now looks like this:
while(true)
{
    int t = sc.get_token();
    switch(t)
    {
    case markup::scanner::TT_ERROR:
        printf("ERROR\n");
        break;
    case markup::scanner::TT_EOF:
        printf("EOF\n");
        goto FINISH;
    case markup::scanner::TT_TAG_START:
        printf("TAG START:%s\n", sc.get_tag_name());
        break;
    case markup::scanner::TT_TAG_END:
        printf("TAG END:%s\n", sc.get_tag_name());
        break;
    case markup::scanner::TT_ATTR:
        printf("\tATTR:%s=%S\n", sc.get_attr_name(), sc.get_value());
        break;
    case markup::scanner::TT_WORD:
    case markup::scanner::TT_SPACE:
        printf("{%S}\n", sc.get_value());
        break;
    case markup::scanner::TT_PI_START:
        printf("\tPI");
        break;
    case markup::scanner::TT_PI_END:
        printf("\n");
        break;
    case markup::scanner::TT_DOCTYPE_START:
        printf("\tDOCTYPE");
        break;
    case markup::scanner::TT_DOCTYPE_END:
        printf("\n");
        break;
    case markup::scanner::TT_DATA:
        printf("[%S]", sc.get_value());
        break;
    }
}
|
Andrew, that is very helpful but there is still a problem recognising the first entity in the DOCTYPE section. The output below is from parsing the original doctype example.
TT_DOCTYPE_START
TT_DATA: svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [ <!ENTITY st0 "fill:#E61408;"
(Implied)TT_DOCTYPE_END
|
I see.
I have updated the sources again with the fix.
The scanner does not try to parse the content of DOCTYPE; it just passes the content "as is" to the caller. So the scanner does no DTD parsing, which is out of its scope anyway. If someone wants to implement such parsing/support, let me know.
|
Many thanks. For my purposes parsing the DTD is overkill, but it is important to be able to extract entity definitions correctly. This is an excellent tool for the job.
|
OK, the current method is fine, as the entities in the doctype scope can be parsed separately. One trivial suggestion, for clarity: how about adding a TT_DOCTYPE_DATA enumeration value that is returned by scanner::scan_doctype_decl()? This makes client code a tiny bit simpler and the intent clearer.
Thanks again.
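Since the DOCTYPE content arrives "as is", parsing the entities separately could look roughly like the sketch below. It is an illustration only: extract_entities is a hypothetical helper, not part of the scanner, and it handles only simple declarations of the form <!ENTITY name "value"> with a single- or double-quoted value (no parameter or external entities).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: collects <!ENTITY name "value"> declarations from the
// raw DOCTYPE text. Handles only simple internal entities with quoted values.
std::map<std::string, std::string> extract_entities(const std::string& doctype) {
    std::map<std::string, std::string> entities;
    const std::string marker = "<!ENTITY";
    const std::string spaces = " \t\r\n";
    std::size_t pos = 0;
    while ((pos = doctype.find(marker, pos)) != std::string::npos) {
        pos += marker.size();
        // The name runs from the first non-space character to the next space.
        std::size_t name_begin = doctype.find_first_not_of(spaces, pos);
        if (name_begin == std::string::npos) break;
        std::size_t name_end = doctype.find_first_of(spaces, name_begin);
        if (name_end == std::string::npos) break;
        // The value is delimited by a matching pair of ' or " characters.
        std::size_t quote = doctype.find_first_not_of(spaces, name_end);
        if (quote == std::string::npos) break;
        char q = doctype[quote];
        if (q != '"' && q != '\'') continue;  // not a simple quoted value, skip it
        std::size_t value_end = doctype.find(q, quote + 1);
        if (value_end == std::string::npos) break;
        entities[doctype.substr(name_begin, name_end - name_begin)] =
            doctype.substr(quote + 1, value_end - quote - 1);
        pos = value_end + 1;
    }
    return entities;
}
```

Fed the internal subset from the SVG example earlier in this thread, this would map st0 and st1 to their style strings.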
|
It now returns one or more TT_DATA tokens.
In principle, a typical text->DOM parser should have something like this:
switch( token_stream.get_token() )
{
    case TT_DOCTYPE_START: parse_DOCTYPE(token_stream); break;
    case TT_COMMENT_START: ...
}
where parse_DOCTYPE() in turn should contain:
while(1)
    switch( token_stream.get_token() )
    {
        case TT_ENTITY_DECL_START: parse_ENTITY_decl(token_stream); break;
        case TT_ATTR_DECL_START: parse_ATTR_decl(token_stream); break;
        ...
    }
But again, this scanner was designed for cases where "linear" XML/HTML scanning is required, so DOCTYPE and local DTD parsing were out of scope. A typical use case: an HTML -> plain text converter. Another example: we use a customized version of the scanner for DOM-less SVG rendering in htmlayout/sciter. It scans (some subset of) SVG and draws elements as they appear in the source, without building an SVG DOM.
I've commented out the TT_ENTITY_DECL_START handling. It should be enabled if someone decides to build a full parser. Things like TT_ATTR_DECL_START can be added in the same way as TT_ENTITY_DECL_START.
|
Many thanks Andrew,
you have enabled me to cross another task off my todo list. Gets my 5.
Can you clarify licensing please?
Thx++
Jerry.
|
I've spotted what appears to be another problem with your very useful markup scanner.
Sometimes an attribute-value might be a quoted-url that includes a querystring:
<input value="http://mysite?foo&bar" type=hidden name=foobar>
The scanner sees the '&' and calls scan_entity(). That function will read up to 31 chars looking for the terminating ';'. When it doesn't find it, it simply appends those 31 chars to the value of the attribute. In the case above, that means the terminating '"' is passed over...
My solution is good enough for my purpose but isn't perfect: I pass a delimiter to scan_entity(), which it also checks for when collecting chars into its buf[]. The for() loop winds up looking like:
...
for(; i < 31 ; ++i )
{
    t = get_char();
    if(t == 0) return TT_EOF;
    if (delim == ' ' && is_whitespace(t) || t == delim)
    {
        push_back(t);
        append_value('&');
        for(int n = 0; n < i-1; ++n)
            append_value(buf[n]);
        return buf[i-1];
    }
    buf[i] = char(t);
    if(t == ';') break;
}
The four calls to scan_entity() become:
...
scan_entity(0);
...
scan_entity('"');
...
scan_entity('\'');
...
scan_entity(' ');
-scott
|
Thanks dr3d.
In fact, the fragment at xh_scanner.cpp (126) needs to be fixed like this:
else // scan token, allowed in html: e.g. align=center
    do
    {
        if( is_whitespace(c) ) return TT_ATTR;
        /* these two removed in favour of better html support:
        if( c == '/' || c == '>' ) { push_back(c); return TT_ATTR; }
        if( c == '&' ) c = scan_entity();*/
        if( c == '>' ) { push_back(c); return TT_ATTR; }
        append_value(c);
    } while(c = get_char());
I also slightly updated scan_entity().
I've sent the updates to the CodeProject people, so these changes will appear in the source soon.
|
Nice lightweight parser, thanks for sharing. I think I may have found a bug: if a nested tag begins and ends, get_tag_name() still returns that nested tag. For example: "Text bold text " For each word, the parser says that its tag is (respectively): "p b b b". This is incorrect, as it should output "p b b p".
-Tyson
|
get_tag_name() returns a valid value only for TT_TAG_START and TT_TAG_END tokens. Otherwise the tokenizer would have to maintain a stack of elements. That requires memory allocations, something I was trying to avoid in the scanner for many reasons.
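A caller that needs the enclosing tag for every word can keep that stack itself. The sketch below is a rough illustration with stand-in types rather than the real scanner: the token struct and enclosing_tags() are hypothetical, the token names merely mirror the scanner's, and the markup is assumed to be well nested.

```cpp
#include <string>
#include <vector>

// Stand-ins for the scanner's tokens; only the three kinds needed here.
enum token_type { TT_TAG_START, TT_TAG_END, TT_WORD };

struct token {
    token_type type;
    std::string text;  // tag name for tag tokens, word text for TT_WORD
};

// Returns, for each TT_WORD token, the name of the innermost open element.
std::vector<std::string> enclosing_tags(const std::vector<token>& tokens) {
    std::vector<std::string> open;    // stack of currently open elements
    std::vector<std::string> result;
    for (const token& t : tokens) {
        switch (t.type) {
        case TT_TAG_START:
            open.push_back(t.text);
            break;
        case TT_TAG_END:
            // Assumes well-nested markup; real HTML would need recovery here.
            if (!open.empty()) open.pop_back();
            break;
        case TT_WORD:
            result.push_back(open.empty() ? std::string() : open.back());
            break;
        }
    }
    return result;
}
```

The allocations happen in the caller, so the scanner itself stays allocation-free.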
|
Thanks for the reply! I understand; the stack was easily implemented. I just wanted to bring this to your attention in case it was a bug [not really a bug, but a feature left out that you meant to put in]. The small memory footprint is fantastic, especially for the environment I am deploying this for. Thanks. -Tyson
|
Thank you for the contribution.
It seems that the tokenizer never returns TT_PI_START/END.
This means that e.g. <?xml version="1.0" encoding="iso-8859-1"?>
is matched as a standard tag without a tag end.
Is this a bug, or am I just missing something?
Jacob Skjøt
|
Hello!
I made a simple workaround:
1)
scanner::token_type scanner::scan_head()
{
    wchar c = skip_whitespace();

    if(c == '>') { c_scan = &scanner::scan_body; return scan_body(); }
    if(c == '?') //***pi
    {
        wchar t = get_char();
        if(t == '>') { c_scan = &scanner::scan_body; return TT_PI_END; }
        else { push_back(t); return TT_ERROR; }
    }
    if(c == '/')
    .......
2)
scanner::token_type scanner::scan_tag()
{
    .........
    switch(tag_name_length)
    {
    case 4: //***pi
        if(equal(tag_name,"?xml",4)) { c_scan = &scanner::scan_head; return TT_PI_START; }
        break;
    case 3:
    ...........
I don't use scan_pi() at all because it returns data as a string, while my way uses scan_head() and returns attribute/value pairs.
This code works for me, but it was a quick workaround without in-depth analysis!
It is written so simply (i.e. so well) that I converted it to C (not C++) code for my own purposes!
C-smile - it's a really great job!!!
Greetings Przemek
|
Hi. We often come across HTML code like this: "<a href=/script/profile/whos_who.asp?id=26162>Jack Hui</a> ". Here, the "/" symbol is treated as an end-tag symbol in your code, but that is in fact a misjudgment. Do you agree? Thanks.
digitalpump.
www.51qk.com
-- modified at 3:26 Saturday 17th June, 2006
|
HTML allows CDATA attributes to be unquoted provided the attribute value contains only letters (a to z and A to Z), digits (0 to 9), hyphens (ASCII decimal 45) or, periods (ASCII decimal 46). Attribute values can be quoted using double or single quote marks (ASCII decimal 34 and 39 respectively). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa.
Source: http://www.w3.org/TR/REC-html32#sgml
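The rule quoted above can be expressed as a small check. This is only a sketch: may_be_unquoted is a hypothetical name, and treating an empty value as requiring quotes is an assumption rather than something the quoted text states.

```cpp
#include <string>

// True if the value may appear unquoted under the HTML 3.2 rule quoted above:
// only ASCII letters, digits, hyphens, and periods.
bool may_be_unquoted(const std::string& value) {
    if (value.empty()) return false;  // assumption: an empty value needs quotes
    for (char ch : value) {
        bool ok = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
                  (ch >= '0' && ch <= '9') || ch == '-' || ch == '.';
        if (!ok) return false;
    }
    return true;
}
```

By this rule align=center is fine unquoted, while a value such as /script/profile/whos_who.asp?id=26162 is not, because it contains "/", "?", "_" and "=".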
Currently the parser uses one lookahead character. To support attributes like src=/something it would need two characters in the pushback buffer. Possible, but...
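For illustration, a two-character pushback buffer might be as small as the sketch below. It is not how the scanner is implemented: pushback2 and its members are hypothetical names, and the scanner's own get_char() would have to drain the buffer before reading new input.

```cpp
#include <cassert>

// Hypothetical two-slot pushback buffer: characters pushed back are returned
// in last-in, first-out order before any new input is read.
struct pushback2 {
    wchar_t slots[2];
    int count = 0;

    void push_back(wchar_t c) {
        assert(count < 2);  // more than two characters of lookahead is unsupported
        slots[count++] = c;
    }
    bool empty() const { return count == 0; }
    wchar_t pop() {
        assert(count > 0);
        return slots[--count];
    }
};
```

To un-read two characters that were read as '=' followed by '/', the caller pushes '/' first and '=' second, so that they pop in their original order.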
|
Hi. Oh, you may have misunderstood me. For example, there may be an abnormal HTML tag like this:
<a href=/script/profile/whos_who.asp?id=26162>Jack Hui</a> --abnormal tag
whereas a normal HTML tag looks like this:
<a href='/script/profile/whos_who.asp?id=26162'>jack Hui</a> --normal tag
Do you see the difference between the two? In the first sample (the abnormal tag), there are no quote marks around the attribute value, so the "/" symbol after "=" is treated as an end-tag symbol in your code.
digitalpump
www.51qk.com
-- modified at 4:24 Saturday 17th June, 2006
|
How can I get/set the current scanner position in a given stream, without triggering a new read? This way I can use the scanner for DOM operations as well, and not depend on MSXML or similar.
Also, how would you perform reading from a file and immediately parsing its contents using this scanner, without buffering the content in a char*/wchar*?
Stilgar.
|
"How can I get/set the current scanner position in a given stream, without triggering a new read? This way I can use the scanner for DOM operations as well, and not depend on MSXML or similar." The question is not clear. Create a new stream based on the fragment you have and parse it. Is this close to what you want?
"Also, how would you perform reading from a file and immediately parsing its contents using this scanner, without buffering the content in a char*/wchar*?" This is also not clear. 1) For simple cases I use an mm_file class similar to http://www.codeproject.com/cpp/flattables.asp . 2) Buffering (of the current token, sic!) is needed for many reasons. For example, the current word (token) may contain escaped things like "<".
|
When scanning a stream of length x, I need a way to get the current location in the stream (a numeric value) that the scanner is at, say y, which is half of x. I tried to do that by accessing some member variables/functions, but it always triggered the scan of the next character. How would you suggest achieving that?
"Buffering (of current token, sic!) is needed for many reasons. For example current word (token) has things like "<" escaped."
This is exactly what I meant: how can I make this scanner interface directly with a file, instead of reading the file myself and then passing its contents as a string? With large files this could boost performance.
Stilgar.
|
Use something like this:
struct ascii_file_istream: public markup::instream
{
    FILE *f;
    unsigned int pos;
    ascii_file_istream(const char* filename):pos(0) { .... }
    virtual wchar_t get_char() { char c; pos++; return fread(&c,1,1,f)? c : 0; }
};
instead of str_istream...
fread is buffered file I/O, so it should perform acceptably.
-- modified at 16:08 Saturday 3rd June, 2006
|