Comments and Discussions
---
Just to let people know that we reused much of this great scanner code to implement a small, fast scanner that can be called from Node, the server-side JavaScript environment.
Using it is very easy, as the following snippet shows:
var scanner = new Scanner("<div>content</div>");
var token;
do {
    token = scanner.next();
    console.dir(token);
} while (token[0]);
You can find the source code at: https://github.com/jbaron/htmlscanner
P.S. Tested only on Linux, not yet on other platforms. To get it compiling on Windows you also need to get the relevant Node stuff up and running, which could be a challenge.
regards,
JBaron
---
Error 4 error LNK2001: unresolved external symbol "private: enum markup::scanner::token_type __thiscall markup::scanner::scan_body(void)" (?scan_body@scanner@markup@@AAE?AW4token_type@12@XZ)
---
Did you include xh_scanner.cpp in your project?
---
Hi,
Here is an attempt to make this work for UTF-8 encoded text.
I took the ifs for the different-length UTF-8 multibyte characters from the Wikipedia article explaining UTF-8 encoding:

struct str_istream : public markup::instream
{
    const char* p;
    const char* end;
    str_istream(const char* src) : p(src), end(src + strlen(src)) {}
    virtual wchar_t get_char();
};
int mbtowc_Utf8(wchar_t* wchar, const char* mbchar, size_t count)
{
    int res = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mbchar, count, wchar, 1);
    if (res <= 0)
        res = -1;
    return res;
}
wchar_t str_istream::get_char()
{
    if (p < end)
    {
        const char* ps = p;
        if ((0xE0 & *p) == 0xC0)        // 2-byte sequence: 110xxxxx 10xxxxxx
        {
            p++;
            if (p >= end) return 0;
            p++;
            wchar_t wch;
            if (mbtowc_Utf8(&wch, ps, 2) == -1)
                return '?';
            return wch;
        }
        else if ((0xF0 & *p) == 0xE0)   // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
        {
            p++;
            if (p >= end) return 0;
            p++;
            if (p >= end) return 0;
            p++;
            wchar_t wch;
            if (mbtowc_Utf8(&wch, ps, 3) == -1)
                return '?';
            return wch;
        }
        else if ((0xF8 & *p) == 0xF0)   // 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        {
            p++;
            if (p >= end) return 0;
            p++;
            if (p >= end) return 0;
            p++;
            if (p >= end) return 0;
            p++;
            wchar_t wch;
            if (mbtowc_Utf8(&wch, ps, 4) == -1)
                return '?';
            return wch;
        }
        else                            // single byte
        {
            return *p++;
        }
    }
    else
    {
        return 0;
    }
}
Here is a sample text I tried it with:
const char* inp =
"<?xml version=\"1.0\" encoding=\"utf-8\"?>"
"<!-- Generator: Adobe Illustrator 9.0, SVG Export Plug-In -->"
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 20000303 Stylable//EN\" \"http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd\" ["
" <!ENTITY st0 \"fill:#E61408;\">"
" <!ENTITY st1 \"fill:#1C1585;\">"
"]>"
"<svg>text a <UTF8NODE ucAtr=AB\xef\xbb\xbf\xd8\xa7\xd9\x84\xd9\x87 /></svg>";
---
Here is what can be used for a UTF-8 input stream implementation.
bytes here is a simple struct:

struct bytes
{
    byte* start;
    uint  length;
};

The function wchar getc_utf8(const bytes& buf, int& pos) converts the input byte sequence into a single UCS-2 (16-bit) codepoint - the Basic Multilingual Plane subset of full Unicode.
inline uint get_next_utf8(unsigned int val)
{
    assert((val & 0xc0) == 0x80);   // must be a continuation byte 10xxxxxx
    return (val & 0x3f);
}

inline unsigned int getb(const bytes& buf, int& pos)
{
    if (uint(pos) >= buf.length)
        return 0;
    return buf.start[pos++];
}
wchar getc_utf8(const bytes& buf, int& pos)
{
    unsigned int b1 = getb(buf, pos);
    if (!b1)
        return 0;
    if ((b1 & 0x80) == 0)               // 1 byte: 0xxxxxxx (ASCII)
    {
        return (wchar)b1;
    }
    else if ((b1 & 0xe0) == 0xc0)       // 2 bytes: 110xxxxx 10xxxxxx
    {
        uint r = (b1 & 0x1f) << 6;
        r |= get_next_utf8(getb(buf, pos));
        return (wchar)r;
    }
    else if ((b1 & 0xf0) == 0xe0)       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        uint r = (b1 & 0x0f) << 12;
        r |= get_next_utf8(getb(buf, pos)) << 6;
        r |= get_next_utf8(getb(buf, pos));
        return (wchar)r;
    }
    else if ((b1 & 0xf8) == 0xf0)       // 4 bytes: outside the BMP, would need a surrogate pair
    {
        return L'?';
    }
    else
    {
        assert(0);
        return L'?';
    }
}
UCS-2 strings are what Windows used (the so-called LPCWSTR) prior to XP. Since Windows XP, LPCWSTR is a UTF-16 string.
Welcome to the UTFs!
---
Hi,
This is the code I'm using in the sample app:
struct ascii_file_istream : public markup::instream
{
    FILE* f;
    unsigned int pos;
    ascii_file_istream(const char* filename) : f(NULL), pos(0) { f = fopen(filename, "rb"); }
    virtual wchar_t get_char() { wchar_t c; pos++; return fread(&c, sizeof(wchar_t), 1, f) ? c : 0; }
    ~ascii_file_istream() { if (f) fclose(f); }
    bool is_file() { return f != NULL; }
};
int main(int argc, char* argv[])
{
    ascii_file_istream fi("c:\\testfile.htm");
    if (!fi.is_file())
        return 0;
    markup::scanner sc(fi);
    while (true)
    {
        int t = sc.get_token();
        switch (t)
        {
        case markup::scanner::TT_EOF:
            printf("EOF\n");
            goto FINISH;
        case markup::scanner::TT_SPACE:
            printf("SPACE\n");
            break;
        case markup::scanner::TT_WORD:
        {
            const markup::wchar* w = sc.get_value();
            printf("WORD: {%S}\n", w);
            break;
        }
        // The rest of the cases
        // ...
        }
    }
FINISH:
    printf("--------------------------\n");
    return 0;
}
This code works perfectly well when scanning English documents, but not when scanning documents with non-English words. The only way I could make it work was by setting the default charset to Unicode in the project properties and re-saving the file as Unicode with Notepad (UTF-8 didn't work either, only Unicode); only then did w in the TT_WORD case get the correct value and not a row of squares. Also note that I'm reading wchar_t from the file, not char as you suggested in the code you posted a while ago. I never got printf (nor wprintf) to output the correct chars to the screen, even when the word was read correctly.
My question is: how can I read a non-English file correctly when it is saved as UTF-8 or ANSI? Is the scanner itself aware of those characters, or does it just pass them through, so that once I call sc.get_value() I can convert the result to whatever encoding I need and it will display fine?
Also, can I still read the file with your code:
virtual wchar_t get_char() { char c; pos++; return fread(&c, 1, 1, f) ? c : 0; }
and still read the non-English words correctly?
The best practice would be for the reader to be independent of the file encoding. Is that possible?
Please advise.
Stilgar.
---
Here is what I use for parsing UTF-8 encoded streams:

class mem_utf8_istream : public markup::instream
{
    bytes buf;
    int   pos;
public:
    mem_istream(bytes text) : buf(text), pos(0) { }
    virtual wchar_t get_char() { return getc_utf8(buf, pos); }
};
Where getc_utf8 is this:
inline uint get_next_utf8(unsigned int val)
{
    assert((val & 0xc0) == 0x80);   // must be a continuation byte 10xxxxxx
    return (val & 0x3f);
}

inline unsigned int getb(const bytes& buf, int& pos)
{
    if (uint(pos) >= buf.length)
        return 0;
    return buf.start[pos++];
}

wchar getc_utf8(const bytes& buf, int& pos)
{
    unsigned int b1 = getb(buf, pos);
    if (!b1)
        return 0;
    if ((b1 & 0x80) == 0)               // 1 byte: 0xxxxxxx (ASCII)
    {
        return (wchar)b1;
    }
    else if ((b1 & 0xe0) == 0xc0)       // 2 bytes: 110xxxxx 10xxxxxx
    {
        uint r = (b1 & 0x1f) << 6;
        r |= get_next_utf8(getb(buf, pos));
        return (wchar)r;
    }
    else if ((b1 & 0xf0) == 0xe0)       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        uint r = (b1 & 0x0f) << 12;
        r |= get_next_utf8(getb(buf, pos)) << 6;
        r |= get_next_utf8(getb(buf, pos));
        return (wchar)r;
    }
    else if ((b1 & 0xf8) == 0xf0)       // 4 bytes: outside the BMP, would need a surrogate pair
    {
        return L'?';
    }
    else
    {
        assert(0);
        return L'?';
    }
}
---
Hi c-smile,
Great parser - it's exactly what I need.
When I follow your notes regarding UTF-8, problems pop up.
A few questions:
1. The name of the class and the constructor differ - I assume it is a typo and the constructor should be mem_utf8_istream(bytes text) : buf(text), pos(0) { }, right?
2. When I add this class, I get a compiler error for "virtual wchar_t get_char()...": 'getc_utf8': identifier not found. When I add the prototype into the instream struct, I hit a linker error.
Can you please clarify what is missing, or maybe update the code with UTF-8 support?
Thanks!
---
Did somebody make it work for UTF-8?
---
Here is a ucode_from_utf8(bytes& buf) function that reads one wide char from a sequence of bytes. You can use it to implement an input stream that reads a UTF-8 encoded markup stream.

struct bytes {
    const unsigned char* start;
    size_t length;
};
inline uint get_next_utf8(unsigned int val)
{
    assert((val & 0xc0) == 0x80);   // must be a continuation byte 10xxxxxx
    return (val & 0x3f);
}

inline unsigned int getb(bytes& buf)
{
    if (buf.length == 0)
        return 0;
    unsigned char b = *buf.start;
    ++buf.start;
    --buf.length;
    return b;
}
uint ucode_from_utf8(bytes& buf)
{
    unsigned int b1 = getb(buf);
    if (!b1)
        return 0;
    if ((b1 & 0x80) == 0)               // 1 byte: 0xxxxxxx (ASCII)
    {
        return b1;
    }
    else if ((b1 & 0xe0) == 0xc0)       // 2 bytes: 110xxxxx 10xxxxxx
    {
        uint r = (b1 & 0x1f) << 6;
        r |= get_next_utf8(getb(buf));
        return r;
    }
    else if ((b1 & 0xf0) == 0xe0)       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        uint r = (b1 & 0x0f) << 12;
        r |= get_next_utf8(getb(buf)) << 6;
        r |= get_next_utf8(getb(buf));
        return r;
    }
    else if ((b1 & 0xf8) == 0xf0)       // 4 bytes: codepoint above the BMP
    {
        int b2 = get_next_utf8(getb(buf));
        int b3 = get_next_utf8(getb(buf));
        int b4 = get_next_utf8(getb(buf));
        return ((b1 & 7) << 18) | ((b2 & 0x3f) << 12) |
               ((b3 & 0x3f) << 6) | (b4 & 0x3f);
    }
    else
    {
        assert(0);
        return L'?';
    }
}
---
This fragment causes problems:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 9.0, SVG Export Plug-In -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [
  <!ENTITY st0 "fill:#E61408;">
  <!ENTITY st1 "fill:#1C1585;">
]>

This is parsed as:

TAG START: ?xml
TT_ATTR: version 1.0
TT_ATTR: encoding utf-8
TT_ATTR: ?
TT_DATA: ? Generator: Adobe Illustrator 9.0, SVG Export Plug-In
TAG START: !DOCTYPE
TT_ATTR: svg
TT_ATTR: PUBLIC
TT_ATTR: "-//W3C//DTD
TT_ATTR: SVG
TT_ATTR: 20000303
TT_ATTR: Stylable//EN"
TT_ATTR: "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd"
TT_ATTR: [
TT_ATTR: !ENTITY
TT_ATTR: st0
TT_ATTR: "fill:#E61408;"
TT_ENTITY_START: !ENTITY
TT_DATA: "fill:#E61408;" st1 "fill:#1C1585;"
TT_ENTITY_END: !ENTITY

Any thoughts on the best way to integrate a fix? Perhaps a dedicated scan_doctype()?
TIA, Jerry
---
Thanks for that, I've updated the distribution at:
http://www.terrainformatica.com/org/xh_scanner_demo.zip
The scanning loop now looks like this:

while (true)
{
    int t = sc.get_token();
    switch (t)
    {
    case markup::scanner::TT_ERROR:
        printf("ERROR\n");
        break;
    case markup::scanner::TT_EOF:
        printf("EOF\n");
        goto FINISH;
    case markup::scanner::TT_TAG_START:
        printf("TAG START:%s\n", sc.get_tag_name());
        break;
    case markup::scanner::TT_TAG_END:
        printf("TAG END:%s\n", sc.get_tag_name());
        break;
    case markup::scanner::TT_ATTR:
        printf("\tATTR:%s=%S\n", sc.get_attr_name(), sc.get_value());
        break;
    case markup::scanner::TT_WORD:
    case markup::scanner::TT_SPACE:
        printf("{%S}\n", sc.get_value());
        break;
    case markup::scanner::TT_PI_START:
        printf("\tPI");
        break;
    case markup::scanner::TT_PI_END:
        printf("\n");
        break;
    case markup::scanner::TT_DOCTYPE_START:
        printf("\tDOCTYPE");
        break;
    case markup::scanner::TT_DOCTYPE_END:
        printf("\n");
        break;
    case markup::scanner::TT_DATA:
        printf("[%S]", sc.get_value());
        break;
    }
}
---
Andrew, that is very helpful, but there is still a problem recognising the first entity in the DOCTYPE section. The output below is from parsing the original DOCTYPE example.
TT_DOCTYPE_START
TT_DATA: svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd" [
<!ENTITY st0 "fill:#E61408;"
(Implied)TT_DOCTYPE_END
---
I see.
I have updated the sources again with the fix.
The scanner does not try to parse the content of DOCTYPE - it just passes the content "as is" to the caller.
Thus the scanner does no DTD parsing; that is out of scope for the scanner anyway.
If someone wants to implement such parsing/support - let me know.
---
Many thanks. For my purposes parsing the DTD is overkill, but it is important to be able to extract entity definitions correctly. This is an excellent tool for the job.
---
OK, the current method is fine, as the entities in the DOCTYPE scope can be parsed separately. One trivial suggestion, for clarity: how about adding a TT_DOCTYPE_DATA enumeration value that is returned by scanner::scan_doctype_decl()? It makes client code a tiny bit simpler and the intent clearer.
Thanks again.
---
It now returns one or more TT_DATA tokens.
In principle a typical text->DOM parser should have something like this:

switch( token_stream.get_token() )
{
    case TT_DOCTYPE_START:
        parse_DOCTYPE(token_stream);
        break;
    case TT_COMMENT_START:
        ...
}

where parse_DOCTYPE() in its turn would have inside:

while(1)
    switch( token_stream.get_token() )
    {
        case TT_ENTITY_DECL_START:
            parse_ENTITY_decl(token_stream);
            break;
        case TT_ATTR_DECL_START:
            parse_ATTR_decl(token_stream);
            break;
        ...
    }

But again, this scanner was designed for cases when "linear" XML/HTML scanning is required, so DOCTYPE and local DTD parsing was out of scope.
Typical use case: an HTML -> plain text converter.
Another example: we use a customized version of the scanner for DOM-less SVG rendering in htmlayout/sciter. It scans (some subset of) SVG and draws elements as they appear in the source, without building an SVG DOM.
I've commented out the TT_ENTITY_DECL_START handling. It can be enabled if someone decides to build a full parser. Things like TT_ATTR_DECL_START can be added in the same way as TT_ENTITY_DECL_START.
---
Many thanks Andrew,
you have enabled me to cross another task off my todo list. Gets my 5.
Can you clarify licensing please?
Thx++
Jerry.
---
I publish all my stuff under the BSD license, so this one has a BSD license too.
---
I've spotted what appears to be another problem with your very useful markup scanner.
Sometimes an attribute value might be a quoted URL that includes a query string:
<input value="http://mysite?foo&bar;" type=hidden name=foobar>
The scanner sees the '&' and calls scan_entity(). That function reads up to 31 chars looking for the terminating ';'. When it doesn't find one, it simply appends those 31 chars to the value of the attribute. In the case above, that means the terminating '"' is passed over...
My solution is good enough for my purpose but isn't perfect: I pass a delimiter to scan_entity(), which it also checks for when collecting chars into its buf[]. The for() loop winds up looking like:
...
for (; i < 31; ++i)
{
    t = get_char();
    if (t == 0) return TT_EOF;
    if ((delim == ' ' && is_whitespace(t)) || t == delim)
    {
        push_back(t);
        append_value('&');
        for (int n = 0; n < i - 1; ++n)
            append_value(buf[n]);
        return buf[i - 1];
    }
    buf[i] = char(t);
    if (t == ';')
        break;
}
and the 4 calls to scan_entity() become:

scan_entity(0);
...
scan_entity('"');
...
scan_entity('\'');
...
scan_entity(' ');
-scott
---
Thanks dr3d.
In fact the fragment at xh_scanner.cpp(126) needs to be fixed like this:

else do
{
    if (is_whitespace(c)) return TT_ATTR;
    if (c == '>') { push_back(c); return TT_ATTR; }
    append_value(c);
} while ((c = get_char()));

I also slightly updated scan_entity(). I've sent the updates to the CodeProject people, so these changes will appear in the source soon.
---
Nice lightweight parser, thanks for sharing.
I think I may have found a bug: if a nested tag begins and ends, get_tag_name() still returns that nested tag. For example: "Text bold text" - for each word, the parser says that its tag is (respectively) "p b b b". This is incorrect; it should output "p b b p".
-Tyson
---
get_tag_name() returns an actual value only for TT_TAG_START and TT_TAG_END tokens.
Otherwise the tokenizer would have to maintain a stack of elements. That requires memory allocations - the thing I was trying to avoid in the scanner, for many reasons.
---
Type: Article
Licence: BSD
First Posted: 11 May 2006
Views: 153,936
Downloads: 1,485
Bookmarked: 93 times