There is a Unicode bug in `nersuite_common/tokenizer.cpp`, in `Tokenizer::find_token_end`: if the `isalnum(int)` test inside that method fails, the token created is always a single byte wide (because the method then returns `beg + 1`, and `beg` is a `size_t`).
This means that multibyte-encoded text, such as text in any UTF encoding, cannot be tokenized correctly by this tool (with the logical exception of UTF-8 containing only ASCII), because it splits characters wider than one byte into two or more tokens.
This is even nastier for UTF-8-encoded text, because the bug is non-obvious and only becomes apparent when special characters such as non-ASCII dashes or Greek letters are present in the text.
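For illustration, here is a minimal sketch of what a fix could look like. This is not the actual NERsuite code, and `utf8_seq_len` is a hypothetical helper; it only shows the idea of advancing past the whole UTF-8 sequence instead of returning `beg + 1`:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not in NERsuite): length in bytes of the UTF-8
// sequence that starts at position beg. Assumes beg < text.size().
static std::size_t utf8_seq_len(const std::string &text, std::size_t beg)
{
    const unsigned char lead = static_cast<unsigned char>(text[beg]);
    if (lead < 0x80)           return 1;  // plain ASCII byte
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;  // invalid lead byte: fall back to one byte
}

// Sketch of a fixed find_token_end: a non-alphanumeric byte yields a
// token covering the whole UTF-8 character, not just its first byte.
std::size_t find_token_end(const std::string &text, std::size_t beg)
{
    if (!std::isalnum(static_cast<unsigned char>(text[beg])))
        return beg + utf8_seq_len(text, beg);  // was: return beg + 1;

    // Scan forward over the run of alphanumeric bytes.
    std::size_t end = beg;
    while (end < text.size() &&
           std::isalnum(static_cast<unsigned char>(text[end])))
        ++end;
    return end;
}
```

With a change along these lines, an en dash (U+2013, the three bytes 0xE2 0x80 0x93 in UTF-8) would come out as one three-byte token instead of three one-byte fragments.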
When we began developing this application, we used a pre-processing program that converts Unicode characters to ASCII characters.
It is not exactly the same program, but you can find a similar one at https://github.com/spyysalo/unicode2ascii
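As a rough sketch of that pre-processing idea (my own illustration, not the unicode2ascii tool itself, and the mapping table is purely illustrative), one could map common non-ASCII characters to ASCII stand-ins before the text ever reaches the tokenizer:

```cpp
#include <iostream>
#include <map>
#include <string>

int main()
{
    // Illustrative mapping only; a real tool such as unicode2ascii
    // covers far more characters. Keys are UTF-8 byte sequences.
    const std::map<std::string, std::string> to_ascii = {
        {"\xE2\x80\x93", "-"},   // en dash
        {"\xE2\x80\x94", "-"},   // em dash
        {"\xCE\xB1", "alpha"},   // Greek small letter alpha
        {"\xCE\xB2", "beta"},    // Greek small letter beta
    };

    // Replace every occurrence in each input line, then emit it.
    std::string line;
    while (std::getline(std::cin, line)) {
        for (const auto &kv : to_ascii) {
            std::string::size_type pos = 0;
            while ((pos = line.find(kv.first, pos)) != std::string::npos) {
                line.replace(pos, kv.first.size(), kv.second);
                pos += kv.second.size();
            }
        }
        std::cout << line << '\n';
    }
    return 0;
}
```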
I would also like to make NERsuite handle multibyte input, since non-ASCII characters appear in virtually all biomedical texts.
Unfortunately, it will take some time, at least a few months, to find time for this improvement, because I am currently preparing for my thesis defense presentation at the beginning of February.