word tokenizer works only for 7 bit ascii #29
Comments
Handling this may be related to handling #30. A clean solution would require ditching the regex, which is inflexible in its definition of what is and is not a word, and difficult to adjust. Would non-regex comparisons make things much slower?
@matanster Hmm, what exactly is wrong with regexps in this regard? A couple of these could come in handy, by the way: https://github.com/One-com/unicoderegexp/blob/master/lib/unicodeRegExp.js#L16-L23
@papandreou I have moved on since, but regex is very opinionated about what counts as a word character, which simply doesn't cut it in some cases. It's also quick-and-dirty; my sentiment on that is similar to the one expressed at https://github.com/sirthias/parboiled/wiki/RegEx-vs.-parboiled-vs.-Parser-Generators.
Yes, thanks for the link! But regex is only poorly composable: you would not engineer any other part of your software in the arcane, messy way that regexes are crafted, unless you write assembly code for fun. That link is a good example of trying to be modular about regex, in a similar spirit to https://github.com/VerbalExpressions/JSVerbalExpressions. Since JavaScript is so nice with function passing, why not have jsdiff let you provide a matching function rather than a regex? That is a standard hallmark of other JavaScript libraries (d3.js, to name one) where flexibility is held in high regard.
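To make the proposal concrete, here is a minimal sketch of a tokenizer driven by a predicate function instead of a regex. The names `tokenizeBy` and `notWhitespace` are hypothetical and are not part of jsdiff's API; the sketch only illustrates the idea of passing in a matching function.

```javascript
// Hypothetical sketch: group consecutive characters into tokens based on a
// caller-supplied predicate. Not part of jsdiff.
function tokenizeBy(value, isWordChar) {
  const tokens = [];
  let current = '';
  let currentIsWord = null;
  // for...of iterates by code point, so astral characters stay intact.
  for (const ch of value) {
    const isWord = isWordChar(ch);
    // Start a new token whenever the predicate result flips.
    if (current !== '' && isWord !== currentIsWord) {
      tokens.push(current);
      current = '';
    }
    current += ch;
    currentIsWord = isWord;
  }
  if (current !== '') tokens.push(current);
  return tokens;
}

// Example predicate (an assumption): anything that is not whitespace is a word character.
const notWhitespace = (ch) => !/\s/.test(ch);

console.log(tokenizeBy('wir üben', notWhitespace)); // [ 'wir', ' ', 'üben' ]
```

The caller decides what counts as a word character, at the cost of a character-by-character loop that may be slower than a native regex split.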
Fair enough to dislike regular expressions, but I'm not sure that I understand the alternative you're proposing? If the job at hand is to make a word tokenizer, and you need to distinguish between all (unicode) letters and non-letters, I don't see how you could do much better than to have something like If your building blocks are functions, you cannot do that trick.
An alternative is to use xregexp with the unicode addon. Then you can just use Hopefully these features will make it into JavaScript proper so that we'll have real Unicode support in that area as well.
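Unicode property escapes did later become part of the language (ES2018, with the `u` flag), so a Unicode-aware word pattern can now be written without an external library. The sketch below is one possible pattern, not the one jsdiff uses; `tokenizeWords` is a hypothetical helper name.

```javascript
// Sketch: split text into word, whitespace, and punctuation tokens using
// Unicode property escapes (requires the `u` flag, ES2018+).
// `tokenizeWords` is a hypothetical helper, not part of jsdiff.
function tokenizeWords(value) {
  // Alternatives, in order:
  //   1. a run of letters, combining marks, digits, or underscores
  //   2. a run of whitespace
  //   3. any single remaining character (punctuation, symbols)
  return value.match(/[\p{L}\p{M}\p{N}_]+|\s+|[^\p{L}\p{M}\p{N}_\s]/gu) || [];
}

console.log(tokenizeWords('wir üben, täglich'));
// [ 'wir', ' ', 'üben', ',', ' ', 'täglich' ]
```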
Released in 2.0.0
The word tokenizer for `WordDiff` and `WordWithSpaceDiff` uses `\b` in its regular expression. `\b` treats only `[a-zA-Z0-9_]` as word characters, so it fails on anything beyond 7-bit ASCII. For example, the German phrase "wir üben" splits to:

Replacing the tokenizer with `value.split(/(\s+)/)` is sufficient in my use case, but I don't have newlines in my text. Some further testing is needed, I think.

Further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620
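The behaviour described above can be reproduced in isolation. The snippet below uses a bare `/\b/` split for illustration only (an assumption: it is not jsdiff's exact tokenizer regex) and compares it with the whitespace-based split suggested in the issue.

```javascript
const phrase = 'wir üben';

// \b only sees [a-zA-Z0-9_] as word characters, so "ü" counts as a non-word
// character and a boundary is reported between "ü" and "b".
const byBoundary = phrase.split(/\b/);

// Splitting on whitespace runs (kept via the capture group) leaves "üben" intact.
const byWhitespace = phrase.split(/(\s+)/);

console.log(byBoundary);   // [ 'wir', ' ü', 'ben' ]
console.log(byWhitespace); // [ 'wir', ' ', 'üben' ]
```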