regex doesnt work for UTF8 #147
Comments
Seems to affect any non-ASCII characters.
Python isn't affected, using the same regexp:
I'll have to do some experimenting later to see whether this is a JavaScript/ES5 Unicode regexp bug and if ES6 Unicode-aware regular expressions will match this where the old regexps won't. Edit:
Looks like a likely source of the problem. The Chinese symbols you used aren't in the BMP set.
The characters used are "nihao" = hello, the most basic Chinese phrase. Is there a reason not to use the u flag? For old browsers?
You're right, 你 seems to be in the BMP plane in the The

    // Pattern under test: 你好 with optional text on either side.
    var pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
    var messages = [
        "a 你好 b",
        "你好 你好 你好",
        "é 你好 é"
    ];
    // Try each message against the anchored pattern with the ES6 "u" flag.
    for (var i = 0; i < messages.length; i++) {
        var msg = messages[i];
        console.log("Try: " + msg);
        var match = msg.match(new RegExp("^" + pattern + "$", "u"));
        if (match) {
            console.log("Matched!");
        }
    }

So at least we can rule out that being the cause of the problem. It might be the
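One way to narrow this down is to test `\b` on its own. The following is a minimal sketch, assuming a standard ECMAScript engine in which `\b` is defined in terms of `\w` (`[A-Za-z0-9_]`) and the `u` flag alone does not widen that class. The helper name `startsAtWordBoundary` is made up for illustration and is not part of RiveScript.

```javascript
// Hypothetical helper: does `\b` match at the very start of `text`?
function startsAtWordBoundary(text, flags) {
  return new RegExp("^\\b", flags).test(text);
}

// Same three messages as in the test above.
var samples = ["a 你好 b", "你好 你好 你好", "é 你好 é"];

// Check each message with and without the "u" flag.
var boundaryResults = samples.map(function (text) {
  return {
    text: text,
    plain: startsAtWordBoundary(text, ""),
    unicode: startsAtWordBoundary(text, "u")
  };
});

boundaryResults.forEach(function (r) {
  console.log(r.text + " -> plain: " + r.plain + ", u flag: " + r.unicode);
});
```

If the engine behaves as assumed, only the message that starts with an ASCII letter reports a boundary, with or without the `u` flag. That would be consistent with the leading `(?:\s|\b)+` group failing on messages that start with a non-ASCII character.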
Found this: https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters Might be able to replace things like
just to clarify, this will match this won't
I was wondering if Rive could enable normal regexes? SuperScript has that capability I think.
@dcsan I think I may just have to add that feature. What I've learned from porting RiveScript to 5 different languages is that A) Unicode is hard, and B) regular expression engines aren't all created equal. Things that work in regexps in one language don't work in another, and it's hard to make RiveScript support all kinds of Unicode across all versions. Allowing the end user to write a literal regular expression lets them fix their specific issues their own way, and avoids all the 'magic' that RiveScript does. RiveScript's predecessor supported a regexp command: everything old is new again.
It would be a neat feature to add, and would open up full regexp power, especially for multiple languages. I didn't know about that old Perl version, btw. Regarding the tilde, I very much liked SuperScript's old implementation, where you could do things like
Closing this issue in favor of tracking the
I'm trying some Chinese inputs, and the [*] format doesn't seem to behave.
So as you can see, using normal western code (ww) on the sides of the Chinese characters is OK, but the [*] isn't matching if using Chinese characters, with or without a space. I also tried the Rive pattern without spaces, although I'm not sure what the best practice is here.
FWIW normal * is matching OK: