regex doesnt work for UTF8 #147

Closed
dcsan opened this issue Aug 2, 2016 · 11 comments

@dcsan
Contributor

dcsan commented Aug 2, 2016

I'm trying some Chinese inputs, and the [*] format doesn't seem to behave:

    + [*] 你好 [*]
    - wrapping 你好


So as you can see, using normal Western characters ("ww") on either side of the Chinese characters is OK, but [*] doesn't match when the surrounding characters are Chinese, with or without a space. I also tried the Rive pattern without spaces, i.e.

    + [*]你好[*]

although I'm not sure what the best practice is here.

FWIW normal * is matching OK:

    + 你好
    - 你-》你好<get nickname>

    + 你好 *
    - 你-》<star>

    + [*] 你好 [*]
    - wrapping 你好


@kirsle
Member

kirsle commented Aug 2, 2016

Seems to affect any non-ASCII characters.

Try to match "é 你好 é" against [*] 你好 [*] ((?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+)你好(?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+))
Reply: ERR: No Reply Matched

@kirsle
Member

kirsle commented Aug 2, 2016

Python isn't affected, using the same regexp:

[RS] Try to match 'é 你好 é' against '[*] 你好 [*]' ('^(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))$')
[RS] Found a match!
[RS] Reply: wrapping 你好

I'll have to do some experimenting later to see whether this is a JavaScript/ES5 Unicode regexp bug and if ES6 Unicode-aware regular expressions will match this where the old regexps won't.

Edit:

Without the u flag, . matches any BMP symbol except line terminators. When the ES6 u flag is set, . matches astral symbols too.

Looks like a likely source of the problem. The Chinese symbols you used aren't in the BMP set.
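
A minimal sketch of what that quote describes, assuming an engine that accepts the ES6 u flag (Node 6 or later, for example). The sample string and variable names are illustrative and not part of RiveScript:

```javascript
// U+1F600 is an astral symbol: it lies outside the BMP, so a
// JavaScript string stores it as a surrogate pair (two code units).
var astral = "\uD83D\uDE00";

// Without the u flag, "." consumes a single code unit, so one "."
// cannot span the whole surrogate pair.
var dotWithoutU = /^.$/.test(astral);

// With the u flag, "." consumes one full code point.
var dotWithU = /^.$/u.test(astral);

console.log(dotWithoutU, dotWithU); // false true
```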

@kirsle kirsle added the bug label Aug 2, 2016
@dcsan
Contributor Author

dcsan commented Aug 2, 2016

Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly filled with only 128 of the 65,534 code points remaining to be allocated.

The characters used are "nihao" (hello), the most basic Chinese phrase, so I don't think that's the problem. You could try the match with some simpler Western Unicode characters like umlauts...

Is there a reason not to use the u flag? For old browsers?
FWIW I'm running Node 5/ES6, and I have set utf8 as an option to the Rive interpreter.

@kirsle
Member

kirsle commented Aug 2, 2016

You're right, 你 is in the BMP, in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
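
A quick way to check this from Node, as a rough sketch. `isInBmp` is a hypothetical helper name made up for this example, not a RiveScript or built-in function:

```javascript
// Report whether a string contains only BMP characters (U+0000 to U+FFFF).
// isInBmp is a hypothetical helper written for this example.
function isInBmp(text) {
    for (var i = 0; i < text.length; i++) {
        var unit = text.charCodeAt(i);
        // Surrogate code units (U+D800 to U+DFFF) normally appear only
        // as halves of an astral code point.
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            return false;
        }
    }
    return true;
}

console.log("你".codePointAt(0).toString(16)); // 4f60
console.log(isInBmp("你好"));                  // true
console.log(isInBmp("\uD83D\uDE00"));          // false (U+1F600 is astral)
```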

The u flag raises a syntax error on ES5 engines. It doesn't seem to help on Node 6 though, anyway:

    // Regexp logged for the trigger "[*] 你好 [*]".
    var pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
    var messages = [
        "a 你好 b",
        "你好 你好 你好",
        "é 你好 é"
    ];

    // Iterate by value; for...in over an array walks its keys instead.
    messages.forEach(function (msg) {
        console.log("Try: " + msg);
        // "u" is the ES6 Unicode flag being tested here.
        var match = msg.match(new RegExp("^" + pattern + "$", "u"));
        console.log(match ? "Matched!" : "No match.");
    });

So at least we can rule out that being the cause of the problem. It might be the \b word boundary sequence in the regexp.
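
A small sketch of why \b is a plausible culprit. JavaScript defines \b in terms of \w, and \w covers only [A-Za-z0-9_], so there is no word boundary at the start of a string that begins with é or 你. The sample strings and variable names are illustrative:

```javascript
// For each sample, ask whether a word boundary exists at position 0.
var samples = ["a", "é", "你好"];
var results = samples.map(function (s) {
    return /^\b/.test(s);
});

samples.forEach(function (s, i) {
    console.log(JSON.stringify(s) + " -> " + results[i]);
});
// "a" -> true
// "é" -> false
// "你好" -> false
```

The generated pattern begins with (?:\s|\b)+, which needs whitespace or a word boundary at the start of the message, so a message that starts with a non-ASCII letter cannot get past it.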

@kirsle
Member

kirsle commented Aug 2, 2016

Found this: https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters

Might be able to replace things like (\s|\b) with (\s|^) (maybe also (\s|$)?), as long as that doesn't break #48 again.
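
A rough sketch of that substitution, under the assumption that ^ goes on the leading side and $ on the trailing side. The names `before` and `after` are hypothetical, this is not the code in triggerRegexp(), and it has not been checked against #48:

```javascript
// Hypothetical replacements for the optional-wildcard fragments:
// \b is swapped for ^ before the trigger and for $ after it.
var before = "(?:(?:\\s|^)+(?:.+?)(?:\\s|^)+|(?:\\s|^)+)";
var after = "(?:(?:\\s|$)+(?:.+?)(?:\\s|$)+|(?:\\s|$)+)";
var re = new RegExp("^" + before + "你好" + after + "$");

var messages = ["a 你好 b", "你好 你好 你好", "é 你好 é", "你好", "a你好b"];
var results = messages.map(function (msg) {
    return re.test(msg);
});

console.log(results); // [ true, true, true, true, false ]
```

The last message is rejected because the trigger is glued to other characters with no whitespace between them. That matches the old word-boundary intent, but it also means text written without spaces, as Chinese usually is, would still not match.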

@dcsan
Contributor Author

dcsan commented Oct 15, 2016

Just to clarify, this will match:
我怎么说*

This won't:
我怎么说[*]

@kirsle kirsle added the unicode label Dec 16, 2016
@Lewikster

Did a bit more testing with Chinese characters:

+ [*]你好[*]
- works


Chinese wildcard characters + Chinese trigger does NOT work.
English wildcard letters + Chinese trigger WORKS.

+ [*]你好[*]
- works


Currently an alternative to "[*]你好[*]" is:

+ (*你好|你好|你好*|*你好*)
- works


@dcsan
Contributor Author

dcsan commented Jan 13, 2017

I was wondering if Rive could enable normal regexes? SuperScript has that capability I think.

@kirsle
Member

kirsle commented Jan 19, 2017

@dcsan I think I may just have to add that feature. What I've learned from porting RiveScript to 5 different languages is that A) Unicode is hard, and B) regular expression engines aren't all created equal. Things that work in regexps in one language don't work in another, and it's hard to make RiveScript support all kinds of Unicode across all versions. Allowing the end user to write a literal regular expression lets them fix their specific issues their own way, and avoids all the 'magic' in triggerRegexp() that might interfere with their attempt to get a working regexp out of it.

RiveScript's predecessor supported a regexp command: everything old is new again.

@dcsan
Contributor Author

dcsan commented Jan 21, 2017

It would be a neat feature to add, and would open up full regexp power, especially for multiple languages.

I didn't know about that old Perl version.

BTW, regarding the tilde: I very much liked SuperScript's old implementation, where you could write things like ~emohello and it would expand to match a whole category of phrases (a bit like RiveScript arrays, but I believe using a much bigger NLP corpus). I think they removed that recently and made users call a function, but that is a nice syntax, reserving the tilde for ~= (approximately equal).

@kirsle
Member

kirsle commented Mar 10, 2017

Closing this issue in favor of tracking the ~Regexp feature in aichaos/rivescript-wd#6
