regex doesnt work for UTF8 #147

dcsan · 2016-08-02T14:07:40Z

I'm trying some chinese inputs, and the [*] format doesn't seem to behave

    + [*] 你好 [*]
    - wrapping 你好

So as you can see, using normal western text (ww) on either side of the Chinese characters is OK, but the [*] isn't matching when the surrounding text is Chinese, with or without a space. I also tried the Rive pattern without spaces, i.e.

    + [*]你好[*]

although i'm not sure what the best practice is here.

FWIW normal * is matching OK:

    + 你好
    - 你－》你好<get nickname>

    + 你好 *
    - 你－》<star>

    + [*] 你好 [*]
    - wrapping 你好


kirsle · 2016-08-02T17:09:25Z

Seems to affect any non-ASCII characters.

Try to match "é 你好 é" against [*] 你好 [*] ((?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+)你好(?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+))
Reply: ERR: No Reply Matched

kirsle · 2016-08-02T17:13:28Z

Python isn't affected, using the same regexp:

[RS] Try to match 'é 你好 é' against '[*] 你好 [*]' ('^(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))$')
[RS] Found a match!
[RS] Reply: wrapping 你好

I'll have to do some experimenting later to see whether this is a JavaScript/ES5 Unicode regexp bug and if ES6 Unicode-aware regular expressions will match this where the old regexps won't.

Edit:

Without the u flag, . matches any BMP symbol except line terminators. When the ES6 u flag is set, . matches astral symbols too.

Looks like a likely source of the problem. The Chinese symbols you used aren't in the BMP set.

dcsan · 2016-08-02T18:02:31Z

Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly filled with only 128 of the 65,534 code points remaining to be allocated.

the characters used are "nihao" = hello, the most basic chinese phrase.
So i don't think that's the problem. you could try the match with some simpler western unicode stuff like umlauts...

is there a reason to not use the u flag? for old browsers?
FWIW i'm running in node5/ES6
i have set utf8 as an option to the rive interpreter.

kirsle · 2016-08-02T18:23:35Z

You're right, 你 seems to be in the BMP, in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
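
A quick way to double-check this is to look at the code points directly. The sketch below assumes an ES6 engine (for `for...of` and the `u` flag); `isAllBMP` is a hypothetical helper name for illustration, not part of RiveScript.

```javascript
// Hypothetical helper: true when every code point in `text` lies in the BMP
// (U+0000 to U+FFFF).
function isAllBMP(text) {
    // for...of iterates a string by code point, not by UTF-16 code unit.
    for (const ch of text) {
        if (ch.codePointAt(0) > 0xFFFF) {
            return false;
        }
    }
    return true;
}

console.log("你".codePointAt(0).toString(16)); // 4f60
console.log(isAllBMP("你好"));                  // true

// Without the u flag, "." matches a single UTF-16 code unit, which is enough
// for BMP characters but not for an astral one such as U+1F600.
console.log(/^.$/.test("你"));   // true
console.log(/^.$/.test("😀"));   // false
console.log(/^.$/u.test("😀"));  // true
```

So for these particular characters, `.` should behave the same with or without the `u` flag.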

The u flag raises a syntax error on ES5 engines. It doesn't seem to help on Node 6 though, anyway:

    var pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
    var messages = [
        "a 你好 b",
        "你好 你好 你好",
        "é 你好 é"
    ];

    for (var i in messages) {
        var msg = messages[i];
        console.log("Try: " + msg);
        var match = msg.match(new RegExp("^" + pattern + "$", "u"));
        if (match) {
            console.log("Matched!");
        }
    }

So at least we can rule out that being the cause of the problem. It might be the \b word boundary sequence in the regexp.
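
A minimal sketch of why `\b` is a plausible suspect: in JavaScript `\w` is `[A-Za-z0-9_]`, so `\b` only matches next to one of those characters. `startsAtWordBoundary` is a hypothetical helper used only for illustration.

```javascript
// Hypothetical helper: does the message begin at a \b word boundary?
// The generated trigger pattern needs \s or \b at the start of the message.
function startsAtWordBoundary(msg) {
    return /^\b/.test(msg);
}

var samples = ["a 你好 b", "你好 你好 你好", "é 你好 é"];
samples.forEach(function (msg) {
    console.log(msg + " -> " + startsAtWordBoundary(msg));
});
// a 你好 b -> true
// 你好 你好 你好 -> false
// é 你好 é -> false

// A string with no ASCII word characters has no word boundary at all,
// and the u flag alone does not change that.
console.log(/\b/.test("你好"));   // false
console.log(/\b/u.test("你好"));  // false
```

This lines up with the test above: only the message that starts and ends with an ASCII letter gets a boundary for the pattern to anchor on.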

kirsle · 2016-08-02T18:28:45Z

Found this: https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters

Might be able to replace things like (\s|\b) with (\s|^) (maybe also (\s|$)?) and make sure it doesn't break #48 again.
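
A rough sketch of that idea, assuming a keyword only needs to be delimited by whitespace or the ends of the message. `escapeRegExp` and `keywordPattern` are hypothetical helpers, not RiveScript functions, and this has not been checked against #48.

```javascript
// Hypothetical helper: escape regexp metacharacters in a literal keyword.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a keyword matcher that uses ^, $ and \s
// instead of \b, so it does not depend on ASCII word characters.
function keywordPattern(keyword) {
    return new RegExp("(?:^|\\s)" + escapeRegExp(keyword) + "(?:\\s|$)");
}

var re = keywordPattern("你好");
var messages = ["a 你好 b", "你好 你好 你好", "é 你好 é", "你好", "a你好b"];
messages.forEach(function (msg) {
    console.log(msg + " -> " + re.test(msg));
});
// a 你好 b -> true
// 你好 你好 你好 -> true
// é 你好 é -> true
// 你好 -> true
// a你好b -> false
```

The last case shows the trade-off: without `\b`, the keyword has to be separated by whitespace, so text written without spaces around it would not match.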

dcsan · 2016-10-15T05:41:20Z

just to clarify, this will match
我怎么说*

this wont
我怎么说[*]

Lewikster · 2017-01-13T05:27:10Z

Did some more testing with Chinese characters:

    + [*]你好[*]
    - works

Chinese wild characters + Chinese trigger does NOT work.
English wild letters + Chinese trigger WORKS.

    + [*]你好[*]
    - works

Currently an alternative to "[*]你好[*]" is

    + (*你好|你好|你好*|*你好*)
    - works

dcsan · 2017-01-13T09:03:48Z

I was wondering if Rive could enable normal regexes? SuperScript has that capability I think.

kirsle · 2017-01-19T16:39:11Z

@dcsan I think I may just have to add that feature. What I've learned from porting RiveScript to 5 different languages is that A) Unicode is hard, and B) regular expression engines aren't all created equally. Things that work in regexps in one language don't work in another, and it's hard to make RiveScript support all kinds of Unicode across all versions; so allowing the end user to write a literal regular expression can enable them to fix their specific issues their own way, and avoids all the 'magic' that triggerRegexp() does that might interfere with their attempt to get a working regexp out of it.

RiveScript's predecessor supported a regexp command: everything old is new again.

dcsan · 2017-01-21T02:09:10Z

It would be a neat feature to add, and it would open up full regexp power, especially for multiple languages.

I didn't know about that old Perl version.

Btw, regarding the tilde: I very much liked SuperScript's old implementation, where you could do things like ~emohello and it would expand to match a whole category of phrases (a bit like RiveScript arrays, but I believe using a much bigger NLP corpus). I think they removed that recently and made users call a function, but reserving the tilde for ~= (approximately equal) is a nice syntax.

kirsle · 2017-03-10T21:59:20Z

Closing this issue in favor of tracking the ~Regexp feature in aichaos/rivescript-wd#6

kirsle added the bug label Aug 2, 2016

kirsle mentioned this issue Aug 4, 2016

UTF-8 and Optionals aichaos/rivescript-python#37

Closed

kirsle modified the milestone: v1.17.0 Oct 5, 2016

kirsle added the unicode label Dec 16, 2016

kirsle mentioned this issue Feb 7, 2017

Issue with conversations with utf8 aichaos/rivescript-python#78

Closed

kirsle mentioned this issue Mar 10, 2017

Support raw regular expression triggers aichaos/rivescript-wd#6

Open

kirsle closed this as completed Mar 10, 2017

kirsle mentioned this issue Apr 1, 2017

Bug on trigger with cyrrilic language aichaos/rivescript-go#22

Closed

kirsle mentioned this issue Jul 19, 2017

bug on utf-8 aichaos/rivescript#3

Closed

dcsan mentioned this issue Dec 17, 2017

wildcards failing for Japanese #253

Closed

kirsle mentioned this issue Feb 21, 2018

Problem with [*] for non English #254

Closed

kirsle mentioned this issue Mar 4, 2018

Add ?Keyword command to work around Unicode keyword bug #256

Merged

dcsan mentioned this issue Apr 23, 2018

How to write optional word with empty support? #262

Open

kirsle mentioned this issue Jan 4, 2020

Optionals issue with non-English (Arabic) #333

Open

kirsle mentioned this issue Mar 5, 2020

Keyword trigger not matching '[*] sí [*]' #336

Closed
