Unable to match unicode chars #138

mm-lemainque · 2024-12-29T17:31:48Z

Hello,

The current implementation does not seem to match non-ASCII Unicode characters. My goal is to build a terminal that accepts any letter, including accented ones, such as [a-zA-Zà-ÿÀ-ß0-9 ]+, which GBNF appears to support.

Simple code to reproduce with xgrammar 1.8.0:

>>> import xgrammar.testing
>>> xgrammar.testing._is_grammar_accept_string("root ::= [é]", "é", True)
/workspace/cpp/matcher_base.cc:99: Matching char: 195 "\xc3"
/workspace/cpp/matcher_base.cc:101: Previous stack: Stacks tops size: 1
Stack #0: {
id: 0, RulePosition: rule 0: root, sequence 1: ("\xe9"), element id: 0, element in string: 0, parent id: -1, ref count: 1
}
/workspace/cpp/matcher_base.cc:131: Character 195 "\xc3" Rejected
/workspace/cpp/matcher.cc:401: Matching failed after accepting 0 characters
False

I also tried with a pseudo-wildcard rule:

>>> xgrammar.testing._is_grammar_accept_string("root ::= [\\x00-\\xff]", "é", True)
/workspace/cpp/matcher_base.cc:99: Matching char: 195 "\xc3"
/workspace/cpp/matcher_base.cc:101: Previous stack: Stacks tops size: 1
Stack #0: {
id: 0, RulePosition: rule 0: root, sequence 1: ([\0-\xff]), element id: 0, left utf8 bytes: 0, parent id: -1, ref count: 1
}
/workspace/cpp/matcher_base.cc:131: Character 195 "\xc3" Rejected
/workspace/cpp/matcher.cc:401: Matching failed after accepting 0 characters
False
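
For context on the two logs above: the matcher reports byte 195 ("\xc3") because "é" is the code point U+00E9, which UTF-8 encodes as two bytes. The check below uses only plain Python (no xgrammar calls) and shows that the code point itself falls inside the \x00-\xff range, while the first byte the matcher sees is 0xC3:

```python
# "\u00e9" is the precomposed form of "é" (LATIN SMALL LETTER E WITH ACUTE).
ch = "\u00e9"

code_point = ord(ch)              # Unicode code point of the character
utf8_bytes = ch.encode("utf-8")   # the bytes a byte-level matcher consumes

print(hex(code_point))            # 0xe9
print(list(utf8_bytes))           # [195, 169]
print(0x00 <= code_point <= 0xFF) # True
```

So a comparison on code points would accept "é" for both grammars, whereas a comparison that gives up on the multi-byte lead byte 195 rejects it.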

Unless you know of a workaround, I would be happy to help solve this.

Thanks a lot for your help


mm-lemainque · 2024-12-30T13:00:06Z

Removing the code below solves the issue, and all tests still pass:

xgrammar/cpp/matcher_base.cc, lines 32 to 34 in 937f3c4:

    if (num_bytes > 1) {
      return is_negative;
    }

EDIT: a proper fix would rather be to handle multi-byte characters in kCharacterClass rules.
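
To illustrate what handling multi-byte characters in a character class could look like, here is a rough Python sketch. It is not xgrammar's code: utf8_sequence_length and char_class_accepts are hypothetical helpers, and the class is modeled as a list of inclusive code point ranges. The idea is to decode the whole UTF-8 sequence first and only then compare the resulting code point against the ranges:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of a UTF-8 sequence given its lead byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead_byte:#x}")


def char_class_accepts(ranges, data: bytes, is_negative: bool = False) -> bool:
    """Check the first UTF-8 character of `data` against a character class.

    `ranges` is a list of inclusive (low, high) code point pairs;
    `is_negative` models a negated class such as [^...].
    """
    length = utf8_sequence_length(data[0])
    code_point = ord(data[:length].decode("utf-8"))
    in_class = any(low <= code_point <= high for low, high in ranges)
    return in_class != is_negative


# Hypothetical encoding of [a-zA-Zà-ÿÀ-ß0-9 ] as code point ranges.
LETTERS = [
    (ord("a"), ord("z")), (ord("A"), ord("Z")),
    (0x00E0, 0x00FF),  # à-ÿ
    (0x00C0, 0x00DF),  # À-ß
    (ord("0"), ord("9")), (ord(" "), ord(" ")),
]

print(char_class_accepts(LETTERS, "\u00e9".encode("utf-8")))  # True
```

The matcher in the logs consumes input one byte at a time (note "left utf8 bytes" in the stack dump), so an actual fix would presumably need to carry partial-sequence state across bytes rather than decode a whole character up front as this sketch does.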
