[feature request] make Token::SKIP also skip for parser #27
Comments
@weltling Is this due to mixing UTF-8 with UTF-32? I imagine it would be better to use lexertl UTF-8 iterators to convert the UTF-32 back to UTF-8.
@lublak Could you provide a self-contained test case for this? Then I can work on a fix. Thanks
@BenHanson With regard to UTF-8 vs UTF-32: the conversion currently relies on With this issue, however, I'd second the request to @lublak for a proper reproducer; the provided data doesn't seem to be enough. Thanks
@weltling Yes, the lexertl UTF iterators are platform independent, so we should definitely investigate switching over. If UTF-8 is the preferred Unicode format for PHP, then we don't even need a UTF-32 build. We can discuss this further to decide how you want it to work.
AFAIR the reason to support https://www.php.net/manual/en/parle.regex.unicodecharclass.php These seem to be provided by the standard C++ lib and are UTF-32 only, if I don't err. Thanks
What you can't do is have a lexertl state_machine that takes 8-bit characters in order to use Unicode characters. However, input text (including rules) can be UTF-8, and it's possible to convert those strings on the fly in C++ using the lexertl iterators. I can write you a demo if it makes it clearer.
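To illustrate the kind of conversion being discussed, here is a minimal standalone sketch of decoding UTF-8 input into UTF-32 code points. It deliberately does not use the lexertl iterators: `utf8_to_utf32` and `utf8_decode_example` are hypothetical names made up for this example, and the lexertl iterators would do the equivalent work lazily, one code point at a time. The sketch replaces malformed bytes with U+FFFD and, to stay short, does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of lexertl or parle): decode a UTF-8 string
// into UTF-32 code points. Invalid lead bytes, bad continuation bytes and
// truncated sequences are replaced with U+FFFD. Overlong encodings and
// surrogates are NOT rejected in this sketch.
std::u32string utf8_to_utf32(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;

    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out.push_back(replacement); ++i; continue; }

        // Truncated sequence at the end of the input.
        if (i + len > in.size()) { out.push_back(replacement); break; }

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage sketch: "é" takes two bytes in UTF-8 but is one UTF-32 code point.
inline void utf8_decode_example()
{
    const std::u32string decoded = utf8_to_utf32(std::string("h\xC3\xA9"));
    assert(decoded.size() == 2);
    assert(decoded[1] == 0xE9);
}
```

The point of the design is that only the 32-bit side ever reaches a state machine built for wide characters, while rules and input can stay UTF-8 on the PHP side.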
The conversions already happen at the corresponding places, as mentioned with regard to UTF-8 variant is compiled by default for simplicity. Of course it is possible to support a UTF-8 only build or a build supporting UTF-32; it would probably require rewriting the internals to carry another type of C++ object inside. BTW, the same actually applies to things like the Parser vs. RParser and Lexer vs. RLexer classes: it could be just one class name, constructed from PHP. At the time of creating this, it just seemed simpler to have different class names or different builds. Unifying might or might not be worth it, usability- or performance-wise. Thanks
I've now got conversion of UTF-32 to UTF-8 up and running and working for parser dumping.
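For readers following along, the reverse direction (UTF-32 back to UTF-8, as needed when dumping text for PHP) can be sketched as below. This is not the code referred to in the comment above: `utf32_to_utf8` and `utf8_encode_example` are hypothetical names for illustration only, and replacing out-of-range values and surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of lexertl or parle): encode UTF-32 code
// points as UTF-8. Surrogates and values above U+10FFFF are replaced with
// U+FFFD, which is an assumption made for this sketch.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Usage sketch: U+20AC (euro sign) becomes the three bytes E2 82 AC.
inline void utf8_encode_example()
{
    const std::string encoded = utf32_to_utf8(std::u32string(1, static_cast<char32_t>(0x20AC)));
    assert(encoded == std::string("\xE2\x82\xAC"));
}
```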
Very nice. The Thanks
Currently, some sigils always contain the "skipped" data: