Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] make Token::SKIP also skip for parser #27

Open
lublak opened this issue Jun 18, 2021 · 9 comments
Open

[feature request] make Token::SKIP also skip for parser #27

lublak opened this issue Jun 18, 2021 · 9 comments

Comments

@lublak
Copy link

lublak commented Jun 18, 2021

currently in some sigils there are always the "skipped" data inside:

$lexer->push('\s+', Token::SKIP);
$parser->sigil(0); // returns "(   "
@lublak lublak closed this as completed Jun 18, 2021
@lublak lublak reopened this Jun 18, 2021
@BenHanson
Copy link
Collaborator

@weltling Is this due to mixing UTF-8 with UTF-32? I imagine it would be better to use lexertl UTF-8 iterators to convert the UTF-32 back to UTF-8.

@BenHanson
Copy link
Collaborator

@lublak Could you provide a self contained test case for this? Then I can work on a fix.

Thanks

@weltling
Copy link
Owner

@BenHanson with regard to UTF-8 vs UTF-32 - the conversion currently relies on libstdc++, please see here lib/parle/cvt.hpp . I think Windows is not quite happy there, but that's probably the least of an issue. The conversion between the PHP engine using UTF-8 and lexertl/parsertl using UTF-32 is don at the corresponding places, can grep them by using the macros as in the header. But if you think moving away from the std functionality to the custom one would bring a better result, it's of course open to change.

With this issue, however, i'd second the request to @lublak for a proper reproducer, the provided data doesn't seem enough.

Thanks

@BenHanson
Copy link
Collaborator

BenHanson commented May 13, 2023

@weltling Yes, the lexertl utf iterators are platform independent so we should definitely investigate switching over.

If UTF-8 is the preferred Unicode format for PHP, then we don't even need a UTF-32 build. We can discuss this further to decide how you want it to work.

@weltling
Copy link
Owner

AFAIR the reason to support UTF-32 is that otherwise some regex options are not available. At least these come in question:

https://www.php.net/manual/en/parle.regex.unicodecharclass.php

These seem to be provided standard C++ lib and are UTF-32 only, if i don't err.

Thanks

@BenHanson
Copy link
Collaborator

What you can't do is have a lexertl state_machine that takes 8 bit characters in order to use Unicode characters.

However, input text (including rules) can be UTF-8 and it's possible to convert those strings on the fly in C++ using the lexertl iterators.

I can write you a demo if it makes it clearer.

@weltling
Copy link
Owner

The conversions already happen at the corresponding places, as mentioned with regard to ./lib/parle/cvt.hpp currently using the functionality from <codecvt>. The conversion happens at runtime. It's just with the current design, to have UTF-32 support a special build is required. Apart from that, switching to lexertl iterators instead of codecvt is sure something that can be considered. With PHP side it is always UTF-8, whether lexertl/parsertl internally support UTF-32 or not.

UTF-8 variant is compiled by default for simplicity. Of course it is possible to support UTF-8 only or UTF-32 supporting build, it would probably require rewriting the internals to carry another type of C++ object inside. BTW same is actually with things like Parser vs. RParser and Lexer vs. RLexer clasess - it could be just one class name and be constructed from PHP, just at the time of creating this it seemed simpler to have different class names or different build. Unifying might be worth it or not, usability or performance wise.

Thanks

@BenHanson
Copy link
Collaborator

BenHanson commented May 18, 2023

I've now got conversion of UTF-32 to UTF-8 up and running and working for parser dumping.

@weltling
Copy link
Owner

Very nice. The char32_t issue is what i've seen as well as a blocker for the UTF-32 part.

Thansk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants