[feature request] make Token::SKIP also skip for parser #27

lublak · 2021-06-18T07:07:00Z

Currently, some sigils always contain the "skipped" data:

$lexer->push('\s+', Token::SKIP);
$parser->sigil(0); // returns "(   "

BenHanson · 2023-05-10T21:34:12Z

@weltling Is this due to mixing UTF-8 with UTF-32? I imagine it would be better to use lexertl UTF-8 iterators to convert the UTF-32 back to UTF-8.

BenHanson · 2023-05-13T08:36:12Z

@lublak Could you provide a self contained test case for this? Then I can work on a fix.

Thanks

weltling · 2023-05-13T12:24:30Z

@BenHanson with regard to UTF-8 vs UTF-32: the conversion currently relies on libstdc++, please see lib/parle/cvt.hpp. I think Windows is not quite happy there, but that's probably the least of the issues. The conversion between the PHP engine, which uses UTF-8, and lexertl/parsertl, which use UTF-32, is done at the corresponding places; you can grep for them using the macros from that header. But if you think moving away from the std functionality to a custom one would bring a better result, it's of course open to change.

With this issue, however, I'd second the request to @lublak for a proper reproducer; the provided data doesn't seem to be enough.

Thanks

BenHanson · 2023-05-13T20:05:00Z

@weltling Yes, the lexertl utf iterators are platform independent so we should definitely investigate switching over.

If UTF-8 is the preferred Unicode format for PHP, then we don't even need a UTF-32 build. We can discuss this further to decide how you want it to work.

weltling · 2023-05-14T23:01:10Z

AFAIR the reason to support UTF-32 is that otherwise some regex options are not available. At least these come into question:

https://www.php.net/manual/en/parle.regex.unicodecharclass.php

These seem to be provided by the standard C++ library and are UTF-32 only, if I'm not mistaken.

Thanks

BenHanson · 2023-05-15T08:28:27Z

What you can't do is have a lexertl state_machine that takes 8-bit characters and still use Unicode characters.

However, input text (including rules) can be UTF-8 and it's possible to convert those strings on the fly in C++ using the lexertl iterators.

I can write you a demo if it makes it clearer.
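As an editorial illustration of what converting UTF-8 input on the fly involves, here is a minimal hand-rolled sketch. It does not use the lexertl iterators or the <codecvt> code in lib/parle/cvt.hpp; the function name utf8_to_utf32 is hypothetical, and the sketch replaces malformed sequences with U+FFFD without rejecting overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of lexertl or parle): decode UTF-8 bytes
// into UTF-32 code points. Malformed or truncated sequences yield U+FFFD.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80) { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray byte

        // The whole sequence must fit into the remaining input.
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);        // append 6 payload bits
        }

        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }      // resync on the next byte
    }
    return out;
}
```

A UTF-8 rule or input string coming from PHP would be passed through such a conversion before reaching a char32_t based state machine.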

weltling · 2023-05-15T22:00:16Z

The conversions already happen at the corresponding places, as mentioned with regard to ./lib/parle/cvt.hpp, which currently uses the functionality from <codecvt>. The conversion happens at runtime. It's just that with the current design, a special build is required to have UTF-32 support. Apart from that, switching to lexertl iterators instead of codecvt is surely something that can be considered. On the PHP side it is always UTF-8, whether lexertl/parsertl internally support UTF-32 or not.

The UTF-8 variant is compiled by default for simplicity. Of course it would be possible to have a single build that supports either UTF-8 only or UTF-32; it would probably require rewriting the internals to carry another type of C++ object inside. BTW the same actually goes for things like the Parser vs. RParser and Lexer vs. RLexer classes: it could be just one class name and be constructed from PHP; it's just that at the time of creating this it seemed simpler to have different class names or different builds. Unifying might or might not be worth it, usability or performance wise.

Thanks

BenHanson · 2023-05-18T07:24:36Z

I've now got conversion of UTF-32 to UTF-8 up and running and working for parser dumping.
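For readers unfamiliar with the reverse direction, the following is an editorial sketch of encoding UTF-32 back to UTF-8, as needed when handing text such as a parser dump back to PHP. It is not the code referred to in this comment and does not use lexertl; the name utf32_to_utf8 is hypothetical, and invalid code points are replaced with U+FFFD as an assumption of this sketch.

```cpp
#include <string>

// Hypothetical helper (not part of lexertl or parle): encode UTF-32 code
// points as UTF-8 bytes. Surrogates and values above U+10FFFF become U+FFFD.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD; // not a valid Unicode scalar value
        }
        if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```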

weltling · 2023-05-19T02:24:08Z

Very nice. The char32_t issue is what I've seen as well as a blocker for the UTF-32 part.

Thanks

lublak closed this as completed Jun 18, 2021

lublak reopened this Jun 18, 2021

weltling mentioned this issue May 19, 2023

Added readBison() #54

Merged
