feat: optimize unoptimizable token types #2072
Open
Currently, when the lexer cannot apply the "first character" optimization to even a single token type, it will use an un-optimized algorithm for all token types. The same behavior occurs when using the `safeMode: true` property on lexers. While this definitely makes sense in general, it leads to major performance regressions (possibly by orders of magnitude) when a token type cannot be equipped with the `start_chars_hint` field (e.g. for Unicode regexes).
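To make this concrete, here is a minimal sketch (the token definitions are illustrative and not taken from this PR or the test suite) of how a single Unicode pattern prevents the first-character analysis:

```ts
import { createToken, Lexer } from "chevrotain";

// An ordinary pattern: the possible first characters ("0"-"9") can be derived,
// so this token type participates in the "first character" optimization.
const Integer = createToken({ name: "Integer", pattern: /\d+/ });

// A Unicode property escape: the set of possible first characters cannot be
// derived from the pattern, so this token type is non-optimizable unless a
// `start_chars_hint` is provided.
const UnicodeWord = createToken({
  name: "UnicodeWord",
  pattern: /\p{Letter}+/u,
});

// Before this change, the single non-optimizable UnicodeWord token type
// disabled the "first character" optimization for the whole lexer.
const lexer = new Lexer([Integer, UnicodeWord]);
const result = lexer.tokenize("héllo42");
```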
This PR applies a major optimization for the case where the "first character" optimization fails: when encountering a non-optimizable token type, the lexer will now attempt to match this token type on every tokenization attempt, in addition to the optimized token types. More generally, the lexer will attempt all non-optimizable token types together with the optimized token types, in the order in which they appear (this is important, as we would likely get unexpected behavior otherwise!). If there is no optimized token type for the character at the current position, the lexer will now attempt to match all non-optimizable token types, only yielding a lexer error in case none of them match.
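The following is a hypothetical sketch of the matching order described above; it is not the actual lexer.ts code, only an illustration of the described behavior (the names `TokenCandidate`, `candidatesFor`, `optimizedMap`, and `defIdx` are made up):

```ts
// Hypothetical sketch -- not the lexer.ts implementation.
interface TokenCandidate {
  name: string;
  pattern: RegExp;
  defIdx: number; // position in the token type list passed to the Lexer
}

function candidatesFor(
  charCode: number,
  optimizedMap: Map<number, TokenCandidate[]>, // first character -> optimized token types
  unoptimized: TokenCandidate[] // token types with no known first characters
): TokenCandidate[] {
  const optimized = optimizedMap.get(charCode);
  if (optimized === undefined) {
    // No optimized token type can start at this character: only the
    // non-optimizable token types are attempted; if none of them match,
    // the lexer reports a lexing error.
    return unoptimized;
  }
  // The non-optimizable token types are attempted together with the optimized
  // ones, in the order in which they were originally defined, since that order
  // decides which token type wins when several could match.
  return [...optimized, ...unoptimized].sort((a, b) => a.defIdx - b.defIdx);
}
```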
This new behavior mirrors the current behavior perfectly, and none of the existing tests can detect that anything has changed in the lexer behavior. This change is a pure performance improvement for non-optimizable token types. Of course, I added new tests to assert that the new behavior works as expected. Furthermore, using `ensureOptimizations: true` when creating a lexer still throws an error when attempting to use non-optimizable tokens, as this is not a perfect optimization (the best optimization is still to use `start_chars_hint` for those tokens).
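As a hedged sketch of that last point (the hint characters below are illustrative and incomplete; in practice the hint should list every character a match can start with, otherwise matches would be missed):

```ts
import { createToken, Lexer } from "chevrotain";

const UnicodeWord = createToken({
  name: "UnicodeWord",
  pattern: /\p{Letter}+/u,
});

// `ensureOptimizations: true` still rejects the non-optimizable token type,
// because the fallback introduced here is slower than the real "first
// character" optimization. The following line is expected to throw:
// new Lexer([UnicodeWord], { ensureOptimizations: true });

// Supplying `start_chars_hint` remains the best option: it tells the lexer
// which characters a match can start with even though the pattern itself
// cannot be analyzed, so the optimization applies again.
const HintedUnicodeWord = createToken({
  name: "HintedUnicodeWord",
  pattern: /\p{Letter}+/u,
  start_chars_hint: ["a", "ä", "é"], // illustrative and incomplete
});

const lexer = new Lexer([HintedUnicodeWord], { ensureOptimizations: true });
```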
I haven't written a benchmark for Chevrotain yet to tell whether this yields the expected performance improvements for non-optimizable tokens, but I can already tell that it does not decrease performance for the normal use case (I'm not sure why the performance for JSON increases; I believe the first test that I run is always a bit slower, even when running everything multiple times).
Note that `lexer.ts` currently does not have 100% test coverage. My new tests cover the newly added/changed lines, but do not increase the overall test coverage to 100%.