feat: optimize unoptimizable token types #2072
Open
Currently, when the lexer cannot apply the "first character" optimization to even a single token type, it will use an un-optimized algorithm for all token types. The same behavior occurs when using the `safeMode: true` property on lexers. While this definitely makes sense in general, it leads to major performance regressions (possibly by orders of magnitude) when a token type cannot be equipped with the `start_chars_hint` field (e.g. for Unicode regexes).
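To make this concrete, here is a minimal sketch (the token definitions are illustrative and not taken from this PR or the test suite) of how a single Unicode pattern prevents the first-character analysis:

```ts
import { createToken, Lexer } from "chevrotain";

// An ordinary pattern: the possible first characters ("0"-"9") can be derived,
// so this token type participates in the "first character" optimization.
const Integer = createToken({ name: "Integer", pattern: /\d+/ });

// A Unicode property escape: the set of possible first characters cannot be
// derived from the pattern, so this token type is non-optimizable unless a
// `start_chars_hint` is provided.
const UnicodeWord = createToken({
  name: "UnicodeWord",
  pattern: /\p{Letter}+/u,
});

// Before this change, the single non-optimizable UnicodeWord token type
// disabled the "first character" optimization for the whole lexer.
const lexer = new Lexer([Integer, UnicodeWord]);
const result = lexer.tokenize("héllo42");
```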
This PR applies a major optimization for the case where the "first character" optimization fails: when encountering a non-optimizable token type, the lexer will now attempt to match this token type on every tokenization attempt, in addition to the optimized token types. More generally, the lexer will attempt all non-optimizable token types together with the optimized token types, in the order in which they appear (this is important, as we would likely get unexpected behavior otherwise!). If there is no optimized token type for the character at the current position, the lexer will now attempt to match all non-optimizable token types, only yielding a lexer error in case none of them match.
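The following is a hypothetical sketch of the matching order described above; it is not the actual lexer.ts code, only an illustration of the described behavior (the names `TokenCandidate`, `candidatesFor`, `optimizedMap`, and `defIdx` are made up):

```ts
// Hypothetical sketch -- not the lexer.ts implementation.
interface TokenCandidate {
  name: string;
  pattern: RegExp;
  defIdx: number; // position in the token type list passed to the Lexer
}

function candidatesFor(
  charCode: number,
  optimizedMap: Map<number, TokenCandidate[]>, // first character -> optimized token types
  unoptimized: TokenCandidate[] // token types with no known first characters
): TokenCandidate[] {
  const optimized = optimizedMap.get(charCode);
  if (optimized === undefined) {
    // No optimized token type can start at this character: only the
    // non-optimizable token types are attempted; if none of them match,
    // the lexer reports a lexing error.
    return unoptimized;
  }
  // The non-optimizable token types are attempted together with the optimized
  // ones, in the order in which they were originally defined, since that order
  // decides which token type wins when several could match.
  return [...optimized, ...unoptimized].sort((a, b) => a.defIdx - b.defIdx);
}
```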
This new behavior mirrors the current behavior perfectly, and none of the existing tests can detect that anything has changed in the lexer behavior. This change is a pure performance improvement for non-optimizable token types. Of course, I added new tests to assert that the new behavior works as expected. Furthermore, using `ensureOptimizations: true` when creating a lexer still throws an error when attempting to use non-optimizable tokens, as this is not a perfect optimization (the best optimization is still to use `start_chars_hint` for those tokens).
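As a hedged sketch of that last point (the hint characters below are illustrative and incomplete; in practice the hint should list every character a match can start with, otherwise matches would be missed):

```ts
import { createToken, Lexer } from "chevrotain";

const UnicodeWord = createToken({
  name: "UnicodeWord",
  pattern: /\p{Letter}+/u,
});

// `ensureOptimizations: true` still rejects the non-optimizable token type,
// because the fallback introduced here is slower than the real "first
// character" optimization. The following line is expected to throw:
// new Lexer([UnicodeWord], { ensureOptimizations: true });

// Supplying `start_chars_hint` remains the best option: it tells the lexer
// which characters a match can start with even though the pattern itself
// cannot be analyzed, so the optimization applies again.
const HintedUnicodeWord = createToken({
  name: "HintedUnicodeWord",
  pattern: /\p{Letter}+/u,
  start_chars_hint: ["a", "ä", "é"], // illustrative and incomplete
});

const lexer = new Lexer([HintedUnicodeWord], { ensureOptimizations: true });
```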
I haven't written a benchmark for Chevrotain yet to tell whether this yields the expected performance improvements for non-optimizable tokens, but I can already tell that it does not decrease performance for the normal use case (I'm not sure why the performance for JSON increases; I believe the first test that I run is always a bit slower, even when running everything multiple times).
Note that `lexer.ts` currently does not have 100% test coverage. My new tests cover the newly added/changed lines, but do not increase the overall test coverage to 100%.