Initial support of context tokens (soft keywords, token value comparison operator) #8
Reference issues
Explanation
Consider the following JS example: `get` acts as a keyword inside a getter declaration such as `get area() { ... }`, yet it stays a valid identifier elsewhere, e.g. `let get = 1;`. Previously, to support the soft keyword `get`, a grammar fragment like the first sketch below had to be used. But it is more natural to check the token value on the parser side, without changing the lexer, as the second sketch shows:
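A minimal sketch of the traditional approach, assuming a JS-like combined grammar (rule names and shapes are illustrative):

```antlr
grammar JsGetterOld;

// Traditional approach: the soft keyword 'get' needs its own lexer token,
// and every rule that accepts an identifier must also list GET explicitly.
getter
    : GET id '(' ')' '{' '}'
    ;

id
    : ID
    | GET            // 'get' is still a legal identifier (e.g. "let get = 1;")
    ;

GET : 'get' ;
ID  : [a-zA-Z_$] [a-zA-Z_$0-9]* ;
WS  : [ \t\r\n]+ -> skip ;
```

And a minimal sketch of the context-token version, assuming the `ID='get'` comparison form (the positive counterpart of the `~ID='keyword'` / `ID!='keyword'` forms listed under Plans); the lexer stays untouched:

```antlr
grammar JsGetterNew;

// Context-token approach: the parser compares the token value, so ID='get'
// matches only an ID whose text is "get", while plain ID matches any identifier.
getter
    : ID='get' ID '(' ')' '{' '}'
    ;

ID : [a-zA-Z_$] [a-zA-Z_$0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```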
Context tokens are especially useful for SQL-like languages with tons of soft keywords. The user mistake I have encountered most often in the ANTLR grammars repository is defining a token in the lexer without adding it to the `id` parser rule (and I think there are still a lot of such errors in the SQL grammars). Context tokens significantly simplify grammars and reduce their size, and soft keywords are used in other languages as well (C#, Kotlin, and others).

`caseInsensitive` parser option

Regarding SQL grammars: the `caseInsensitive` parser option was also implemented:
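A minimal sketch, assuming the option is written at the grammar level as `options { caseInsensitive = true; }` and that it makes the context-token value comparison case-insensitive:

```antlr
grammar Sql;

options { caseInsensitive = true; }

// With the option enabled, the context token ID='select' is assumed to also
// match SELECT, Select, etc., while ID stays an ordinary identifier token.
selectStmt
    : ID='select' ID ID='from' ID
    ;

ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```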
How it's implemented

ANTLR tool

The ANTLR tool creates artificial tokens for all tokens defined in the context-token form. If it encounters several context tokens with the same value and type, a single token is created. For the following grammar, only a single `ID_keyword` token is created:
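A sketch of such a grammar; both rules refer to the same context token (type `ID`, value 'keyword'), so only one artificial `ID_keyword` token would be generated:

```antlr
grammar Dedup;

// The same context token appears in two rules; the tool is described as
// creating a single artificial ID_keyword token shared by both of them.
decl
    : ID='keyword' ID ';'
    ;

call
    : ID '(' ID='keyword' ')'
    ;

ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```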
ANTLR runtime

The ANTLR runtime tries to treat the next token from the input stream as a context token. If it can also be treated as a normal token, two DFA states are initialized; this is needed for resolving ambiguities. Consider the following grammar:
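A hypothetical grammar of this kind, where the first token of the rule can be read either as the context token or as a plain identifier:

```antlr
grammar Ambig;

// Both alternatives may begin with an ID whose text is 'keyword': the first
// treats it as the context token, the second as a plain identifier.
stat
    : ID='keyword' ID        // e.g. "keyword foo"
    | ID ID='keyword'        // e.g. "foo keyword" or "keyword keyword"
    ;

ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```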
With the following input:
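A hypothetical input for the sketch above, where the first token can be read as either `ID_keyword` or a plain `ID`:

```
keyword keyword
```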
When the ANTLR runtime takes the first `ID_keyword` token, the DFA is initialized with both the context and the normal token (`ID_keyword` and `ID`). In effect, it works the same way as if the grammar were defined in the ordinary way:
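A sketch of the equivalent grammar written in the ordinary way (same language as the context-token sketch above, but with a dedicated lexer token):

```antlr
grammar Ordinary;

// The same language written without context tokens: the soft keyword gets its
// own lexer token, and the identifier rule must list it explicitly.
stat
    : KEYWORD id
    | id KEYWORD
    ;

id
    : ID
    | KEYWORD            // 'keyword' remains usable as an ordinary identifier
    ;

KEYWORD : 'keyword' ;
ID      : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS      : [ \t\r\n]+ -> skip ;
```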
But the resolution is performed on the runtime side.
For performance reasons, every context token is placed into a String-to-Integer map, which gives `O(1)` complexity for the value check. Further optimizations (some kind of caching) can be implemented later.

It is worth mentioning that if the current rule is unambiguous (`LL(1)`), ANTLR generates more optimized code with a `switch` instead of an `adaptivePredict` call. This optimization was also implemented for context tokens. A related example is the following:
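A sketch of the kind of rule where this applies: each alternative starts with a distinct context token, so a single token of lookahead is enough and a `switch` can be emitted instead of an `adaptivePredict` call:

```antlr
grammar Ll1Context;

// Each alternative is selected by a different context token, so the decision
// is LL(1) and can be compiled to a switch over the (artificial) token types.
stat
    : ID='select' ID
    | ID='insert' ID
    | ID='delete' ID
    ;

ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```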
Also, see the other tests in the `ContextTokens` subdirectory. More tests should still be added (for instance, tests for error cases).

Testing
All tests are green (Java only), and it is worth mentioning that the new version can consume code generated by previous versions.
Plans
Support a negation form for the token value comparison (`~ID='keyword'` or `ID!='keyword'`).
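For illustration only, a hypothetical rule using the planned negation form exactly as quoted above (not implemented in this PR):

```antlr
grammar PlannedNegation;

// Hypothetical: match any identifier whose text is not 'keyword',
// using the planned negation operator quoted above.
nonKeywordId
    : ID!='keyword'
    ;

ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : [ \t\r\n]+ -> skip ;
```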