Implement (contextual) keywords and use their versioning from v2 #723

Xanewok · 2023-12-27T18:06:21Z

Closes #568

There is still one outstanding issue: we return a Vec<TokenKind> from next_token. I'd like to return a more specialized type and ideally pass it on the stack (2x2 bytes) rather than on the heap (an extra 3x8 bytes for the Vec handle, plus indirection). We should name it better and make it explicit that we can return at most 2 token kinds (a single token kind or an identifier + kw combo).

To do:

Return tokens from next_token via stack
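
To make the size argument concrete, here is a rough sketch that compares a fixed-size "at most two token kinds" value with a Vec handle. This is not code from this PR: `TokenKind` is a made-up 2-byte stand-in for the generated enum, and `AtMostTwo` is a hypothetical name used only for illustration. Exact sizes depend on the target and on layout optimizations, so the sketch prints them rather than assuming them.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the generated token kind enum (assumed to fit in 2 bytes).
#[allow(dead_code)]
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TokenKind {
    Identifier,
    FromKeyword,
    Semicolon,
}

// Hypothetical fixed-size return type: one token kind, optionally a second one.
// It is Copy and lives entirely on the stack, with no heap allocation.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct AtMostTwo {
    first: TokenKind,
    second: Option<TokenKind>,
}

fn main() {
    // The Vec handle alone is pointer + capacity + length; its elements live on the heap.
    println!("TokenKind:      {} bytes", size_of::<TokenKind>());
    println!("AtMostTwo:      {} bytes", size_of::<AtMostTwo>());
    println!("Vec<TokenKind>: {} bytes (handle only)", size_of::<Vec<TokenKind>>());
}
```

The point is only that a value with a statically known upper bound of two token kinds can be returned by value, whereas a Vec always carries a handle plus a heap allocation for its contents.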

Apart from that, I think this is a more correct approach than #598, especially accounting for the new keyword definition format in DSL v2.

The main change is that we check the keyword trie, and additionally the (newly introduced) compound keyword scanners, only after the token has been lexed as an identifier. For each context, we collect the Identifier scanners used by the keywords and attempt promotion there.
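
As an illustration of the ordering described above (identifier first, keyword promotion second), here is a minimal sketch. It is not the generated lexer: a `HashMap` stands in for the keyword trie, and `scan_identifier`, `keyword_table` and `next_token` are hypothetical names and signatures chosen for this example.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TokenKind {
    Identifier,
    FromKeyword,
    EmitKeyword,
}

/// Returns the byte length of the identifier at the start of `input`, if any.
fn scan_identifier(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    match chars.next() {
        Some((_, c)) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {}
        _ => return None,
    }
    // The identifier ends at the first character that cannot continue it.
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_' || c == '$'))
        .map_or(input.len(), |(i, _)| i);
    Some(end)
}

/// Hypothetical keyword table; the real implementation uses a trie.
fn keyword_table() -> HashMap<&'static str, TokenKind> {
    HashMap::from([
        ("from", TokenKind::FromKeyword),
        ("emit", TokenKind::EmitKeyword),
    ])
}

/// Lexes an identifier first, and only then attempts keyword promotion.
fn next_token(input: &str, keywords: &HashMap<&str, TokenKind>) -> Option<(TokenKind, usize)> {
    let len = scan_identifier(input)?;
    let kind = keywords
        .get(&input[..len])
        .copied()
        .unwrap_or(TokenKind::Identifier);
    Some((kind, len))
}

fn main() {
    let keywords = keyword_table();
    println!("{:?}", next_token("from x", &keywords));
    println!("{:?}", next_token("fromage", &keywords));
    println!("{:?}", next_token("123", &keywords));
}
```

Because the keyword lookup runs on the whole lexed identifier, input such as `fromage` is never split into a keyword plus a trailing identifier, and non-identifier input skips the keyword lookup entirely.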

From what I've seen when running the sanctuary tests, the existing lexing performance is not impacted, and I can verify (incl. CST tests) that we now properly parse source that uses contextual keywords (e.g. from) and that compound keywords (e.g. ufixedMxN) are properly versioned.

This adapts the existing codegen_grammar interface, which is a leftover from DSL v1; I did that in order to finish #638. Once this is merged and we properly parse contextual keywords, I'll move on to cleaning it up and reducing the parser codegen indirection (right now we go from v2 -> v1 model -> code generator -> Tera templates; I'd like to at least cut out the v1 model and/or simplify visiting v2 from the existing CodeGenerator).

Please excuse the WIP comments in the middle; the first and the last ones should make sense when reviewing. I can simplify this a bit for review, if needed.

changeset-bot · 2023-12-27T18:06:24Z

🦋 Changeset detected

Latest commit: 4fb350e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@nomicfoundation/slang	Minor

crates/codegen/parser/generator/src/code_generator.rs

This also removes the `scan` from the NAPI but we wanted to hide that anyways ourselves.

Xanewok · 2024-01-02T13:53:12Z

Instead of next_token returning Vec (first pass), we now introduce the following type:

pub enum ScannedToken {
    Single(TokenKind),
    IdentifierOrKeyword {
        identifier: TokenKind,
        kw: KeywordScan,
    },
}

Not only do we no longer go through the heap for the often-called next_token, we now also have to explicitly handle the identifier/keyword case and whether a keyword is strictly reserved or merely "present", i.e. usable in both keyword and identifier position.

To keep the existing parser flow mostly the same, a helper fn ScannedToken::unambiguous(self) -> TokenKind is introduced; it tries to map back to a single token kind if possible (for a "present" keyword it falls back to the general "identifier" token kind).

This also introduces another helper, fn ScannedToken::accepted_as(self, expected: TokenKind) -> bool, which is used in the parser functions as a convenience. It returns whether the scanned token is accepted in a given position, streamlining the additional underlying kw reservation checks for identifier token kinds.
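
For readers who want to see how the two helpers could behave, here is a self-contained sketch based only on the description above. It is not the code from this PR: the `TokenKind` variants are made up, the `Absent` variant of `KeywordScan` is an assumption (only `Present` and `Reserved` appear in this conversation), and the helper bodies are one plausible reading of the prose.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TokenKind {
    Identifier,
    FromKeyword,
    ContractKeyword,
    Semicolon,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum KeywordScan {
    /// Assumed variant: no keyword matched the scanned identifier.
    Absent,
    /// The keyword exists in this version but is not reserved.
    Present(TokenKind),
    /// The keyword is reserved and cannot be used as an identifier.
    Reserved(TokenKind),
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ScannedToken {
    Single(TokenKind),
    IdentifierOrKeyword { identifier: TokenKind, kw: KeywordScan },
}

impl ScannedToken {
    /// Whether the scanned token can be used where `expected` is required.
    pub fn accepted_as(self, expected: TokenKind) -> bool {
        match self {
            Self::Single(kind) => kind == expected,
            Self::IdentifierOrKeyword { identifier, kw } => match kw {
                KeywordScan::Reserved(kind) => kind == expected,
                KeywordScan::Present(kind) => kind == expected || identifier == expected,
                KeywordScan::Absent => identifier == expected,
            },
        }
    }

    /// Maps back to a single token kind; a "present" keyword falls back to the identifier.
    pub fn unambiguous(self) -> TokenKind {
        match self {
            Self::Single(kind) => kind,
            Self::IdentifierOrKeyword { identifier, kw } => match kw {
                KeywordScan::Reserved(kind) => kind,
                KeywordScan::Present(_) | KeywordScan::Absent => identifier,
            },
        }
    }
}

fn main() {
    let from = ScannedToken::IdentifierOrKeyword {
        identifier: TokenKind::Identifier,
        kw: KeywordScan::Present(TokenKind::FromKeyword),
    };
    println!("from as identifier: {}", from.accepted_as(TokenKind::Identifier));
    println!("from as keyword:    {}", from.accepted_as(TokenKind::FromKeyword));
    println!("from unambiguous:   {:?}", from.unambiguous());
}
```

Under this reading, a "present" keyword such as `from` is accepted in both positions, while a reserved keyword is accepted only as that keyword.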

Xanewok · 2024-01-02T13:55:07Z

crates/codegen/parser/runtime/src/support/scanner_macros.rs

+            $(
+                {
+                    if let result @ (KeywordScan::Present(..) | KeywordScan::Reserved(..)) = ($scanner) {
+                        if $ident.len() == $stream.position().utf8 - save.utf8 {


Not perfect, I admit; this should probably be handled with an automaton for the keywords, executed/matched on the string until it's exhausted, but let's punt on this for now.
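
To spell out what the length comparison in the snippet above guards against, here is a small sketch of the idea: a compound keyword scanner may match only a prefix of the identifier, so the match counts as a promotion only if it consumed exactly as many bytes as the identifier is long. The scanner below is a hypothetical, simplified `ufixedMxN` matcher written for this example (it does not validate the allowed values of M and N) and is not the generated scanner.

```rust
/// Hypothetical compound keyword scanner: matches a `ufixed<M>x<N>` prefix
/// and returns how many bytes it consumed. The values of M and N are not validated.
fn scan_ufixed_prefix(input: &str) -> Option<usize> {
    let rest = input.strip_prefix("ufixed")?;
    let m = rest.bytes().take_while(u8::is_ascii_digit).count();
    if m == 0 {
        return None;
    }
    let rest = rest[m..].strip_prefix('x')?;
    let n = rest.bytes().take_while(u8::is_ascii_digit).count();
    if n == 0 {
        return None;
    }
    Some("ufixed".len() + m + 1 + n)
}

/// Promotes the identifier only if the keyword scanner consumed all of it.
fn promotes_to_keyword(ident: &str) -> bool {
    scan_ufixed_prefix(ident) == Some(ident.len())
}

fn main() {
    for ident in ["ufixed8x8", "ufixed8x8_total", "ufixed"] {
        println!("{ident}: promoted = {}", promotes_to_keyword(ident));
    }
}
```

An automaton run over the already-lexed identifier until it is exhausted, as suggested in the comment, would make the separate length comparison unnecessary.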

crates/codegen/parser/runtime/src/templates/language.rs.jinja2

crates/codegen/parser/runtime/src/templates/mod.rs.jinja2

crates/solidity/outputs/npm/package/src/generated/index.d.ts

.../solidity/testing/snapshots/cst_output/EventDefinition/transfer/generated/0.4.11-success.yml

...ity/testing/snapshots/cst_output/YulVariableDeclarationStatement/keyword_ufixed8x8/input.sol

OmarTawfik

Left a few suggestions.

@Xanewok

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and publish to npm yourself or [setup this action to publish automatically](https://github.com/changesets/action#with-publishing). If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @nomicfoundation/slang@0.13.0

### Minor Changes

- [#710](#710) [`2025b6cb`](2025b6c) Thanks [@Xanewok](https://github.com/Xanewok)! - CST children nodes are now named
- [#723](#723) [`b3dc6bcd`](b3dc6bc) Thanks [@Xanewok](https://github.com/Xanewok)! - Properly parse unreserved keywords in an identifier position, i.e. `from`, `emit`, `global` etc.
- [#728](#728) [`662a672c`](662a672) Thanks [@Xanewok](https://github.com/Xanewok)! - Remove Language#scan API; use the parser API instead
- [#719](#719) [`1ad6bb37`](1ad6bb3) Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - introduce strong types for all Solidity non terminals in the TypeScript API.

### Patch Changes

- [#719](#719) [`1ad6bb37`](1ad6bb3) Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - unify Rust/TypeScript node helpers: `*_with_kind()`, `*_with_kinds()`, `*_is_kind()`), ...
- [#731](#731) [`3deaea2e`](3deaea2) Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - add `RuleNode.unparse()` to the TypeScript API

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Xanewok added 27 commits December 23, 2023 13:33

cleanup: Use BTreeMap for CodeGenerator::{scanner,parser}_functions

e2a7cc5

cleanup: Use BTreeMap for CodeGenerator::scanner_contexts

af43e8f

cleanup: Mark top_level_scanner_names as unused in templates

be32592

cleanup: Hoist the Identifier hack in PG trie code

391c596

WIP: Add a comment

cd84e1a

refactor: Clean up a bit Trie code

6c15e5d

wtf?

9c048ac

cleanup: Introduce a helper CodeGenerator::current_context fn

f3b520e

Deduplicate longest_match in Lexer::next_token

5eeb0bd

WIP

87d407a

WIP2

82d9ef6

WIP3

e2da887

WIP more

08b4396

WIP: Add some more

0746345

Don't always rescan with underlying when trying to scan keywords

7b12d89

Also optimize a bit the kw scanner function and remove some comments.

Make sure the identifier is always scanned as a last compound scanner

3f80727

clean up some bits

9ff1b90

Speed up lexing by only attempting kw promotion if it lexes as an identifier

1c82302

Bring back keyword lookup using trie

05edbe1

Simplify the trie

05f154f

Fix compound keyword promotion and add CST tests

26e55fb

cleanup: remove unnecessary now wrong_self_convention lint

a0dc824

Simplify emitted code for the compound keyword scanners

90d8d88

Remove unnecessary comment

d348dde

cleanup: Remove some WIP code

5180f98

Fix a typo

7de6920

Add more comments

e6b5c15

Xanewok requested a review from a team as a code owner December 27, 2023 18:06

AntonyBlakey requested changes Dec 28, 2023

crates/codegen/parser/generator/src/code_generator.rs (outdated, resolved)

Xanewok added 7 commits January 2, 2024 11:27

Hold the scanned kw token kind in the KeywordScan enum

8651b2e

Don't Option-wrap keyword scan results when using a trie

8dc56e8

Introduce ScannedToken to separately handle ident/kw from the scanner

0a615ec

This also removes the `scan` from the NAPI but we wanted to hide that anyways ourselves.

Rename identifier_scanners to identifier_scanner_names

581d611

Clean up a bit the resulting next_token

317ce6f

perf: Only attempt scanning a compound keyword if we didn't find one

93942ac

Add a changeset file

bbea7fe

Xanewok commented Jan 2, 2024

crates/codegen/parser/runtime/src/templates/language.rs.jinja2 (outdated, resolved)

Xanewok commented Jan 2, 2024

crates/codegen/parser/runtime/src/templates/mod.rs.jinja2 (outdated, resolved)

OmarTawfik reviewed Jan 2, 2024

crates/solidity/outputs/npm/package/src/generated/index.d.ts (outdated, resolved)

OmarTawfik reviewed Jan 2, 2024

.../solidity/testing/snapshots/cst_output/EventDefinition/transfer/generated/0.4.11-success.yml (resolved)

OmarTawfik reviewed Jan 2, 2024

...ity/testing/snapshots/cst_output/YulVariableDeclarationStatement/keyword_ufixed8x8/input.sol (resolved)

OmarTawfik reviewed Jan 2, 2024

Xanewok added 5 commits January 3, 2024 14:32

Merge remote-tracking branch 'upstream/main' into keyword-idents-take-2

7b3ec72

Add comments about specific keyword reservation in the CST snapshots

322f105

Rename identifier_scanner_names to promotable_identifier_scanners

2336ad9

Merge remote-tracking branch 'upstream/main' into keyword-idents-take-2

8fe0b94

Add more regression CST tests

4fb350e

AntonyBlakey approved these changes Jan 8, 2024

OmarTawfik approved these changes Jan 8, 2024

Xanewok added this pull request to the merge queue Jan 8, 2024

Merged via the queue into NomicFoundation:main with commit b3dc6bc Jan 8, 2024
1 check passed

Xanewok deleted the keyword-idents-take-2 branch January 8, 2024 17:27

github-actions bot mentioned this pull request Jan 8, 2024

Bump Slang Version #718

Merged

OmarTawfik mentioned this pull request May 30, 2024

Collect syntax kinds directly from DSL v2 and isolate parser generation logic #991

Merged

Xanewok mentioned this pull request Jun 4, 2024

Reduce allocations in keyword lexing #1001

Open
