Decouple parser from PNode with DOD approach #423

saem · 2022-08-31T19:57:05Z

Upgrade the parser (and possibly the lexer/tokenizer) to follow a more data-oriented design (DOD) approach, as taken in the VM and the new backend (in progress). The semantic analysis layer won't be able to take advantage of it yet, but other tools, such as source filters and the pretty-printer (`nimpretty`), might.

The approach at the outset:

- decouple the parser from `PNode`
- improve parser data types to have a more data-oriented approach and get it working
- look at the DOD approach and see what rough edges/misfeatures need fixing, replacement, or removal; do some of those

NB: like any plan, revise and improve the approach as we learn more. At the end we should have a faster parser that is kinder to memory, with more tests/validation against the grammar, fewer oddities in the grammar, clearer node kinds, decoupling from semantic analysis, and a path for tools like source filters, formatters, etc. to make effective use of the new parser easily.

Deets

Decouple the parser from `PNode`

The parser is currently coupled to `PNode`, which takes a sea-of-nodes-on-the-heap approach. First step: make a `ParsedNode` (or whatever) that's basically a very slim `PNode`, add a proc to convert it to `PNode`, and get everything working as before.
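A rough sketch of what this first step could look like, written in Python for brevity rather than in the compiler's Nim. Everything here is an assumption made for illustration: the `PNode` stand-in, the field names (`kind`, `text`, `sons`, `typ`, `flags`), and the `to_pnode` helper are hypothetical and do not mirror the compiler's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PNode:
    """Stand-in for the compiler's heavyweight node (hypothetical fields)."""
    kind: str
    text: str = ""
    sons: List["PNode"] = field(default_factory=list)
    typ: Optional[str] = None        # would be filled in by semantic analysis
    flags: frozenset = frozenset()   # sem-related state the parser never needs


@dataclass
class ParsedNode:
    """Slim parser-only node: no type, symbol, or flag fields."""
    kind: str
    text: str = ""
    sons: List["ParsedNode"] = field(default_factory=list)


def to_pnode(node: ParsedNode) -> PNode:
    """Convert parser output into the node type the rest of the compiler expects."""
    return PNode(kind=node.kind, text=node.text,
                 sons=[to_pnode(son) for son in node.sons])


# Parser output for something like `1 + 2`, converted for downstream consumers.
parsed = ParsedNode("infix", sons=[
    ParsedNode("ident", "+"),
    ParsedNode("intLit", "1"),
    ParsedNode("intLit", "2"),
])
converted = to_pnode(parsed)
```

The point of the split is that only the conversion function needs to know about both types, so the parser itself can later change its internal representation without touching semantic analysis.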

Improve parser data types

Start revising the parser-produced data types toward a more DOD style. Start with a `ParsedFragment` that is a bit more than a sequence of parsed nodes (minimal objects) and some metadata. Internally it should be a DOD layout and be produced by `parseAll`, `parseTopLevelStmt`, etc. A fragment could be some arbitrary string of AST or a whole file/module, statement, etc. The idea is that this is the parser's production for a single parse action. If a whole-file vs. fragment distinction proves to be useful, then do that.
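One possible shape for such a fragment, sketched in Python rather than Nim. This is only an illustration under stated assumptions: the parallel-array layout, the pre-order storage with subtree sizes, and the names `begin`, `finish`, and `children` are all hypothetical choices, not something the issue prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedFragment:
    """Result of one parse action, stored as parallel arrays (hypothetical layout)."""
    source_name: str                                         # metadata about the input
    kinds: List[str] = field(default_factory=list)           # node kind, one per node
    texts: List[str] = field(default_factory=list)           # identifier/literal text
    subtree_sizes: List[int] = field(default_factory=list)   # node count incl. itself

    def begin(self, kind: str, text: str = "") -> int:
        """Append a node in pre-order and return its index (a leaf until finished)."""
        self.kinds.append(kind)
        self.texts.append(text)
        self.subtree_sizes.append(1)
        return len(self.kinds) - 1

    def finish(self, index: int) -> None:
        """Record that every node appended since `index` belongs to its subtree."""
        self.subtree_sizes[index] = len(self.kinds) - index

    def children(self, index: int) -> List[int]:
        """Indices of the direct children of the node at `index`."""
        result = []
        child = index + 1
        end = index + self.subtree_sizes[index]
        while child < end:
            result.append(child)
            child += self.subtree_sizes[child]  # skip over the child's own subtree
        return result


# Fragment for something like `echo 1 + 2`.
frag = ParsedFragment("example.nim")
command = frag.begin("command")
frag.begin("ident", "echo")
infix = frag.begin("infix")
frag.begin("ident", "+")
frag.begin("intLit", "1")
frag.begin("intLit", "2")
frag.finish(infix)
frag.finish(command)
```

Nodes are plain integers into the arrays, so a fragment is a handful of contiguous sequences instead of one heap allocation per node.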

The first-cut data type will be easy; getting the parser code to use it is where it gets interesting. Focus on getting it working and keeping the data model as clean as possible. Mark all the weirdness so we know what to come back to next.

Smooth out rough edges

The above will get it working; now we clean things up, learning from the design smells showing up in the data model.

Final remarks on approach

This doesn't have to be a big bang. Quick wins regardless of this work should be PR'd separately ahead of time. If the first part of the approach doesn't hurt performance too much then that can come in on its own. If the second is clean enough then ditto. Other intermediate milestones might be discovered and get pulled in early.

todo steps added by @haxscramper, collected from the discussion in #425

- Clean up the parser implementation - this PR. It introduces another slowdown of the parsing stage, but not too drastic, and immediately enables decoupling.
- Throw away nimpretty hacks - `nimpretty` #113
- Transition lexer to use DOD approach - I'm thinking of a global `FileIndex -> seq[Token]` mapping that is generated by the lexer. Each token includes a line, a column, tag information, and an extent range.
  - `ParsedNode` includes a `(FileIndex, TokenIndex)` pair that is used to get the token back from the main storage.
  - No tokens are discarded from the tokenizer - non-documentation comments included, obviously. I'm not exactly sure about the INDENT/DEDENT/SAME-INDENT token set, but I think we can also retain them.
- Add reduced parsed node set - based on the `TNodeKind`, but without sem-related data - while `ParsedNode` and `PNode` are structurally similar, the latter has a lot more states.
- Change the `ParsedNode` to use a DOD structure as well
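The token storage idea from the lexer item above could be sketched roughly like this. It is written in Python rather than Nim, and the concrete field names, the `token_store` dictionary, and the `token_of` helper are assumptions made for illustration; the tokens are filled in by hand instead of by a real lexer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    kind: str     # tag, e.g. "ident", "intLit", "comment"
    line: int     # 1-based line number
    column: int   # 0-based column
    start: int    # extent: offset of the first character
    end: int      # extent: offset one past the last character


# Global storage filled by the lexer: file index -> every token of that file.
token_store: Dict[int, List[Token]] = {}


@dataclass(frozen=True)
class ParsedNode:
    kind: str
    file_index: int    # which file the node came from
    token_index: int   # position of its token in that file's token list


def token_of(node: ParsedNode) -> Token:
    """Get the token back from the main storage."""
    return token_store[node.file_index][node.token_index]


source = "x = 1 # note"
token_store[0] = [
    Token("ident", 1, 0, 0, 1),
    Token("equals", 1, 2, 2, 3),
    Token("intLit", 1, 4, 4, 5),
    Token("comment", 1, 6, 6, 12),   # non-documentation comment is retained
]

literal = ParsedNode("intLit", file_index=0, token_index=2)
tok = token_of(literal)
literal_text = source[tok.start:tok.end]
```

Because the comment token stays in the store, a formatter could recover it from the token list even though no parsed node points at it.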


saem · 2022-08-31T19:57:54Z

@haxscramper took a bit, but I took a first crack at it. Wrote it quickly on my phone during lunch, so hopefully it's not too bad.

haxscramper · 2022-08-31T20:03:29Z

I think "Decouple the parser from PNode", "Improve parser data types" and follow-ups can be implemented as three separate PRs - the first one is rather easy, so I can start working on it right away, and then we can look closer at the DOD rewrite.

The main change - introduce a `ParsedNode` type which replaces `PNode` in the parser. This change allows for further work on decoupling `sem` from other parts of the compiler, making it easier to implement improvements in a way that would not rip through the whole codebase and test suite. Right now the introduced type closely mimics its `PNode` counterpart, but this is just a temporary measure for the transition period.

This commit is part of a multi-step series - the full list can be seen in the related issue nim-works#423

* Documentation changes

- Add missing documentation for changes in the earlier commit, add more how-tos to the debugging section (I haven't coded in a while, so it was especially important to write down explanations for anything I had trouble with) nim-works@602367b

* Tangentially related refactoring work

- Clean up the `passes.nim` implementation a bit - despite a common misconception (at least seemingly shared by many of the previous authors of the codebase), longer variable names actually *do* increase readability. Also, the infamous recommendations for "structured programming" do not really mesh with the proliferation of `break` statements in the code.
- Add todo/bug comment for the main processing loop bug related to the phase ordering in `compiler/sem/passes.nim:234`

* Debugging tools improvements

- Implement `astrepr.nim` support for the `ParsedNode` and `PIdent` - `debug` and `treeRepr` procedures.
- Allow skipping repeated symbols in the `(open|closed)SymChoice` node kinds in `astrepr`
- Restructure imports of `astrepr` and move it closer to the 'primitive' modules - type definitions and trivial data queries. The most important change is the removal of the `ast.nim` and `renderer.nim` imports, which opens these modules for debugging as well.
- Consider the possibility of a nil `owner` in the symbol owner chain representation calculations in `astrepr`
- Semantic tracer debug output file rotation now uses the location of the first `.define(` call as a file name base instead of integer-based ones. Added basic logging information about created files - now a developer can see what is going on and what gets written. For example, running with `--define=nimCompilerDebugTraceDir=/tmp` and several `define(...)` sections produces the following output:

```
comparisons.nim(269, 8): opening /tmp/comparisons_nim_0 trace
comparisons.nim(274, 7): closing trace, wrote 44 records
comparisons.nim(276, 8): opening /tmp/comparisons_nim_1 trace
comparisons.nim(285, 7): closing trace, wrote 329 records
```

- Simplify implementation of the `reportInst` handling in the debug utils tracer - now each toplevel tracer template must submit the location by itself - this solution avoids the unintuitive and fragile `instLoc(-5)` call, which might break with more templates introduced. Also updated documentation on the `reportInst` and `reportFrom` in the reports file.
- compiler/front/options.nim:693 :: Unconditionally output debugging traces if they are requested, regardless of the surrounding hooks and filters. Introduce the `bypassWriteHookForTrace` flag in the debugging hack controller which makes it possible to bypass the `writeln` hook.

* Further work

- compiler/ast/parser.nim:744 :: introduce two tokens in order to handle custom literals. There is no real need to mash together everything in a single chunk of text that would have to be split apart down the line.

425: decouple parser from PNode r=saem a=haxscramper wip implementation of the #423 Co-authored-by: haxscramper <haxscramper@gmail.com>

haxscramper · 2022-09-04T09:00:49Z

todo items from the PR

- Add node kinds for the different literal bases - binary, decimal, octal, hexadecimal. This will be a 4x increase in the number of parsed literal node kinds, but will normalize the data and allow `ParsedNode` to be a simple `[tag, left, right]` tuple of enum/integer.
- `proc transitionSonsKind*(n: ParsedNode, kind: TNodeKind) =` should check for transition validity. Ideally the parser should not need midway transitions - adding a couple more node kinds that describe the different states and doing the translation to `PNode` later is the preferred way.
- https://github.com/nim-works/nimskull/blob/devel/compiler/ast/ast_query.nim#L114
- `isBlockArg*: bool` - new node kind for this distinction
- Overhaul and document the different enums that are defined in the `ast_query.nim` file
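To illustrate the first item, here is a small sketch of per-base literal tags combined with a flat `(tag, left, right)` node. It is Python rather than Nim, and the tag names, the use of `left` as a token index, and the prefix handling are assumptions for the example; the prefix rules are deliberately simplified and do not cover every literal form the real lexer accepts.

```python
from enum import IntEnum
from typing import List, NamedTuple


class NodeTag(IntEnum):
    INT_LIT_DEC = 0
    INT_LIT_HEX = 1
    INT_LIT_OCT = 2
    INT_LIT_BIN = 3


class Node(NamedTuple):
    tag: NodeTag
    left: int    # meaning depends on the tag; for literals: index of the token text
    right: int   # unused for literals in this sketch


BASE_OF = {
    NodeTag.INT_LIT_DEC: 10,
    NodeTag.INT_LIT_HEX: 16,
    NodeTag.INT_LIT_OCT: 8,
    NodeTag.INT_LIT_BIN: 2,
}

PREFIX_TAGS = {
    "0x": NodeTag.INT_LIT_HEX,
    "0o": NodeTag.INT_LIT_OCT,
    "0b": NodeTag.INT_LIT_BIN,
}


def literal_node(text: str, token_index: int) -> Node:
    """Pick the tag from the literal's prefix; the node stores only integers."""
    tag = PREFIX_TAGS.get(text[:2].lower(), NodeTag.INT_LIT_DEC)
    return Node(tag, token_index, 0)


def literal_value(node: Node, token_texts: List[str]) -> int:
    """Decode the literal later, using the base implied by the tag."""
    text = token_texts[node.left].replace("_", "")
    digits = text if node.tag is NodeTag.INT_LIT_DEC else text[2:]
    return int(digits, BASE_OF[node.tag])


token_texts = ["0xFF", "0b1010", "0o17", "1_000"]
nodes = [literal_node(text, i) for i, text in enumerate(token_texts)]
values = [literal_value(node, token_texts) for node in nodes]
```

With the base encoded in the tag, the node needs no extra field for it, which is what keeps the tuple down to an enum and two integers.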

saem added the refactor Implementation refactor label Aug 31, 2022

saem added this to Simplify the Language to a Workable Core Aug 31, 2022

haxscramper self-assigned this Aug 31, 2022

haxscramper mentioned this issue Sep 1, 2022

decouple parser from PNode #425

Merged

bors bot added a commit that referenced this issue Sep 4, 2022

Merge #425

f4b8323

