Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use AST #109

Draft
wants to merge 177 commits into
base: main
Choose a base branch
from
Draft

Use AST #109

wants to merge 177 commits into from

Conversation

pherrymason
Copy link
Owner

@pherrymason pherrymason commented Jan 26, 2025

Motivation

Treesitter is used to obtain a structured representation of source code, which has been utilized to extract symbol declarations.

So far, Treesitter has been used to parse the source code to analyze and obtain a structured representation of the code: Context Syntax Tree (CST). The CST, also known as the Parse Tree, contains all the elements that are part of the text (spaces, brackets, etc.).
Although the CST can be navigated to analyze the code, it contains too much information about the language's syntax, making certain tasks more challenging.

On the other hand, using an AST (Abstract Syntax Tree) would allow working with a simpler structure that more accurately reflects the program's logic.

Strategy

Continuing to use Treesitter as a parser simplifies a large part of the work, as there is currently no other ready-to-use parser available.
One option is to convert the CST generated by Treesitter into an AST.
Subsequently, by using the Visitor pattern, it would be possible to implement different analyzers:

  • Symbol table generator
  • Diagnostics
  • Etc.

TODO

@pherrymason pherrymason marked this pull request as draft January 26, 2025 17:36
Do not ignore ERROR treesitter nodes. This reduces scenarios where incomplete source code cannot be located by cursor position while still being able to get that text content.
Refactor analysis code + tests.
Extract astContext to hold information about node under cursor being analyzed.
Protect against variable GenDecl not having type information.
…1b4b.

Grab position context by reading string under cursor instead of finding the node, as this is not always reliable due to ERROR nodes.
Fix completion by announcing methods kind as methods instead of plain functions.
Fix parsing access_ident.
… are no parse errors. Use string analysis when cursor is inside parse error.

Completion: Improve tests on enum values and methods.
Completion: solveXAtSelectorExpr does not resolve iden type so completion can decide if child symbols can be included in result or not.
Drop fromPosition and use Location instead.
Add tests for autocompleting fault constants and its methods.
@pherrymason
Copy link
Owner Author

Some notes @PgBiel
I'm finding some of your changes difficult to port.
For example, these commits: #aadb305 and #aaa629.

These take the assumption that cursor context is retrieved by reading the string under the cursor.
However, this new version tries to get the AST node under the cursor...

The issue here is that these cases are actually invalid, as they are not allowed by the grammar, and the AST generated is undefined (and might be flagged with Error).

Notes to discuss:

  • Should we fallback to old string analysis if node under cursor is flagged as Error?
  • Should disable the functionality if node under cursor is flagged as Error?

@PgBiel
Copy link
Contributor

PgBiel commented Feb 11, 2025

I haven't had the time to look at your code yet, but do you mean that Abc.VALUE. would be invalid? I'm pretty sure not, since associated values are accessible. Or perhaps you mean that you can't run completion because the extra dot at the end breaks the parser / AST node generation?

Assuming it's not the last option, my code, at those commits, basically just had some extra logic to not suggest enum members / fault constants after variables. Code like MyEnum.VALUE.VALUE is obviously invalid (and probably doesn't parse) but we don't know that until it is typed out, so being smart with the completions before suggesting them is necessary.
With the AST you could still keep some state indicating that you're searching for completions from a variable and not from the top-level enum, unless I'm missing something.

@pherrymason
Copy link
Owner Author

I haven't had the time to look at your code yet,
Don't worry, there is no commitment. This comment was more of a note for future analysis and just a discussion.

Yes, Abc.VALUE. is semi-valid, but some other scenarios are not. For example:

fn void main(){
status = WindowStatus.BACKGROUND.MINIMIZED;
}

This is not valid, but treesitter still tries to parse it, giving a tree like this:

imagen

Where the ERROR node contains the source WindowStatus.BACKGROUND. and right node MINIMIZED.
This causes a problem when trying to know what MINIMIZED relates to, because when reading right you find an incomplete ident, and the ERROR node stays there, without a clear indication of what actually is.

This treesitter tree translates to my AST C3 Tree:

AssignmentExpression{
       Left: { Ident:"status" },
       Right: {Ident: "MINIMIZED"},
}

This affects analysis done on GoTo and Completion operations because when trying to analyze the Right node, I find an incomplete identity and I'm not able to see it is a descendant of a SelectorExpression.

I could obviously store the ERROR node inside AssignmentExpression, but I should also store if it comes after or before the = to help the analyzer to interpret it, but this is just an example and the combinations where the ERROR can appear are infinite.

I'm just exposing the difficulties I'm encountering :D

@PgBiel
Copy link
Contributor

PgBiel commented Feb 12, 2025

Don't worry, there is no commitment. This comment was more of a note for future analysis and just a discussion.
...
I'm just exposing the difficulties I'm encountering :D

Oh don't worry, I was just trying to contextualize my message, i.e. it's possible that the answer to my question was obvious from the code (and sorry if so), I just didn't have the time to take a proper look at it yet. Didn't mean to imply that you were applying too much pressure, or something similar - expressing tone in online text is hard. Sorry! :)

Feel free to send further questions, I'll always try to answer them.

I could obviously store the ERROR node inside AssignmentExpression, but I should also store if it comes after or before the = to help the analyzer to interpret it, but this is just an example and the combinations where the ERROR can appear are infinite.

Ah I see the problem.

If I'm going to be entirely fair, that particular case is evidently pathological since the syntax is invalid, but it still would be nice to be able to properly recover from those weirdnesses.

I think we could blame the tree-sitter grammar here. Maybe it would be nice if they could allow any invalid ident casing anywhere, using a specific node for this, e.g. invalid_ident.

@pherrymason
Copy link
Owner Author

I guess I will still need to do string analysis on cursor position just to avoid these kind of issues...

@PgBiel
Copy link
Contributor

PgBiel commented Feb 13, 2025

Not sure if that's strictly necessary... I see two options:

  1. Request a change in the original treesitter grammar for extra flexibility;
  2. Keep a fork of it with the changes we need. This might be necessary anyway in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants