Integrate Graph Builder from `tree-sitter-graph` #918

AntonyBlakey · 2024-04-03T14:49:44Z

Includes metaslang_graph_builder aka tree-sitter-graph.

Includes some WIP e.g. tests that will be integrated into testlang once this is accepted.

changeset-bot · 2024-04-03T14:49:49Z

⚠️ No Changeset found

Latest commit: b70a110

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Xanewok

Preliminary review, didn't look at the query parsing change, changes to generic CST and the actual graph builder itself yet.

Xanewok · 2024-05-17T10:10:16Z

crates/metaslang/cst/src/text_index.rs

+            // line separator
+            || c == '\u{2028}'
+            // paragraph separator
+            || c == '\u{2029}'


What the language constitutes as a newline affects parsing but in the case of calculating a correct line number we must make sure that this will be consistent across multiple text editors or other tools that might ingest that data.

Both Vim and VS Code treat LS and PS as whitespace; we'd need to test more editors, but I think we should probably only account for LF/CR/CRLF here.

In order to be consistent with other tools, that obey the official solidity definition of a line break, I'm choosing to use the solidity definition here. This is not consistent across languages, nor across editors or other text tools (wc ?), but I think here we should be merely pragmatic.

ECMAScript also uses these and I was curious what the behaviour is, so I went ahead and tried starting the code with LS and curiously enough, the tsc really does treat it as a logical newline in the diagnostics:

While it seems peculiar, I rest my case then but it'd probably be good to add a comment that it explicitly matches Solidity to clear any possible confusion; long-term, it should be dependent on the language definition but let's not go there now 😅

'rest my case' would mean that you were proved correct :), but I think you mean that you accept we should use LS and PS as line break?

I agree we should use the language def, but that will never be explicitly synced with the editor or tool behaviour (apart from in metaslang!), it will only coincidentally match until the world 100% conforms to unicode.

Ah, my bad then! Meant to say that I've resolved my concern.

I went down the rabbit hole while I was looking at #918 (comment) to see how they treat bare \r in general, and I found that solc does some really funny stuff as usual:

I'll start with the fact that despite defining the PS/LS/NEL sequences as terminating the string literals wrt. lines, you can't reasonably use PS/LS in general in the code, because (at least when I tested it) it just throws a parser error anyway 😅 .

They use byte offsets almost exclusively, even as the range for the errors in the Standard JSON output, so they escape the problem of how to interpret lines in the majority of cases. However, when they need the lines as well, they use CharStream::translatePositionToLineColumn, which defines the lines as... literally just the count of the \n characters. That's what their built-in LSP uses, so they get correct lines in the editor, despite the presence of PS/LS (ignoring the fact that it means a parse error anyway).

So, with all that said, for anything that needs lines, I'd say that we probably should just mirror what solc does and only count \n, because it's simpler and what they do (and PS/LS is unusable). What do you think?

using just one char and ignoring \r\n would have been simpler, but I've done the work now.

I'm wondering if this is a case where we do the right thing, rather than the solc thing?

I'm worried that doing the "right" thing here is actually not a pragmatic choice, i.e. we will get incorrect line numbers in almost all of the editors. However, given that any Solidity code using LS/PS seems to be invalid anyway, I don't want to block this as hopefully this is a non-issue anyway.

Xanewok · 2024-05-17T10:11:01Z

crates/metaslang/cst/src/text_index.rs

        }
-        Self { utf8, utf16, char }
+        result
    }
 }

 impl From<&String> for TextIndex {


It's fine for convenience but I think we should probably get rid of this impl altogether; the only reason we have it is because we don't autoderef String through Rc to &str; see #964

I think we shouldn't do that in this PR

Xanewok · 2024-05-17T10:15:56Z

crates/metaslang/cst/src/text_index.rs

+    pub fn advance(&mut self, c: char, next: Option<&char>) {
+        self.utf8 += c.len_utf8();
+        self.utf16 += c.len_utf16();
+        if c == '\n'


style nitpick: This could probably be rewritten for more clarity as

match (c, next) { ('\n', _) => ..., ('\r', next) if next != Some(&'\n') => ... _ => ...

I note, ironically, that the code you suggest hides an error - '\r' followed by '\n' shouldn't advance the column. So even thouogh I prefer match and your solution, I'm not sure that it is more clear.

My bad, this was written quickly. Something like this, then:

match (c, next) { ('\n' | '\u{2028}' | '\u{2029}, _) => ..., ('\r', Some('\n')) => {/* Ignore for now, we will already advance the column once we skip to \n, per above */}, ('\r', _) => ... _ => ...

which should be more clear; the nested condition in the \r was subtle enough that it I didn't completely get it while glancing at the code, whereas this should be more clear with the comment etc.

Also see #918 (comment); there's a chance that simply accounting for \n is the pragmatic choice here.

Xanewok · 2024-05-17T10:34:53Z

crates/metaslang/cst/src/text_index.rs


 use serde::Serialize;

-#[derive(Default, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
+#[derive(Default, Hash, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
 pub struct TextIndex {


I'm guessing we're bundling everything into a single type for convenience, lack of perf numbers and velocity but I imagine that having just the existing TextIndex, without having to always track/calculate the column, that allows to quickly slice into the contents buffer will be useful (we might emulate this with line: 0 but that's not technically correct).

Do you think we should have a separate TextPosition that might bundle the the existing/hypothetical TextIndex with the line?

This is a decision dependent on perf analysis IMO.

Xanewok · 2024-05-17T10:41:03Z

crates/metaslang/cst/src/cst.rs

+    }
+
+    /// Returns a slice of the children (not all descendants) of this node.
+    pub fn children(&self) -> &[Edge<T>] {


style/general nit: Rather than having two functions with different names on the same concrete type, I think it'd be cleaner to:

keep the more obvious or related more immediately to the concrete type terms like children (since this is parent/child 'contains' relationship rather than a general graph that happens to be a tree)

provide a separate, more abstract trait, which this concrete type would implement, that might provide a more abstract functionality like edges.

This way, one could use the higher-level abstraction using the traits (I imagine we will want to define that for subsequent tree/node-traversable types) if they need but if they're using a concrete type, they will use the immediate/more appropriate name.

We should probably also expose labeled_children as well, if we aim to keep the symmetry here.

It should be edge in any case ...

children should be the nodes at the end of all edges from this node.

This feels like a drive-by. We have two other issues looking at this (renaming, api review)

Xanewok · 2024-05-17T10:42:00Z

crates/metaslang/cst/src/cst.rs

@@ -62,21 +90,43 @@ impl<T: KindTypes> Node<T> {
        Self::Token(Rc::new(TerminalNode { kind, text }))
    }

+    pub fn id(&self) -> usize {


We probably should add a comment stating that it's supposed to be an opaque identifier that is not guaranteed to be stable across different versions; I know we were discussing potentially using an abstract NodeId here rather than the pointer we have now.

I'm not sure about the persistence semantics of this id. Should it be the same if you parse the same source twice for example?

It shouldn't, but it's important not to make any assumption about the values - now we'll have 0x80001256189, at some point we might switch to lower IDs like 1,2,3 so I think it's worth clarifying this intent to not give any funny idea for the caller to bitpack it themselves at some point etc.

I added a comment

crates/codegen/runtime/cargo/src/runtime/generated/kinds.rs

crates/codegen/runtime/cargo/src/runtime/parse_error.rs

crates/metaslang/cst/src/query/engine.rs

crates/metaslang/cst/src/lib.rs

crates/metaslang/cst/src/text_index.rs

OmarTawfik · 2024-05-23T14:04:35Z

crates/metaslang/graph_builder/vscode/package.json

@@ -0,0 +1,37 @@
+{


I wonder if we need the vscode extension here? I think we discussed removing/not including it?

I am intending to enable it

crates/metaslang/graph_builder/Cargo.toml

crates/solidity/outputs/cargo/tests/src/doc_examples/using_queries.rs

crates/solidity/outputs/npm/tests/src/doc-examples/using-queries.ts

Following up on changes in #971 and #918.

AntonyBlakey changed the title ~~AntonyBlakey/integrate-stack-graphs~~ Integrate-stack-graphs Apr 3, 2024

AntonyBlakey changed the title ~~Integrate-stack-graphs~~ Integrate Stack Graphs Apr 3, 2024

AntonyBlakey self-assigned this Apr 3, 2024

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 5 times, most recently from 76acfcf to 940bd76 Compare April 9, 2024 11:40

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 2 times, most recently from d6b6e82 to a852b24 Compare April 25, 2024 12:12

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 4 times, most recently from 14b7b47 to fe48dce Compare May 14, 2024 07:55

AntonyBlakey marked this pull request as ready for review May 14, 2024 07:56

AntonyBlakey requested a review from a team as a code owner May 14, 2024 07:56

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from fe48dce to 6759561 Compare May 14, 2024 08:07

AntonyBlakey changed the title ~~Integrate Stack Graphs~~ Integrate Graph Builder from tree-sitter-graphs May 14, 2024

AntonyBlakey changed the title ~~Integrate Graph Builder from tree-sitter-graphs~~ Integrate Graph Builder from tree-sitter-graph May 14, 2024

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from 6759561 to e92dd55 Compare May 15, 2024 07:42

AntonyBlakey marked this pull request as draft May 15, 2024 16:02

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from e92dd55 to a6f43e3 Compare May 16, 2024 14:14

AntonyBlakey marked this pull request as ready for review May 16, 2024 14:22

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 2 times, most recently from e2ec358 to ac55379 Compare May 17, 2024 08:49

Xanewok reviewed May 17, 2024

View reviewed changes

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 3 times, most recently from fb9c296 to 77c63c1 Compare May 17, 2024 12:23

AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from 77c63c1 to 8f95287 Compare May 17, 2024 14:13

AntonyBlakey requested a review from Xanewok May 17, 2024 14:14

Xanewok approved these changes May 23, 2024

View reviewed changes