Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Graph Builder from tree-sitter-graph #918

Conversation

AntonyBlakey
Copy link
Contributor

@AntonyBlakey AntonyBlakey commented Apr 3, 2024

Includes metaslang_graph_builder aka tree-sitter-graph.

Includes some WIP e.g. tests that will be integrated into testlang once this is accepted.

Copy link

changeset-bot bot commented Apr 3, 2024

⚠️ No Changeset found

Latest commit: b70a110

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@AntonyBlakey AntonyBlakey changed the title AntonyBlakey/integrate-stack-graphs Integrate-stack-graphs Apr 3, 2024
@AntonyBlakey AntonyBlakey changed the title Integrate-stack-graphs Integrate Stack Graphs Apr 3, 2024
@AntonyBlakey AntonyBlakey self-assigned this Apr 3, 2024
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 5 times, most recently from 76acfcf to 940bd76 Compare April 9, 2024 11:40
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 2 times, most recently from d6b6e82 to a852b24 Compare April 25, 2024 12:12
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 4 times, most recently from 14b7b47 to fe48dce Compare May 14, 2024 07:55
@AntonyBlakey AntonyBlakey marked this pull request as ready for review May 14, 2024 07:56
@AntonyBlakey AntonyBlakey requested a review from a team as a code owner May 14, 2024 07:56
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from fe48dce to 6759561 Compare May 14, 2024 08:07
@AntonyBlakey AntonyBlakey changed the title Integrate Stack Graphs Integrate Graph Builder from tree-sitter-graphs May 14, 2024
@AntonyBlakey AntonyBlakey changed the title Integrate Graph Builder from tree-sitter-graphs Integrate Graph Builder from tree-sitter-graph May 14, 2024
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from 6759561 to e92dd55 Compare May 15, 2024 07:42
@AntonyBlakey AntonyBlakey marked this pull request as draft May 15, 2024 16:02
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from e92dd55 to a6f43e3 Compare May 16, 2024 14:14
@AntonyBlakey AntonyBlakey marked this pull request as ready for review May 16, 2024 14:22
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 2 times, most recently from e2ec358 to ac55379 Compare May 17, 2024 08:49
Copy link
Contributor

@Xanewok Xanewok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary review, didn't look at the query parsing change, changes to generic CST and the actual graph builder itself yet.

Comment on lines 28 to 34
// line separator
|| c == '\u{2028}'
// paragraph separator
|| c == '\u{2029}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the language constitutes as a newline affects parsing but in the case of calculating a correct line number we must make sure that this will be consistent across multiple text editors or other tools that might ingest that data.

Both Vim and VS Code treat LS and PS as whitespace; we'd need to test more editors, but I think we should probably only account for LF/CR/CRLF here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to be consistent with other tools, that obey the official solidity definition of a line break, I'm choosing to use the solidity definition here. This is not consistent across languages, nor across editors or other text tools (wc ?), but I think here we should be merely pragmatic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECMAScript also uses these and I was curious what the behaviour is, so I went ahead and tried starting the code with LS and curiously enough, the tsc really does treat it as a logical newline in the diagnostics:

image
image

While it seems peculiar, I rest my case then but it'd probably be good to add a comment that it explicitly matches Solidity to clear any possible confusion; long-term, it should be dependent on the language definition but let's not go there now 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'rest my case' would mean that you were proved correct :), but I think you mean that you accept we should use LS and PS as line break?

I agree we should use the language def, but that will never be explicitly synced with the editor or tool behaviour (apart from in metaslang!), it will only coincidentally match until the world 100% conforms to unicode.

Copy link
Contributor

@Xanewok Xanewok May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my bad then! Meant to say that I've resolved my concern.

I went down the rabbit hole while I was looking at #918 (comment) to see how they treat bare \r in general, and I found that solc does some really funny stuff as usual:

I'll start with the fact that despite defining the PS/LS/NEL sequences as terminating the string literals wrt. lines, you can't reasonably use PS/LS in general in the code, because (at least when I tested it) it just throws a parser error anyway 😅 .

They use byte offsets almost exclusively, even as the range for the errors in the Standard JSON output, so they escape the problem of how to interpret lines in the majority of cases. However, when they need the lines as well, they use CharStream::translatePositionToLineColumn, which defines the lines as... literally just the count of the \n characters. That's what their built-in LSP uses, so they get correct lines in the editor, despite the presence of PS/LS (ignoring the fact that it means a parse error anyway).

So, with all that said, for anything that needs lines, I'd say that we probably should just mirror what solc does and only count \n, because it's simpler and what they do (and PS/LS is unusable). What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using just one char and ignoring \r\n would have been simpler, but I've done the work now.

I'm wondering if this is a case where we do the right thing, rather than the solc thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that doing the "right" thing here is actually not a pragmatic choice, i.e. we will get incorrect line numbers in almost all of the editors. However, given that any Solidity code using LS/PS seems to be invalid anyway, I don't want to block this as hopefully this is a non-issue anyway.

}
Self { utf8, utf16, char }
result
}
}

impl From<&String> for TextIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine for convenience but I think we should probably get rid of this impl altogether; the only reason we have it is because we don't autoderef String through Rc to &str; see #964

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't do that in this PR

pub fn advance(&mut self, c: char, next: Option<&char>) {
self.utf8 += c.len_utf8();
self.utf16 += c.len_utf16();
if c == '\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nitpick: This could probably be rewritten for more clarity as

match (c, next) {
  ('\n', _) => ...,
  ('\r', next) if next != Some(&'\n') => ...
  _ => ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note, ironically, that the code you suggest hides an error - '\r' followed by '\n' shouldn't advance the column. So even thouogh I prefer match and your solution, I'm not sure that it is more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, this was written quickly. Something like this, then:

match (c, next) {
  ('\n' | '\u{2028}' | '\u{2029}, _) => ...,
  ('\r', Some('\n')) => {/* Ignore for now, we will already advance the column once we skip to \n, per above */},
  ('\r', _) => ...
  _ => ...

which should be more clear; the nested condition in the \r was subtle enough that it I didn't completely get it while glancing at the code, whereas this should be more clear with the comment etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also see #918 (comment); there's a chance that simply accounting for \n is the pragmatic choice here.


use serde::Serialize;

#[derive(Default, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
#[derive(Default, Hash, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
pub struct TextIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing we're bundling everything into a single type for convenience, lack of perf numbers and velocity but I imagine that having just the existing TextIndex, without having to always track/calculate the column, that allows to quickly slice into the contents buffer will be useful (we might emulate this with line: 0 but that's not technically correct).

Do you think we should have a separate TextPosition that might bundle the the existing/hypothetical TextIndex with the line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a decision dependent on perf analysis IMO.

}

/// Returns a slice of the children (not all descendants) of this node.
pub fn children(&self) -> &[Edge<T>] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style/general nit: Rather than having two functions with different names on the same concrete type, I think it'd be cleaner to:

  • keep the more obvious or related more immediately to the concrete type terms like children (since this is parent/child 'contains' relationship rather than a general graph that happens to be a tree)
  • provide a separate, more abstract trait, which this concrete type would implement, that might provide a more abstract functionality like edges.

This way, one could use the higher-level abstraction using the traits (I imagine we will want to define that for subsequent tree/node-traversable types) if they need but if they're using a concrete type, they will use the immediate/more appropriate name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also expose labeled_children as well, if we aim to keep the symmetry here.

Copy link
Contributor Author

@AntonyBlakey AntonyBlakey May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be edge in any case ...

children should be the nodes at the end of all edges from this node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a drive-by. We have two other issues looking at this (renaming, api review)

@@ -62,21 +90,43 @@ impl<T: KindTypes> Node<T> {
Self::Token(Rc::new(TerminalNode { kind, text }))
}

pub fn id(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should add a comment stating that it's supposed to be an opaque identifier that is not guaranteed to be stable across different versions; I know we were discussing potentially using an abstract NodeId here rather than the pointer we have now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the persistence semantics of this id. Should it be the same if you parse the same source twice for example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, but it's important not to make any assumption about the values - now we'll have 0x80001256189, at some point we might switch to lower IDs like 1,2,3 so I think it's worth clarifying this intent to not give any funny idea for the caller to bitpack it themselves at some point etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment

@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch 3 times, most recently from fb9c296 to 77c63c1 Compare May 17, 2024 12:23
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from 77c63c1 to 8f95287 Compare May 17, 2024 14:13
@AntonyBlakey AntonyBlakey requested a review from Xanewok May 17, 2024 14:14
@AntonyBlakey AntonyBlakey force-pushed the AntonyBlakey/integrate-stack-graphs branch from 8f95287 to b70a110 Compare May 23, 2024 13:43
@AntonyBlakey AntonyBlakey added this pull request to the merge queue May 23, 2024
Merged via the queue into NomicFoundation:main with commit 7adf17a May 23, 2024
1 check passed
@AntonyBlakey AntonyBlakey deleted the AntonyBlakey/integrate-stack-graphs branch May 23, 2024 14:00
@@ -0,0 +1,37 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need the vscode extension here? I think we discussed removing/not including it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am intending to enable it

github-merge-queue bot pushed a commit that referenced this pull request May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants