Integrate Graph Builder from tree-sitter-graph
#918
Preliminary review; I haven't looked at the query parsing change, the changes to the generic CST, or the actual graph builder itself yet.
// line separator
|| c == '\u{2028}'
// paragraph separator
|| c == '\u{2029}'
What the language treats as a newline affects parsing, but when calculating line numbers we must make sure the result is consistent across the text editors and other tools that might ingest that data.
Both Vim and VS Code treat LS and PS as whitespace; we'd need to test more editors, but I think we should probably only account for LF/CR/CRLF here.
To be consistent with other tools that obey the official Solidity definition of a line break, I'm choosing to use the Solidity definition here. This is not consistent across languages, nor across editors or other text tools (wc?), but I think we should simply be pragmatic here.
ECMAScript also uses these and I was curious what the behaviour is, so I went ahead and tried starting the code with LS and, curiously enough, tsc really does treat it as a logical newline in the diagnostics.
While it seems peculiar, I rest my case then, but it'd probably be good to add a comment that it explicitly matches Solidity to clear up any possible confusion; long-term, it should depend on the language definition but let's not go there now 😅
'Rest my case' would mean that you were proved correct :), but I think you mean that you accept we should use LS and PS as line breaks?
I agree we should use the language def, but that will never be explicitly synced with the editor or tool behaviour (apart from in metaslang!); it will only coincidentally match until the world 100% conforms to Unicode.
Ah, my bad then! Meant to say that I've resolved my concern.
I went down the rabbit hole while I was looking at #918 (comment) to see how they treat bare \r in general, and I found that solc does some really funny stuff as usual:
I'll start with the fact that, despite defining the PS/LS/NEL sequences as terminating string literals wrt. lines, you can't reasonably use PS/LS in code in general, because (at least when I tested it) it just throws a parser error anyway 😅.
They use byte offsets almost exclusively, even as the range for the errors in the Standard JSON output, so they escape the problem of how to interpret lines in the majority of cases. However, when they need the lines as well, they use CharStream::translatePositionToLineColumn, which defines the lines as... literally just the count of the \n characters. That's what their built-in LSP uses, so they get correct lines in the editor, despite the presence of PS/LS (ignoring the fact that it means a parse error anyway).
So, with all that said, for anything that needs lines, I'd say that we probably should just mirror what solc does and only count \n, because it's simpler and what they do (and PS/LS is unusable). What do you think?
Using just one char and ignoring \r\n would have been simpler, but I've done the work now.
I'm wondering if this is a case where we do the right thing, rather than the solc thing?
I'm worried that doing the "right" thing here is actually not a pragmatic choice, i.e. we will get incorrect line numbers in almost all of the editors. However, given that any Solidity code using LS/PS seems to be invalid anyway, I don't want to block this as hopefully this is a non-issue anyway.
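For reference, the LF-only approach discussed in this thread can be sketched roughly as follows. This is a minimal illustration of "count the \n characters", not a port of solc's CharStream::translatePositionToLineColumn; the function name line_column_lf_only, the zero-based numbering and the byte-based column are all assumptions made for the example.

```rust
/// Hypothetical helper: translates a byte offset into a zero-based
/// (line, column) pair, treating only `\n` as a line terminator.
/// The column is counted in bytes here, which is an assumption.
pub fn line_column_lf_only(source: &str, offset: usize) -> (usize, usize) {
    // Clamp so that an out-of-range offset maps to the end of the input.
    let offset = offset.min(source.len());
    let prefix = &source.as_bytes()[..offset];

    // The line number is simply the number of `\n` bytes before the offset.
    let line = prefix.iter().filter(|&&b| b == b'\n').count();

    // The current line starts right after the last `\n`, or at 0 if none.
    let line_start = prefix
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);

    (line, offset - line_start)
}
```

With this definition a bare \r, LS or PS never starts a new line, so an input containing them keeps the line numbers an LF-counting editor would show.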
          }
-         Self { utf8, utf16, char }
+         result
      }
  }

  impl From<&String> for TextIndex {
It's fine for convenience but I think we should probably get rid of this impl altogether; the only reason we have it is because we don't autoderef String through Rc to &str; see #964
I think we shouldn't do that in this PR
pub fn advance(&mut self, c: char, next: Option<&char>) {
    self.utf8 += c.len_utf8();
    self.utf16 += c.len_utf16();
    if c == '\n'
style nitpick: This could probably be rewritten for more clarity as
match (c, next) {
    ('\n', _) => ...,
    ('\r', next) if next != Some(&'\n') => ...,
    _ => ...
}
I note, ironically, that the code you suggest hides an error: '\r' followed by '\n' shouldn't advance the column. So even though I prefer match and your solution, I'm not sure that it is clearer.
My bad, this was written quickly. Something like this, then:
match (c, next) {
    ('\n' | '\u{2028}' | '\u{2029}', _) => ...,
    ('\r', Some('\n')) => {/* Ignore for now, we will already advance the column once we skip to \n, per above */},
    ('\r', _) => ...,
    _ => ...
}
which should be more clear; the nested condition in the \r case was subtle enough that I didn't completely get it while glancing at the code, whereas this should be clearer with the comment etc.
Also see #918 (comment); there's a chance that simply accounting for \n is the pragmatic choice here.
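A filled-in version of the match-based suggestion might look roughly like the sketch below. It is only an illustration under stated assumptions: the TextIndex here is a simplified stand-in, the line and column fields, the character-based column and the index_after helper are hypothetical, and the choice of terminators (LF, bare CR, CRLF, LS, PS) follows the Solidity-definition side of the discussion rather than the LF-only alternative.

```rust
/// Hypothetical, simplified stand-in for the `TextIndex` under review.
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq)]
pub struct TextIndex {
    pub utf8: usize,
    pub utf16: usize,
    pub line: usize,
    pub column: usize,
}

impl TextIndex {
    /// Advances past `c`; `next` is the character following `c`, if any.
    pub fn advance(&mut self, c: char, next: Option<&char>) {
        self.utf8 += c.len_utf8();
        self.utf16 += c.len_utf16();
        match (c, next.copied()) {
            // First half of a CRLF pair: leave line and column alone,
            // the `\n` handled on the next call ends the line.
            ('\r', Some('\n')) => {}
            // LF, bare CR, LS and PS each end the current line.
            ('\n' | '\r' | '\u{2028}' | '\u{2029}', _) => {
                self.line += 1;
                self.column = 0;
            }
            // Any other character just moves one column to the right.
            _ => self.column += 1,
        }
    }
}

/// Hypothetical helper: the index reached after consuming all of `text`.
pub fn index_after(text: &str) -> TextIndex {
    let mut index = TextIndex::default();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        index.advance(c, chars.peek());
    }
    index
}
```

Putting the CRLF arm first keeps the subtle case explicit, which is the readability point made in the comment above.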
  use serde::Serialize;

- #[derive(Default, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
+ #[derive(Default, Hash, Copy, Clone, PartialEq, Eq, Debug, Serialize)]
  pub struct TextIndex {
I'm guessing we're bundling everything into a single type for convenience, lack of perf numbers and velocity, but I imagine it will be useful to keep just the existing TextIndex, which allows quickly slicing into the contents buffer without always having to track/calculate the column (we might emulate this with line: 0, but that's not technically correct).
Do you think we should have a separate TextPosition that might bundle the existing/hypothetical TextIndex with the line?
This is a decision dependent on perf analysis IMO.
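To make the proposed split concrete, a rough sketch could look like this. The utf8, utf16 and char fields mirror the TextIndex shown in the diff; TextPosition, its line and column fields, and the slice helper are hypothetical and only illustrate the idea, not a decided design.

```rust
/// Offset-only index, mirroring the fields of the existing `TextIndex`.
#[derive(Default, Hash, Copy, Clone, PartialEq, Eq, Debug)]
pub struct TextIndex {
    pub utf8: usize,
    pub utf16: usize,
    pub char: usize,
}

impl TextIndex {
    /// Advances the offsets past `c` without any line tracking.
    pub fn advance(&mut self, c: char) {
        self.utf8 += c.len_utf8();
        self.utf16 += c.len_utf16();
        self.char += 1;
    }
}

/// Hypothetical richer type that bundles the index with line information,
/// for callers that actually need it.
#[derive(Default, Hash, Copy, Clone, PartialEq, Eq, Debug)]
pub struct TextPosition {
    pub index: TextIndex,
    pub line: usize,
    pub column: usize,
}

/// Hypothetical helper: slicing only needs the UTF-8 byte offsets.
pub fn slice(source: &str, start: TextIndex, end: TextIndex) -> &str {
    &source[start.utf8..end.utf8]
}
```

Whether the extra type pays for itself is the perf question raised in the reply above.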
    }

    /// Returns a slice of the children (not all descendants) of this node.
    pub fn children(&self) -> &[Edge<T>] {
style/general nit: Rather than having two functions with different names on the same concrete type, I think it'd be cleaner to:
- keep the more obvious terms, or those related more immediately to the concrete type, like children (since this is a parent/child 'contains' relationship rather than a general graph that happens to be a tree)
- provide a separate, more abstract trait, which this concrete type would implement, that might provide more abstract functionality like edges.
This way, one could use the higher-level abstraction via the traits if needed (I imagine we will want to define that for subsequent tree/node-traversable types), but anyone using a concrete type would use the immediate/more appropriate name.
We should probably expose labeled_children as well, if we aim to keep the symmetry here.
It should be edge in any case ... children should be the nodes at the end of all edges from this node.
This feels like a drive-by. We have two other issues looking at this (renaming, api review)
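For readers following the naming discussion, the trait-based layering proposed at the start of this thread could be sketched as below. Everything here is hypothetical: Node and Edge are simplified stand-ins without the KindTypes parameter, and the Traversable trait and count_edges function are names invented for the example.

```rust
/// Simplified stand-in for an edge: an optional label plus the target node.
pub struct Edge {
    pub label: Option<&'static str>,
    pub node: Node,
}

/// Simplified stand-in for a CST node.
pub struct Node {
    pub kind: &'static str,
    children: Vec<Edge>,
}

impl Node {
    pub fn new(kind: &'static str, children: Vec<Edge>) -> Self {
        Self { kind, children }
    }

    /// Concrete, tree-flavoured name on the concrete type.
    pub fn children(&self) -> &[Edge] {
        &self.children
    }
}

/// Hypothetical abstract trait offering the graph-flavoured name.
pub trait Traversable {
    type Edge;
    fn edges(&self) -> &[Self::Edge];
}

impl Traversable for Node {
    type Edge = Edge;

    fn edges(&self) -> &[Edge] {
        self.children()
    }
}

/// Generic code depends only on the trait, not on `Node`.
pub fn count_edges<N: Traversable>(node: &N) -> usize {
    node.edges().len()
}
```

Callers holding a Node would use children, while generic graph code would go through edges; the sketch does not settle the disagreement above about which name is correct for the slice of edges.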
@@ -62,21 +90,43 @@ impl<T: KindTypes> Node<T> {
        Self::Token(Rc::new(TerminalNode { kind, text }))
    }

    pub fn id(&self) -> usize {
We should probably add a comment stating that it's supposed to be an opaque identifier that is not guaranteed to be stable across different versions; I know we were discussing potentially using an abstract NodeId here rather than the pointer we have now.
I'm not sure about the persistence semantics of this id. Should it be the same if you parse the same source twice for example?
It shouldn't, but it's important not to make any assumptions about the values: now we'll have 0x80001256189, at some point we might switch to lower IDs like 1, 2, 3, so I think it's worth clarifying this intent so that callers don't get any funny ideas about bit-packing it themselves at some point etc.
I added a comment
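One way to express the "opaque identifier" intent in the type system is a newtype around the pointer value, sketched below. This assumes the id is derived from the Rc allocation address, as the mention of "the pointer we have now" suggests; NodeId, the simplified TerminalNode and the node_id function are hypothetical names for the example.

```rust
use std::rc::Rc;

/// Hypothetical opaque identifier. The wrapped value is private, so callers
/// can compare and hash ids but cannot inspect or bit-pack the raw number.
/// It is not stable across parses, runs or library versions.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct NodeId(usize);

/// Simplified stand-in for a terminal node.
pub struct TerminalNode {
    pub text: String,
}

/// Derives the id from the address of the shared allocation (assumption).
pub fn node_id(node: &Rc<TerminalNode>) -> NodeId {
    NodeId(Rc::as_ptr(node) as usize)
}
```

Two handles to the same allocation get equal ids, while parsing the same source twice yields different allocations and therefore different ids, which matches the answer given above.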
@@ -0,0 +1,37 @@
{
I wonder if we need the vscode extension here? I think we discussed removing/not including it?
I am intending to enable it
Includes metaslang_graph_builder aka tree-sitter-graph.
Includes some WIP e.g. tests that will be integrated into testlang once this is accepted.