Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree-sitter rolling fixes: 1.121 edition #1085

Merged
merged 28 commits into from
Sep 17, 2024
Merged

Conversation

savetheclocktower
Copy link
Contributor

@savetheclocktower savetheclocktower commented Aug 23, 2024

This one starts off big — a sizable diff that probably overstates the scope of the underlying refactor.

Indentation logic is complex. It's not as complex as it seems, but it's still pretty complex, and I haven't exactly helped the issue because I haven't found an intuitive way to explain it. The first commit in this PR represents the beginning of an effort to demystify indentation queries and make them easier for others to reason about.

The main initial benefit is that indentation logic is now encapsulated in its own IndentResolver class. Unlike FoldResolver — a different instance of which exists for each LanguageLayer — a WASMTreeSitterLanguageMode instance has a single instance of IndentResolver. That's because the logic of indentation hinting crosses LanguageLayers; we don't gain anything by subdividing the labor further.

But the encapsulation itself is useful. It lets us write new methods to reduce repetition without increasing the API surface of WASMTreeSitterLanguageMode.

There are two other ideas that I've added in the process:

  • The @match capture is good, but a recent bug report suggested to me that we could benefit from a similar kind of capture that works in the first phase of indentation hinting. So I created @match.next; it's a different approach to phase-one hinting from what you get with @indent/@dedent/@dedent.next captures.

    “Phase one” of indentation hinting is where we determine what the “baseline” indentation is for line X before its own content is considered; we look at the content of line X-1 and decide whether line X should start with the same indentation level, one level more, or one level less. I'll illustrate its usefulness below — much easier when I'm not in an unordered list.

  • For a long time I've wanted to make the indentation decision process more visible. It's not something anyone needs to care about… until they're writing a modern Tree-sitter grammar or trying to figure out why a grammar someone else wrote isn't making the right indentation decision. When I've found myself in this position, I've had no choice but to open dev tools and set some debugger breakpoints.

    Now I'm experimenting with a different approach: an obscure API that, once an indentation decision is made, fires an event with metadata that could help a user understand the logic behind an indentation decision. The IndentResolver::onDidSuggestIndent method accepts a callback and provides an object with a bunch of properties. It's still a bit intimidating to consume directly, but my eventual goal is to write a community package that can interpret the data (or else add such functionality to tree-sitter-tools).

About @match.next

The impetus for @match.next was this example:

function foo() {
  let event = initializeCustomEvent(0, 2, 3, 4, 5, 'event', null, null,
                                                         undefined);
  return event;
}

This is a contrived example, but the idea is that a user should be able to insert whatever sort of hanging indent they want when they're spreading a statement over multiple lines. Since we have Tree-sitter, we should be able to understand that line 4 shouldn't keep the deep indentation of line 3; it should match the indentation level of line 2!

I didn't think about this until a user reported an issue with hanging indents in C. I came up with a fix that's most of the way to what I want, but it uses a @match capture, so operates in the second phase of indentation hinting — the phase that considers the content on the current line. For this reason, it doesn't let us correct the indentation until after the user starts typing.

With @match.next, that fix looks a bit different:

(
  [
    (expression_statement)
    (return_statement)
    (continue_statement)
    (break_statement)
    (throw_statement)
    (debugger_statement)
    (lexical_declaration)
  ] @match.next
  (#is? indent.matchesComparisonRow endPosition)
  (#set! indent.matchIndentOf startPosition)
)

I'm still playing around with this, and I'm only 96% sure it's a good idea. But this could be expressed as follows: “the presence of expression_statement on row X-1 suggests that row X should start with the same level of indentation as that of the row where expression_statement started.” (Also true for return_statement and all other node types within the square brackets.)

Hence in the example above:

  1. The user is typing the end of line 3 and presses return.
  2. Line 3 represents the end of an expression_statement, so it produces a @match.next capture.
  3. Under certain circumstances, we could reject that capture; for instance, if the expression_statement didn't end on row 3. But the indent.matchesComparisonRow test passes, so we proceed.
  4. suggestedIndentForBufferRow interprets the @match.next capture by using the indent.matchIndentOf predicate to figure out which line's indentation should be copied. In this case, it uses startPosition — i.e., the starting position of the expression_statement node.
  5. The cursor starts out at one level of indentation because that's what we had on line 2.

I still sense that I'm not explaining this well enough. I'm not certain that these captures have the right names; one further advantage of the IndentResolver encapsulation is that it would make it a bit easier to introduce new aliases for these captures and predicates while preserving backward compatibility. (For instance, indent.matchIndentOf is now aliased to the terser indent.match, and indent.offsetIndent is aliased to indent.offset.)

Another reason to introduce this new encapsulation is to give us a place to hang new indentation-specific query tests. In order to implement what I described above, I needed a way to introduce the concepts of “current row” and “comparison row” to query tests, and those are indentation-specific ideas. (The current row is the row whose suggested indentation we're determining; the comparison row is the one we're using as reference; typically it's the row just above the current row, but we tend to skip over whitespace-only rows.)

Anyway, if any of this doesn't make sense, and you want to understand it better, please ask! In the process of answering your questions I hope I find ways to make this subject less intimidating.


In other news: a fix I made for tree-sitter-css months ago has been accepted, so I've also bumped our tree-sitter-css to the latest release.


Changelog

  • Updated web-tree-sitter to version 0.23.0.
  • [language-css] Updated tree-sitter-css to the latest version.
  • [language-gfm] Updated tree-sitter-markdown to the latest version.
  • [language-html] Updated tree-sitter-html and tree-sitter-embedded-template to their latest versions.
  • [language-javascript] Updated tree-sitter-javascript to the latest version.
  • [language-typescript] Updated tree-sitter-typescript to the latest version.
  • Added a new @match.next capture for advanced control of how indentation should change from one line to the next.
  • Added a new event that is emitted when Pulsar makes an indentation decision, with the ultimate goal of making indentation hinting easier to understand for grammar authors. The metadata on the event is preliminary and subject to change. (Maybe I'll mention this somewhere, but probably not in the changelog, since it's not part of a public API yet.)
  • Added new indentation-specific query predicates indent.matchesComparisonRow and indent.matchesCurrentRow for comparing arbitrary positions in a Tree-sitter node tree to the operative rows in an indentation suggestion query. Makes it possible to say things like “decrease the indent on line 10 if a statement ends on line 9.”
  • Renamed indentation directives indent.matchIndentOf and indent.offsetIndent to indent.match and indent.offset, respectively. The old names still work as aliases.

…and out of `WASMTreeSitterLanguageMode` for reasons of encapsulation.
Fixes issue with parsing of selectors in `:has`, `:is`, and other pseudoclasses.
@savetheclocktower
Copy link
Contributor Author

savetheclocktower commented Aug 23, 2024

I should add that the new indentation functionality doesn't have any tests yet, but neither does it change how existing stuff behaves, so I expect all tests to pass. If I don't manage to write tests for this stuff in the next three weeks, I might take some small steps to hide the new functionality, or at least mark it as obviously experimental and not something that should be relied upon.

(Update: we now have a spec for @match.next, so I consider this stuff to be stable enough to ship. At this point, any further changes that I make to how @match.next behaves can be done in a backwards-compatible way; so all that's left to do is document it, which I plan to do soon.)

…including more consistent firing of the `did-suggest-indent` event.
…and fix a bug with the indentation reparse budget.
…as booleans. Handle `NULL` in a similar way.
…and update our API usages.

Tree-sitter harmonized the API differences between `web-tree-sitter` and
`node-tree-sitter` in version 0.22.0. This is the first time we’ve had to deal
with that. Luckily, the changes were mostly palatable.

The biggest API difference is in `Query#captures`; two positional arguments for
defining the extent of the query have been moved to keyword arguments. We’ve
updated our internal usages, but any community packages that relied on the old
function signature would break if we didn’t do anything about it. So we’ve
wrapped the `Query#captures` method in one of our own; it detects usages that
expect the old signature and rearranges their arguments, issuing a deprecation
warning in the process. Hopefully this generates enough noise that any such
packages understand what’s going on and can update.

Other API changes are more obscure — which is good, because we can’t wrap them
the way we wrapped `Query#captures`. They involve conversion of functions to
getters (`node.hasErrors` instead of `node.hasErrors()`), and there’s no good
way to make both usages work… short of wrapping nodes in `Proxy` objects, and
that’s not on the table.

Since lots has changed in `tree-sitter` since we last upgraded
`web-tree-sitter`, I updated our documentation about building a custom version
of `web-tree-sitter`.
@savetheclocktower
Copy link
Contributor Author

Just bumped our web-tree-sitter to 0.23.0. This is a big bump and I'd been putting it off because of disruptive changes to the Tree-sitter ecosystem; but the short version is that everything's cool.

Immediately after I bumped web-tree-sitter locally, I reloaded my Pulsar window and found that everything was heartbreakingly slow. Profiling had me chase down a couple of red herrings before I discovered that the problem had to do with API changes in web-tree-sitter. This happened a while back for a good reason: the desire to eliminate differences between the web-tree-sitter API and the node-tree-sitter API.

The slowness had to do with the fact that every query against the tree was accidentally being executed against the entire tree rather than against the specified range. (Even this was fast on the web-tree-sitter side; the slowness happened in our post-processing phase, the one in which we iterate through all the captures.) Once I updated our usages of Query#captures to use its new signature, the problem went away. (One of the red herrings was still a good idea: a hot code path was allocating lots of unnecessary arrays, so I rewrote it to use a different approach.)

We can monkeypatch Query#captures to work with both the old function signature and the new one, so that's exactly what I did. The purpose is not so much to save us from rewriting our usages (since I've done that; it's in this PR) but to save the editor from throwing an error if a community package expects the old function signature. (On that note, I've also updated tree-sitter-tools to be more defensive and work equally well with the old API and the new API.) We'll issue a deprecation warning if the old function signature is detected.

Everything went well locally, including specs, so I'll watch the CI carefully just in case there's something really obscure that regressed.

@savetheclocktower savetheclocktower marked this pull request as ready for review September 10, 2024 18:17
@savetheclocktower
Copy link
Contributor Author

Enough last-minute changes; time to take this out of draft so that I stop clipping at the edges like it's a bonsai tree.

…for `tree-sitter-html` and `tree-sitter-embedded-template`.

Fixed an inscrutable out-of-memory error I was getting in an EJS file.
…related to updating open documents’ syntax highlighting as grammars are processed.
@savetheclocktower
Copy link
Contributor Author

Just kidding! I ran into a strange WASM memory error when working on an EJS document, and I couldn't figure out why it was happening, so eventually I decided “might as well update the relevant Tree-sitter parsers” and that solved it.

Along the way I discovered another subtle bug, so that's also been dealt with.

Copy link
Member

@confused-Techie confused-Techie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without diving too deep into the code, with all tests passing and the history of these bumps, I'm confident in merging this one.

But love to see lots of little changes in seemingly every grammar, it's quite impressive tbh

Comment on lines +208 to +210
// We do this because folds aggregate, so we can always either (a) delegate
// a job to an arbitrary layer's `FoldResolver`, or (b) ask _all_ layers to
// do something and then assemble the results.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick comment that this paragraph is not terribly clear to someone (me, lol) unfamiliar with Tree-sitter nuts and bolts.

It's a little unclear to me how the "aggregating of folds" leads to the ability to "delegate a job to an arbitrary layer's [resolver]" or "ask all layers to do something and then assemble the results." Related to this: I have some questions: In this case, what is "something" as in "do something"? Can we do fairly arbitrary stuff, for example related to indenting or highlighting, in the Domain Specific Language Tree-sitter provides (seems like Tree-sitter grammars are in a Tree-sitter-specific DSL??) or is it stuff some arbitrary JavaScript consumer (community package) can do when getting information back from... Tree-sitter? The grammar itself? (Not sure how this works, TBH). Who (what code, what consumer) is "doing something" and who is "assembling the results" and what becomes of those "[assembled] results"?

Probably too many questions, but the type of questions someone (it's me again, lol) who doesn't already know the context in which these terms are being used might ask you if they were trying to dig into it for the first time.

One last thing, "aggregate" (verb) might be a precise word choice in this situation, but feels like it's at a high reading level without providing me clarity/insight, from the perspective of not already knowing what this actually means. Consider easier or lower-reading-level wording? If there are multiple sub-concepts being stated tersely in one go, consider unpacking them out? I take it to mean folds "combine" in a sense that either the result of the folding (decision??) is combined with the result of another fold (decision??) or perhaps that (if I recall correctly) there can be only one fold per line, so any fold reason is effectively more or less equivalent to there being any other fold reason on a given line... ??? But unclear where/when this "aggregation" is happening that we're referring to. Is this a code short-circuit quick path that saves us computing time, without downside to consumers/end-users (they don't have to think about it)? Does an end-user or consumer have to know the internals of this short-circuiting/aggregating, to predict the behaviors they would be interested in as a user/consumer?

Sorry to deep-dive, but I did want to give some feedback as to the readability of the comment. I like to praise comments, but here I'll giv a slightly hard time as you appear to be asking for this type of feedback in the PR body. So I'm letting 'er rip! Hope this is the kind of feedback you were hoping for.

Best,
-DeeDeeG

P.S. still reviewing, just wanted to finish this comment and post it separately since it turned out so long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear to me how the "aggregating of folds" leads to the ability to "delegate a job to an arbitrary layer's [resolver]" or "ask all layers to do something and then assemble the results."

Yeah, I struggle to explain this stuff sometimes.

A buffer consists of anywhere from 1 to Infinity “language layers.” An HTML file has a “root” layer for parsing/highlighting HTML. If the file has a single SCRIPT block with inline JS, that’s a second layer. If it also has a STYLE block with inline CSS, that’s a third. And so on.

If we want to get all the folds in a buffer, we ask all the language layers, and then we assemble their results. By and large, we just aggregate what each layer tells us. That’s why each layer has a FoldResolver. (I'm simplifying, but probably less than you think!)

By contrast, if we want to figure out the proper indentation level after a newline is inserted, that’s not something we find out by asking all the layers and aggregating the results. It’s a two-phase process, and in each phase we find one layer to ask. The layer is not guaranteed to be the same in both phases.

So it’s not a task for which we gain any simplicity by splitting up the work among different layers. That’s what I was trying to say. The task itself requires being aware of multiple language layers, so we'd actually be complicating the task if we gave each layer an IndentResolver.

Related to this: I have some questions: In this case, what is "something" as in "do something"? Can we do fairly arbitrary stuff, for example related to indenting or highlighting, in the Domain Specific Language Tree-sitter provides (seems like Tree-sitter grammars are in a Tree-sitter-specific DSL??) or is it stuff some arbitrary JavaScript consumer (community package) can do when getting information back from... Tree-sitter? The grammar itself? (Not sure how this works, TBH). Who (what code, what consumer) is "doing something" and who is "assembling the results" and what becomes of those "[assembled] results"?

Tree-sitter parses our code into a syntax tree. Highlighting, indentation, code folding, and symbol listing are all tasks that we can do by querying that tree. The query language is that Lisp-y stuff you see in the contents of .scm files.

If we found another useful way to query the tree and make users’ lives easier, we could do it. If a community package wanted to query the tree for its own purposes, it could; the API’s not very public or documented, but I’ve purposefully left some extension points.

The terminology is confusing because Tree-sitter also uses the word “grammar,” but I’ll use the word “parser” instead:

  • A language layer has a single WASMTreeSitterGrammar. That’s a Pulsar-specific thing.
  • A single WASMTreeSitterGrammar has a Language (a representation of a parser in web-tree-sitter) and some number of Query instances (also a web-tree-sitter interface) — one for highlights, one for folds, etc.
  • Query instances are created via (IIRC) Language::query; you give it a string with some query syntax (the contents of an SCM file).
  • You can use that Query instance to get matches via Query::captures(tree, options), where tree is a Tree-sitter tree and options can include things like an (optional) range of the buffer to query.
  • We get back results really fast and can do whatever we want with them. The results include node metadata (node name, text contents, buffer range) and capture metadata (capture name, any asserted/refuted properties — i.e., the scope tests we use to winnow results).
  • A community package could use WASMTreeSitterGrammar::createQuery to pass in some Tree-sitter query syntax and get a Query instance in return. To get a tree to query against, they can use some APIs on WASMTreeSitterLanguageMode to grab a given buffer's tree at “safe” points (i.e., the ends of buffer transactions). They can either analyze the tree immediately (since it won't be around indefinitely!) or copy it so that they can analyze the copy in their own time.

Sorry to deep-dive, but I did want to give some feedback as to the readability of the comment. I like to praise comments, but here I'll giv a slightly hard time as you appear to be asking for this type of feedback in the PR body. So I'm letting 'er rip! Hope this is the kind of feedback you were hoping for.

Yeah, all of that is fair. I’ll probably revise this comment in the future. Code comments don’t provide a clear “entry point” into this knowledge, so it’s hard to know how much knowledge to assume at a given location.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me when writing this up, that some sort of gentle ramp up the learning curve, in the form of docs, might be the right way to do it.

If we have some "2 + 2 = 4" fundamentals as the "easiest/lowest-assumptions-made" level of that on-ramp, that'd be great for me, heh. But whatever adds context that's essential for being comfortable with this stuff but might be overwhelming as code comments. (Even if it's linking to some Tree-sitter project docs that they have, if any.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tapping out reading all of this refactor line by line before review.

I can appreciate the motivation given in the PR body as for why this is being done. So, I will take some of the content on trust, it is simply too large to review closely in a timely manner, IMO. But hopefully this brings benefits as laid out in the PR body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm angry at the diff; it really does overstate the change. I took three or four methods' worth of logic and put them in a new class at the bottom. The diff makes it seem like I changed this entire file!

Copy link
Member

@DeeDeeG DeeDeeG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments about code comments. (How meta? Not that kind of Meta, though.)

src/wasm-tree-sitter-language-mode.js Outdated Show resolved Hide resolved
Comment on lines 1459 to 1461

let indentCapturePosition = null;
let indentDelta = 0;
let dedentNextDelta = 0;
// DEPRECATED

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a blank line above and below this line reading // DEPRECATED. Is the code above or below the // DEPRECATED line deprecated, or both, or? Somewhat ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two methods below that line are what is deprecated. They're a very TextMate-like way of envisioning the buffer; they're there because some community packages (and probably some builtin packages) expect them to be there.

Comment on lines +5026 to +5032
// Look up an `indent.` capture property applied with a `#set!` directive,
// optionally coercing to a specified type or falling back to a default
// value.
//
// `names` can be an array in cases where the property may have several
// aliases. The first one that exists will be returned. (Omit the leading
// `indent.` when passing property names.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capture property applied with a #set! directive, [ . . . ]

Is this a verb "capture" or a noun "capture"? If a verb: Who captures the property? And dare I ask: why? Or if it's a noun "capture": is this telling us a capture property was applied with a #set! directive, and how does type coercion or fallbacks play into all that?

https://i.kym-cdn.com/photos/images/newsfeed/002/029/841/a26.png

Sorry, just surfacing some of my specific confusion areas around this subject, as I see it was requested in the PR body to ask clarifying questions if I don't understand something. And joking around a bit, I hope you don't mind the humor. (See PNG above.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the following line adds quite a bit of context, just from the function name and argument names alone, but still. I'm not sure I totally understand.

getProperty(capture, names, coercion = null, fallback = null) {

Copy link
Contributor Author

@savetheclocktower savetheclocktower Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this telling us a capture property was applied with a #set! directive, and how does type coercion or fallbacks play into all that?

It's a noun capture! Capture properties are things like the #set! clause in

((foo) @bar
 (#set! indent.offset 1))

Despite the fact that it doesn't require quotes, the 1 in this example will be returned to us as a string. It's all strings to Tree-sitter. Hence the coercion.

We also have indent.force (which should be coerced to a boolean), and indent.match) (which should stay a string). So three different data types — hence the coercion.

And joking around a bit, I hope you don't mind the humor. (See PNG above.)

STOP DOING HUMOR

  • words were not intended to be spoken deceitfully
  • YEARS of satire yet all we ever do is get MORE SARCASTIC
  • “An escalator can never break; it can only Become Stairs. I spilled spot remover on my dog; he’s gone now.” — Statements dreamed up by the Utterly Deranged

“Why did the chicken cross the road???“

they have played us for absolute fools

Copy link
Member

@DeeDeeG DeeDeeG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does seem potentially a lot smaller than it looks, assuming the diff for src/wasm-tree-sitter-language-mode.js is in fact mostly just refactoring existing code into a slightly different shape/format. (And should provide the opportunity to do some good things, so, worth it, it sounds like.)

Bumping various third-party dependencies for the grammars should be a good thing.

The tree-sitter-web related changes seem reasonable as explained.

I'm once again not in much of a position to deeply review the content, but I am trying to keep abreast of changes at the very least.

Nothing stands out at me that should need changes, to the extent I can review, at least without doing a ton of background learning that seems rather involved to catch up on first.

Anyway, 👍 from me as well.

@savetheclocktower savetheclocktower merged commit 836c348 into master Sep 17, 2024
104 checks passed
@savetheclocktower savetheclocktower deleted the tree-sitter-august branch September 17, 2024 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants