[Exploration] Tree-sitter tokenization exploration (Fixes #161256) #161479

Closed
wants to merge 64 commits into from

Contributor

@aiday-mar aiday-mar commented Sep 22, 2022

Notes about the draft PR

Feature request for the issue #161256.

  • There are now two actions that can be toggled from the command palette: Toggle Synchronous Tree-Sitter Colorization and Toggle Asynchronous Tree-Sitter Colorization. They differ only in the value of the asynchronous boolean passed as a parameter when the tree-sitter colorization trees are instantiated. This boolean determines whether the subsequent tree-parsing operation is synchronous or asynchronous. The colorization operation is always synchronous. These actions log to the console the execution time of the parsing, querying and colorization operations, as well as the number of calls to the corresponding asynchronous methods. They were used to obtain the performance measurements below.
  • The tests fail because, for some reason, the wrong bindings file is imported from the web-tree-sitter library: the C bindings are imported instead of the TypeScript bindings. For the moment the tests are skipped with test.skip().
  • The file treeSitterService.ts contains a function getTreeSitterTree(). Its sole purpose is to retrieve the tree-sitter tree for testing (see the testing file). Similarly, the file treeSitterTree.ts contains a function parseTreeAndCountCalls(), which is used only in the testing file to count the number of calls to the _parseTree() function. When the boolean asynchronous is set to true, _parseTree() first tries to parse the tree synchronously and, if this fails because of a timeout, parses the tree asynchronously. When the boolean is false, it always parses the tree synchronously.
  • The treeSitterColorizationQueries.scm file contains the queries needed for colorizing the tokens in the editor. The names of the capture groups are mapped to the innermost TextMate scope names.
  • In the file colorThemeData.ts, the method getTokenColorIndex() was made public in order to be able to perform the colorization. For the same reason, in the file contiguousMultilineTokens.ts there is a new setter method for the _startLineNumber member.
  • All exploratory work previously done for the folding model was removed as this was not at an advanced enough stage to add into this PR.
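
The synchronous-first, asynchronous-fallback behavior described for _parseTree() can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: TimeBudgetParser, tryParse, parseWithFallback and the 10 ms default budget are hypothetical, and the asynchronous path is assumed to resume parsing in time slices while yielding to the event loop between them.

```typescript
// Hypothetical interface (not from the PR): a parser that gives up when its
// time budget runs out and resumes where it stopped on the next call.
interface TimeBudgetParser<T> {
  tryParse(budgetMs: number): T | undefined;
}

interface ParseResult<T> {
  tree: T;
  calls: number; // how many times tryParse was invoked
  usedAsyncPath: boolean; // true if the first synchronous attempt timed out
}

// Yield to the event loop so other work can run between parse slices.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

async function parseWithFallback<T>(
  parser: TimeBudgetParser<T>,
  asynchronous: boolean,
  budgetMs = 10, // assumed slice length, chosen only for illustration
): Promise<ParseResult<T>> {
  let calls = 1;
  // With asynchronous === false there is no budget: the parse runs to completion.
  const first = parser.tryParse(
    asynchronous ? budgetMs : Number.POSITIVE_INFINITY,
  );
  if (first !== undefined) {
    return { tree: first, calls, usedAsyncPath: false };
  }
  if (!asynchronous) {
    throw new Error("synchronous parse did not produce a tree");
  }
  // The synchronous attempt timed out: continue in slices, yielding in between.
  for (;;) {
    await yieldToEventLoop();
    calls++;
    const tree = parser.tryParse(budgetMs);
    if (tree !== undefined) {
      return { tree, calls, usedAsyncPath: true };
    }
  }
}
```

Returning the call count alongside the tree mirrors what parseTreeAndCountCalls() is described as doing for the tests, without claiming to match its actual signature.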

Comparison of the current tokenization/colorization system with the tree-sitter exploratory implementation

memory

  • tree-sitter-typescript.wasm file is 1300 kB
  • web-tree-sitter library when unpacked has size 246 kB
  • TODO: Tried to use webpack-bundle-analyzer (among others) to measure the effect of the web-tree-sitter library on the bundle size. An attempt to display the data from stats.json shows an empty visualization.

performance

To get the results below, a new boolean parameter asynchronous was created which controls whether the tree-parsing operation is synchronous or asynchronous. The colorization (set tokens operation) is always synchronous. Both the synchronous and the asynchronous action described previously were toggled 3 times on each of 6 different files (from the TypeScript repo), and the average and median are shown in the tables below. Some clarifications about the data:

  • The first table shows the output when running the Force Retokenize command from the command palette.
  • Time to parse tree: Corresponds to the time needed to parse the tree before querying it. Since the library is optimized so that upon text edits it reparses the tree using the old tree, the only relevant measurement is for the initial tree parsing (when there is no old tree).
  • Time to get query captures: Corresponds to the time needed to retrieve the query captures which are used for the colorization. Initially the captures corresponding to the entire file are fetched. Following this, only the captures corresponding to the lines on, or below, the first line changed in a text edit, are retrieved.
  • Time to set tokens: Corresponds to the time needed to colorize the scopes and set the contiguous multiline tokens.
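
As a rough sketch of the two steps described above (restricting captures to the lines on or below the first changed line, then mapping capture names to TextMate scopes), one might write something like the following. The Capture shape, the helper names and the entries of captureToScope are assumptions made for illustration; the capture names actually used in treeSitterColorizationQueries.scm are not reproduced here.

```typescript
// Position-tagged capture, roughly as a query over the syntax tree might
// return it. The field names are assumptions for this sketch.
interface Capture {
  name: string; // capture group name from the .scm file, e.g. "comment"
  startRow: number; // zero-based line where the captured node starts
  endRow: number; // zero-based line where the captured node ends
  startColumn: number;
}

interface ScopedToken {
  row: number;
  startColumn: number;
  scope: string; // innermost TextMate scope name
}

// Illustrative mapping only (hypothetical entries, not the PR's mapping).
const captureToScope: ReadonlyMap<string, string> = new Map([
  ["comment", "comment.line.double-slash.ts"],
  ["keyword", "keyword.control.ts"],
  ["string", "string.quoted.double.ts"],
]);

// Keep only captures that touch the first changed line or any line below it.
function capturesFromLine(
  captures: readonly Capture[],
  firstChangedRow: number,
): Capture[] {
  return captures.filter((c) => c.endRow >= firstChangedRow);
}

// Translate captures into scoped tokens, dropping names that have no mapping.
function toScopedTokens(captures: readonly Capture[]): ScopedToken[] {
  const tokens: ScopedToken[] = [];
  for (const c of captures) {
    const scope = captureToScope.get(c.name);
    if (scope !== undefined) {
      tokens.push({ row: c.startRow, startColumn: c.startColumn, scope });
    }
  }
  return tokens;
}
```

A capture that starts above the edit but ends on or below it is kept here, since its colorization may still be affected by the change.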

Current

| File | Size (kB) | Lines | Force Retokenization (ms) |
| --- | --- | --- | --- |
| perfLogger.ts | 2 | 43 | 7 |
| tsbuild.ts | 6 | 174 | 26 |
| sys.ts | 91 | 1974 | 365 |
| scanner.ts | 151 | 2636 | 1551 |
| emitter.ts | 281 | 5989 | 4643 |
| checker.ts | 2000 | 45861 | 12142 |

Synchronous

| File | Size (kB) | Lines | Parse tree (ms), try 1 | Parse tree (ms), try 2 | Parse tree (ms), try 3 | Query captures (ms), try 1 | Query captures (ms), try 2 | Query captures (ms), try 3 | Set tokens (ms), try 1 | Set tokens (ms), try 2 | Set tokens (ms), try 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| perfLogger.ts | 2 | 43 | 1.20 | 1.19 | 1.10 | 0.29 | 0.19 | 0.39 | 3.89 | 2.60 | 3.40 |
| tsbuild.ts | 6 | 174 | 1.10 | 1.89 | 1.79 | 0.39 | 0.5 | 0.29 | 5.9 | 8.4 | 6.9 |
| sys.ts | 91 | 1974 | 24.4 | 23.3 | 31 | 11.1 | 10 | 7.9 | 72.9 | 61.3 | 55.2 |
| scanner.ts | 151 | 2636 | 49.1 | 45.6 | 48.7 | 23.1 | 30.1 | 19.8 | 184.3 | 147.6 | 141.6 |
| emitter.ts | 281 | 5989 | 63.2 | 62.4 | 64.8 | 33.2 | 24.7 | 24.3 | 177.6 | 159.2 | 160.5 |
| checker.ts | 2000 | 45861 | 542.8 | 521.9 | 586.4 | 243.7 | 236.2 | 271.9 | 1505 | 1413 | 1816 |

| File | Size (kB) | Lines | Parse tree (ms), average | Parse tree (ms), median | Query captures (ms), average | Query captures (ms), median | Set tokens (ms), average | Set tokens (ms), median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| perfLogger.ts | 2 | 43 | 1.16 | 1.19 | 0.29 | 0.29 | 3.30 | 3.40 |
| tsbuild.ts | 6 | 174 | 1.59 | 1.79 | 0.39 | 0.39 | 7.06 | 6.9 |
| sys.ts | 91 | 1974 | 26.2 | 24.4 | 9.6 | 10 | 63.1 | 61.3 |
| scanner.ts | 151 | 2636 | 47.8 | 48.7 | 24.3 | 23.1 | 157.8 | 147.6 |
| emitter.ts | 281 | 5989 | 63.4 | 63.2 | 27.4 | 24.7 | 165.8 | 160.5 |
| checker.ts | 2000 | 45861 | 550.3 | 542.8 | 250.6 | 243.7 | 1578 | 1505 |
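
The average and median columns summarize the three tries per file. A minimal sketch of that arithmetic, using the three synchronous "set tokens" tries for sys.ts; mean and median are helper names introduced here for illustration and are not taken from the PR.

```typescript
// Arithmetic mean of a non-empty list of measurements.
function mean(values: readonly number[]): number {
  if (values.length === 0) throw new Error("mean of an empty list");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Median: middle value of the sorted list, or the mean of the two middle
// values when the list has an even length.
function median(values: readonly number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// sys.ts, synchronous run, time to set tokens (ms), tries 1 to 3.
const setTokensTries = [72.9, 61.3, 55.2];
console.log(mean(setTokensTries).toFixed(1)); // prints "63.1"
console.log(median(setTokensTries)); // prints 61.3
```

With only three tries per file, the median is simply the middle measurement, which makes it less sensitive than the average to a single slow run.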

[Figure: bar-graph (5)]

Asynchronous

| File | Size (kB) | Lines | Parse tree (ms), try 1 | Parse tree (ms), try 2 | Parse tree (ms), try 3 | Query captures (ms), try 1 | Query captures (ms), try 2 | Query captures (ms), try 3 | Set tokens (ms), try 1 | Set tokens (ms), try 2 | Set tokens (ms), try 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| perfLogger.ts | 2 | 43 | 1.5 | 1.1 | 1 | 0.29 | 0.20 | 0.29 | 2.79 | 2.89 | 2.79 |
| tsbuild.ts | 6 | 174 | 1.29 | 1.69 | 5.59 | 0.79 | 0.5 | 0.39 | 8.19 | 8.59 | 6.60 |
| sys.ts | 91 | 1974 | 32.9 | 34.4 | 30.8 | 8.5 | 8.5 | 8.4 | 77.2 | 96.9 | 64.1 |
| scanner.ts | 151 | 2636 | 62 | 55.7 | 61.6 | 18.9 | 20 | 20.3 | 155 | 154 | 159 |
| emitter.ts | 281 | 5989 | 73.3 | 74.1 | 71 | 24.7 | 25.8 | 25.5 | 183 | 178 | 187 |
| checker.ts | 2000 | 45861 | 661.7 | 641.8 | 647.9 | 249.8 | 268.7 | 239.4 | 1492 | 1515 | 1626 |

| File | Size (kB) | Lines | Parse tree (ms), average | Parse tree (ms), median | Query captures (ms), average | Query captures (ms), median | Set tokens (ms), average | Set tokens (ms), median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| perfLogger.ts | 2 | 43 | 1.2 | 1.1 | 0.26 | 0.29 | 2.82 | 2.79 |
| tsbuild.ts | 6 | 174 | 2.85 | 1.69 | 0.56 | 0.50 | 7.79 | 8.19 |
| sys.ts | 91 | 1974 | 32.6 | 32.9 | 8.5 | 8.5 | 79.4 | 77.2 |
| scanner.ts | 151 | 2636 | 59.8 | 61.6 | 19.7 | 20 | 156 | 155 |
| emitter.ts | 281 | 5989 | 72.8 | 73.3 | 25.3 | 25.5 | 182 | 183 |
| checker.ts | 2000 | 45861 | 650.5 | 647.9 | 252.6 | 249.8 | 1544 | 1515 |

[Figure: bar-graph (7)]

aiday-mar added 30 commits August 31, 2022 16:35
… to also implement a system which keeps track of the changes.
@aiday-mar aiday-mar added this to the On Deck milestone Sep 27, 2022
@aiday-mar aiday-mar changed the title Draft PR for the tree-sitter tokenization exploration Draft PR for the tree-sitter tokenization exploration (Fix #161256) Sep 27, 2022
@aiday-mar aiday-mar changed the title Draft PR for the tree-sitter tokenization exploration (Fix #161256) Draft PR for the tree-sitter tokenization exploration (Fixes #161256) Sep 27, 2022
@jasonwilliams
Contributor

jasonwilliams commented Jan 2, 2023

Hi @aiday-mar, were you planning to continue experimenting with this? Is this something you or @alexdima need help with?

I know there were questions around Tree-sitter's parser ABI versioning. I reached out to Tree-sitter maintainer Andrew Hlynskyi, who responded:

I think you may look into the direction that grammars contributed by extensions can be contributed in a form that every grammar may be statically linked to a compatible Tree-sitter parser C library. Currently stripped version of the libtree-sitter.so.0.0 has a size of 188K, this will make some additional memory footprint but completely resolves an issue with the mentioned LANGUAGE_VERSION. Tree-sitter's parser C API is pretty stable and there are no breaking changes in it for a long time. And it's pretty easy to add such linking functionality right into tree-sitter CLI to cover such a case and automate the process if that would be on demand.

@haikyuu

haikyuu commented Feb 2, 2023

Great work here
I suggest using conventional names in the queries: highlights.scm ...

This way, adding a grammar just requires adding the queries folder predefined in the tree-sitter repo.

@zm-cttae

zm-cttae commented Feb 2, 2023

The work here may have also been channeled into the Anycode extension or vice versa!

@zm-cttae

zm-cttae commented Apr 7, 2023

BTW, GitHub can now render native interactive bar charts :) enjoy

https://github.com/mermaid-js/mermaid#bar-chart-using-gantt-chart-docs---live-editor

@aiday-mar aiday-mar changed the title Draft PR for the tree-sitter tokenization exploration (Fixes #161256) Tree-sitter tokenization exploration (Fixes #161256) Jun 23, 2023
@aiday-mar aiday-mar added exploration and removed feature-request Request for new features or functionality labels Jul 10, 2023
@aiday-mar aiday-mar changed the title Tree-sitter tokenization exploration (Fixes #161256) [Exploration] Tree-sitter tokenization exploration (Fixes #161256) Jul 20, 2023
@aiday-mar
Contributor Author

PR will be closed

@aiday-mar aiday-mar closed this Oct 12, 2023
@zm-cttae

zm-cttae commented Oct 12, 2023

(Sorry any little clarification for us? Did this pull fall off the deck in favour of other features and hotfixes?
Or has the status of the Tree-sitter issue recently changed in core team's view?)

@justinmk

  1. "PR will be closed".
  2. Closes PR.
  3. "PR is now closed".

@aiday-mar
Contributor Author

Hi, thank you for asking @zm-cttae. This PR was made to explore the use of tree-sitter for token colorization in VS Code. Our efforts are currently focused on developing Copilot, and merging this work is not on the roadmap.

@zm-cttae

Okay thank you for the answer & hopefully this work gets revisited 🤞

@github-actions github-actions bot locked and limited conversation to collaborators Nov 26, 2023