Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC: Lazy-load syntaxes to greatly speed up startup time #374

Closed
wants to merge 5 commits into from

Conversation

Enselic
Copy link
Collaborator

@Enselic Enselic commented Oct 1, 2021

Hi! The purpose of this PR is to get early feedback on a crude but mostly functional proof-of-concept that greatly speeds up startup time. It does this by making syntaxes inside a SyntaxSet lazy-loaded.

This code already

  • Passes almost all syntect regression tests (only tests::can_parse_issue219 is failing)
  • Passes almost all bat regression tests
  • Passes almost all bat syntax highlighting regression tests

Would you be willing to consider merging something like this, after it has been turned into production quality code and all regression tests are passing? (You expressed interest for something along these lines in this comment: #340 (comment). You had impressive intuition, btw!)

Performance numbers

Let's compare the performance of doing syncat examples/synhtml.rs:

% hyperfine --export-markdown /dev/stdout -L syncat syncat,syncat-new '{syncat} examples/synhtml.rs'
Command Mean [ms] Min [ms] Max [ms] Relative
syncat examples/synhtml.rs 62.3 ± 2.8 58.1 74.6 3.65 ± 0.30
syncat-new examples/synhtml.rs 17.1 ± 1.2 15.2 20.4 1.00

As you can see, there is a significant speedup! But to really demonstrate the greatness (if I may say so) of this change, let's compare bat performance with and without this syntect improvement. First, on an empty Markdown file:

% touch Empty.md
% hyperfine --export-markdown /dev/stdout -L bat bat-new,bat '{bat}  Empty.md --force-colorization'
Command Mean [ms] Min [ms] Max [ms] Relative
bat-new Empty.md --force-colorization 21.5 ± 1.5 19.4 26.5 1.00
bat Empty.md --force-colorization 110.5 ± 2.1 106.5 114.4 5.14 ± 0.37

Looking good! But the cool thing is, if we embed some rust into the Markdown file, syntect will lazy load (and apply) the Rust syntax too, which you can see by the increased startup time:

% echo '```rust                                                                               
fn main() { println!("hi"); }
```' >> Empty.md
% hyperfine --export-markdown=/dev/stdout 'bat-new Empty.md --force-colorization'
Command Mean [ms] Min [ms] Max [ms] Relative
bat-new Empty.md --force-colorization 37.8 ± 1.5 35.3 42.8 1.00

How the lazy-loading works

  • Instead of storing all Contexts in one big array, we store each Context together with its SyntaxReference
  • We change ContextId to include both an index to a syntax, and then an index to a context within that syntax
  • We serialize and compress the bulk of a SyntaxReference (contexts, contexts map, variables), and only deserialize when it's is needed
  • This is crucial: We DO NOT compress the serialized version of SyntaxSet. There is no need, because the big data is already compressed, and to get fast deserialization of the binary data, it must not be compressed another time.

What about my previous plans?

Up until recently, my plan to improve bat startup performance was by splitting the giant single SyntaxSet up into many smaller pieces. And I managed to make it work well for syntaxes without dependencies. But it turned out to not work all the way, the main two reasons being:

  • There are a LOT of inter-dependencies between syntaxes. It is not possible to split them up into smaller parts without a lot of of duplications (in the order of megabytes)
  • syntect currently does not support building a SyntaxSetBuilder from many small SyntaxSets. It is certainly possible to make it work, but it will be messy. This must work for the custom assets feature of bat.

So I now think this is the way to go. It is (at least what it looks like so far) much simpler and much more efficient.

Looking forward to your (including your co-maintainers) thoughts on all this!

@Enselic
Copy link
Collaborator Author

Enselic commented Oct 5, 2021

Code has now been significantly cleaned up, and all syntect and bat regression tests pass (see Enselic/bat#50). So the code is relatively close to production quality, even though there is still work to be done on it.

However, feel very free to come with any kind of feedback. If you think I am totally on the wrong track it is completely fine to say so.

Maybe you wonder about performance for this code if there are lots of lazy-loaded syntaxes? The answer is that performance is still good. I made a Markdown file that embeds snippets of all its 18 supported languages, and performance is still significantly improved with lazy-loading. If one enables my debug-print, one can see "Syntax X lazy-loaded" being printed as the file is processed, interleaved with highlighted output.

The file for the benchmark below is: https://github.com/Enselic/bat/blob/117d5c0db6e86fd25fa9e823a2ff758fc2a26325/tests/benchmarks/test-src/all_markdown_syntaxes.md

hyperfine -L bat bat,bat-new '{bat} --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md'
Command Mean [ms] Min [ms] Max [ms] Relative
bat --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md 200.5 ± 5.6 194.7 215.4 1.46 ± 0.05
bat-new --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md 137.7 ± 2.1 134.6 142.7 1.00

In short, deserialization time is negligable compared to the time it takes to do the actual highlighting (regex matching etc).

@Enselic Enselic marked this pull request as draft October 9, 2021 18:38
@trishume
Copy link
Owner

Thanks for doing this work! I haven't looked at the code yet but your previous PRs have been high quality and I appreciate the effort into benchmarking and testing. I think this approach to improving loading performance seems like a good one.

My life has been busy lately so I've been procrastinating on looking at PRs, although I expect it to get a bit less busy soon. I am in fact more likely to review promptly if you break it into smaller pieces, but this is currently at only a couple hundred lines of delta, which wouldn't be bad to review as one chunk. Just open things up for me to review and I'll hopefully get around to them.

Enselic added a commit to Enselic/syntect that referenced this pull request Oct 15, 2021
To make the upcoming diff for trishume#374 easier to read.
@Enselic
Copy link
Collaborator Author

Enselic commented Nov 25, 2021

All prototype code has been turned into production quality code now, so there is no need to keep this PoC around. See #398.

Closing.

@Enselic Enselic closed this Nov 25, 2021
@Enselic Enselic deleted the poc-lazy-load-syntaxes branch March 14, 2022 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants