PoC: Lazy-load syntaxes to greatly speed up startup time #374

Enselic · 2021-10-01T20:56:58Z

Hi! The purpose of this PR is to get early feedback on a crude but mostly functional proof-of-concept that greatly speeds up startup time. It does this by making syntaxes inside a SyntaxSet lazy-loaded.

This code already

Passes almost all syntect regression tests (only tests::can_parse_issue219 is failing)
Passes almost all bat regression tests
Passes almost all bat syntax highlighting regression tests

Would you be willing to consider merging something like this, after it has been turned into production quality code and all regression tests are passing? (You expressed interest for something along these lines in this comment: #340 (comment). You had impressive intuition, btw!)

Performance numbers

Let's compare the performance of doing syncat examples/synhtml.rs:

% hyperfine --export-markdown /dev/stdout -L syncat syncat,syncat-new '{syncat} examples/synhtml.rs'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`syncat examples/synhtml.rs`	62.3 ± 2.8	58.1	74.6	3.65 ± 0.30
`syncat-new examples/synhtml.rs`	17.1 ± 1.2	15.2	20.4	1.00

As you can see, there is a significant speedup! But to really demonstrate the greatness (if I may say so) of this change, let's compare bat performance with and without this syntect improvement. First, on an empty Markdown file:

% touch Empty.md
% hyperfine --export-markdown /dev/stdout -L bat bat-new,bat '{bat}  Empty.md --force-colorization'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-new Empty.md --force-colorization`	21.5 ± 1.5	19.4	26.5	1.00
`bat Empty.md --force-colorization`	110.5 ± 2.1	106.5	114.4	5.14 ± 0.37

Looking good! But the cool thing is, if we embed some rust into the Markdown file, syntect will lazy load (and apply) the Rust syntax too, which you can see by the increased startup time:

% echo '```rust                                                                               
fn main() { println!("hi"); }
```' >> Empty.md
% hyperfine --export-markdown=/dev/stdout 'bat-new Empty.md --force-colorization'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-new Empty.md --force-colorization`	37.8 ± 1.5	35.3	42.8	1.00

How the lazy-loading works

Instead of storing all Contexts in one big array, we store each Context together with its SyntaxReference
We change ContextId to include both an index to a syntax, and then an index to a context within that syntax
We serialize and compress the bulk of a SyntaxReference (contexts, contexts map, variables), and only deserialize when it's is needed
This is crucial: We DO NOT compress the serialized version of SyntaxSet. There is no need, because the big data is already compressed, and to get fast deserialization of the binary data, it must not be compressed another time.

What about my previous plans?

Up until recently, my plan to improve bat startup performance was by splitting the giant single SyntaxSet up into many smaller pieces. And I managed to make it work well for syntaxes without dependencies. But it turned out to not work all the way, the main two reasons being:

There are a LOT of inter-dependencies between syntaxes. It is not possible to split them up into smaller parts without a lot of of duplications (in the order of megabytes)
syntect currently does not support building a SyntaxSetBuilder from many small SyntaxSets. It is certainly possible to make it work, but it will be messy. This must work for the custom assets feature of bat.

So I now think this is the way to go. It is (at least what it looks like so far) much simpler and much more efficient.

Looking forward to your (including your co-maintainers) thoughts on all this!

Still two bat syntax highlighting tests failing for CpuInfo and MemInfo for some reason, but it must be something silly that causes it.

Enselic · 2021-10-05T20:14:22Z

Code has now been significantly cleaned up, and all syntect and bat regression tests pass (see Enselic/bat#50). So the code is relatively close to production quality, even though there is still work to be done on it.

However, feel very free to come with any kind of feedback. If you think I am totally on the wrong track it is completely fine to say so.

Maybe you wonder about performance for this code if there are lots of lazy-loaded syntaxes? The answer is that performance is still good. I made a Markdown file that embeds snippets of all its 18 supported languages, and performance is still significantly improved with lazy-loading. If one enables my debug-print, one can see "Syntax X lazy-loaded" being printed as the file is processed, interleaved with highlighted output.

The file for the benchmark below is: https://github.com/Enselic/bat/blob/117d5c0db6e86fd25fa9e823a2ff758fc2a26325/tests/benchmarks/test-src/all_markdown_syntaxes.md

hyperfine -L bat bat,bat-new '{bat} --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md`	200.5 ± 5.6	194.7	215.4	1.46 ± 0.05
`bat-new --pager=never --color=always --style=plain tests/benchmarks/test-src/all_markdown_syntaxes.md`	137.7 ± 2.1	134.6	142.7	1.00

In short, deserialization time is negligable compared to the time it takes to do the actual highlighting (regex matching etc).

trishume · 2021-10-15T00:53:16Z

Thanks for doing this work! I haven't looked at the code yet but your previous PRs have been high quality and I appreciate the effort into benchmarking and testing. I think this approach to improving loading performance seems like a good one.

My life has been busy lately so I've been procrastinating on looking at PRs, although I expect it to get a bit less busy soon. I am in fact more likely to review promptly if you break it into smaller pieces, but this is currently at only a couple hundred lines of delta, which wouldn't be bad to review as one chunk. Just open things up for me to review and I'll hopefully get around to them.

To make the upcoming diff for trishume#374 easier to read.

Enselic · 2021-11-25T18:59:19Z

All prototype code has been turned into production quality code now, so there is no need to keep this PoC around. See #398.

Closing.

Enselic added 5 commits October 1, 2021 22:07

PoC: Lazy-load syntaxes to greatly speed up startup time

6da85aa

ignore tests::can_parse_issue219, should make CI green

67d6032

CI.yml: Add missing 'dump-create', _now_ CI should be green

275b44b

Merge remote-tracking branch 'origin/master' into poc-lazy-load-syntaxes

db9c422

Clean up code and make all syntect and bat regression tests pass

1d7806c

Still two bat syntax highlighting tests failing for CpuInfo and MemInfo for some reason, but it must be something silly that causes it.

Enselic mentioned this pull request Oct 6, 2021

Make ContextId implement Hash and simplify the code #377

Merged

Enselic marked this pull request as draft October 9, 2021 18:38

Enselic added a commit to Enselic/syntect that referenced this pull request Oct 15, 2021

Introduce find_unlinked_contexts_in_context() helper

8ba2205

To make the upcoming diff for trishume#374 easier to read.

Enselic mentioned this pull request Oct 15, 2021

Introduce find_unlinked_contexts_in_context() helper #378

Merged

Enselic closed this Nov 25, 2021

Enselic deleted the poc-lazy-load-syntaxes branch March 14, 2022 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PoC: Lazy-load syntaxes to greatly speed up startup time #374

PoC: Lazy-load syntaxes to greatly speed up startup time #374

Enselic commented Oct 1, 2021 •

edited

Loading

Enselic commented Oct 5, 2021 •

edited

Loading

trishume commented Oct 15, 2021

Enselic commented Nov 25, 2021

PoC: Lazy-load syntaxes to greatly speed up startup time #374

PoC: Lazy-load syntaxes to greatly speed up startup time #374

Conversation

Enselic commented Oct 1, 2021 • edited Loading

Performance numbers

How the lazy-loading works

What about my previous plans?

Enselic commented Oct 5, 2021 • edited Loading

trishume commented Oct 15, 2021

Enselic commented Nov 25, 2021

Enselic commented Oct 1, 2021 •

edited

Loading

Enselic commented Oct 5, 2021 •

edited

Loading