PoC: Lazy-load syntaxes to greatly speed up startup time #374
Still two bat syntax highlighting tests failing for CpuInfo and MemInfo for some reason, but it must be something silly that causes it.
Code has now been significantly cleaned up, and all syntect and bat regression tests pass (see Enselic/bat#50). So the code is relatively close to production quality, even though there is still work to be done on it. However, feel free to give any kind of feedback. If you think I am totally on the wrong track, it is completely fine to say so. Maybe you wonder about performance for this code if there are lots of lazy-loaded syntaxes? The answer is that performance is still good. I made a Markdown file that embeds snippets of all its 18 supported languages, and performance is still significantly improved with lazy-loading. If one enables my debug-print, one can see "Syntax X lazy-loaded" being printed as the file is processed, interleaved with highlighted output. The file for the benchmark below is: https://github.com/Enselic/bat/blob/117d5c0db6e86fd25fa9e823a2ff758fc2a26325/tests/benchmarks/test-src/all_markdown_syntaxes.md
In short, deserialization time is negligible compared to the time it takes to do the actual highlighting (regex matching etc).
Thanks for doing this work! I haven't looked at the code yet but your previous PRs have been high quality and I appreciate the effort put into benchmarking and testing. I think this approach to improving loading performance seems like a good one. My life has been busy lately so I've been procrastinating on looking at PRs, although I expect it to get a bit less busy soon. I am in fact more likely to review promptly if you break it into smaller pieces, but this is currently at only a couple hundred lines of delta, which wouldn't be bad to review as one chunk. Just open things up for me to review and I'll hopefully get around to them.
To make the upcoming diff for trishume#374 easier to read.
All prototype code has been turned into production quality code now, so there is no need to keep this PoC around. See #398. Closing.
Hi! The purpose of this PR is to get early feedback on a crude but mostly functional proof-of-concept that greatly speeds up startup time. It does this by making syntaxes inside a `SyntaxSet` lazy-loaded.

This code already […] (`tests::can_parse_issue219` is failing)

Would you be willing to consider merging something like this, after it has been turned into production quality code and all regression tests are passing? (You expressed interest in something along these lines in this comment: #340 (comment). You had impressive intuition, btw!)
Performance numbers
Let's compare the performance of doing `syncat examples/synhtml.rs`:
syncat examples/synhtml.rs
syncat-new examples/synhtml.rs
As you can see, there is a significant speedup! But to really demonstrate the greatness (if I may say so) of this change, let's compare `bat` performance with and without this `syntect` improvement. First, on an empty Markdown file:
bat-new Empty.md --force-colorization
bat Empty.md --force-colorization
Looking good! But the cool thing is, if we embed some `rust` into the Markdown file, syntect will lazy load (and apply) the Rust syntax too, which you can see by the increased startup time:
bat-new Empty.md --force-colorization
How the lazy-loading works
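As a rough illustration of the idea in this section (not the code in this PR), the sketch below keeps each syntax's contexts as serialized bytes and only deserializes them the first time a context of that syntax is requested. All names here (`LazySyntax`, `serialized_contexts`, `get_context`, the toy newline-separated "format") are hypothetical stand-ins; only the overall shape, a context ID made of a syntax index plus a context index and deserialization on first use, is taken from the description.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a parsed context.
#[derive(Debug, PartialEq)]
struct Context {
    name: String,
}

/// A context ID that first names a syntax, then a context within that syntax.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ContextId {
    syntax_index: usize,
    context_index: usize,
}

/// One syntax whose contexts stay serialized until they are needed.
struct LazySyntax {
    name: String,
    /// Assumption: a toy format (newline-separated context names) that
    /// stands in for a real binary dump.
    serialized_contexts: Vec<u8>,
    /// Filled in on first access, then reused.
    contexts: OnceLock<Vec<Context>>,
}

impl LazySyntax {
    fn new(name: &str, context_names: &[&str]) -> Self {
        LazySyntax {
            name: name.to_string(),
            serialized_contexts: context_names.join("\n").into_bytes(),
            contexts: OnceLock::new(),
        }
    }

    /// True once the contexts have been deserialized.
    fn is_loaded(&self) -> bool {
        self.contexts.get().is_some()
    }

    /// Deserializes the contexts on the first call; later calls are cheap.
    fn contexts(&self) -> &[Context] {
        self.contexts.get_or_init(|| {
            String::from_utf8_lossy(&self.serialized_contexts)
                .lines()
                .map(|n| Context { name: n.to_string() })
                .collect()
        })
    }
}

/// A set of syntaxes; constructing it does not deserialize any contexts.
struct SyntaxSet {
    syntaxes: Vec<LazySyntax>,
}

impl SyntaxSet {
    /// Looks up a context, lazy-loading only the syntax that owns it.
    fn get_context(&self, id: ContextId) -> Option<&Context> {
        self.syntaxes
            .get(id.syntax_index)?
            .contexts()
            .get(id.context_index)
    }
}

fn main() {
    let set = SyntaxSet {
        syntaxes: vec![
            LazySyntax::new("Markdown", &["main", "fenced-code"]),
            LazySyntax::new("Rust", &["main", "string"]),
        ],
    };

    // Only the Rust syntax gets deserialized by this lookup.
    let id = ContextId { syntax_index: 1, context_index: 1 };
    let ctx = set.get_context(id).expect("context exists");
    println!("found context {:?}", ctx.name);

    for syntax in &set.syntaxes {
        println!("{} loaded: {}", syntax.name, syntax.is_loaded());
    }
}
```

With this layout, a file that only touches one syntax pays the deserialization cost for that syntax alone, and a file that embeds another language pays for the second syntax when its first context is looked up.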
- […] `Context`s in one big array, we store each `Context` together with its `SyntaxReference`
- […] `ContextId` to include both an index to a syntax, and then an index to a context within that syntax
- […] `SyntaxReference` (contexts, contexts map, variables), and only deserialize when it is needed
- […] `SyntaxSet`. There is no need, because the big data is already compressed, and to get fast deserialization of the binary data, it must not be compressed another time.

What about my previous plans?
Up until recently, my plan to improve bat startup performance was by splitting the giant single `SyntaxSet` up into many smaller pieces. And I managed to make it work well for syntaxes without dependencies. But it turned out to not work all the way, the main two reasons being:
- […] `SyntaxSetBuilder` from many small `SyntaxSet`s. It is certainly possible to make it work, but it will be messy. This must work for the custom assets feature of bat.

So I now think this is the way to go. It is (at least what it looks like so far) much simpler and much more efficient.
Looking forward to your (including your co-maintainers) thoughts on all this!