Load independent and minimal syntax sets when using --language #1787
I took a first look at the code. It looks great - thank you! I have a question before I get into a more detailed review: Why did you decide to perform the "offset and size" handling yourself, instead of trying to shift that work to I haven't looked into the details, but I could imagine that we could ask |
Interesting idea! I would be fine to explore that method when I get time before you do a detailed review. It seems like it should be quick to deserialize raw data, even if that ends up being measured in megs (which would be the case when So let me try this out before we decide what method to use, and before detailed code review 👍 |
Hm, I wouldn't sacrifice this much, performance-wise. I was kind of hoping that |
Hmm, yes, serde does indeed support zero-copy deserialization: https://serde.rs/lifetimes.html If I could get that to work, it would simplify the code a lot. Definitely worth exploring more, so I'll do that.
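For readers following the exchange above, here is a minimal sketch of what manual "offset and size" handling can look like in general: all serialized syntax sets live back to back in one blob, and a small index maps a language name to the byte range that holds it. This is not the code from this PR. `Span`, `pack`, and `lookup` are hypothetical names, the payloads are opaque bytes rather than real serialized syntect data, and matching names by lowercasing them is an assumption made for the example.

```rust
use std::collections::HashMap;

/// Where one serialized syntax set sits inside the combined blob.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    offset: usize,
    size: usize,
}

/// Build step: append every payload to one blob and record its span.
/// Names are lowercased so lookups are case-insensitive (an assumption).
fn pack(entries: &[(&str, &[u8])]) -> (Vec<u8>, HashMap<String, Span>) {
    let mut blob = Vec::new();
    let mut index = HashMap::new();
    for (name, payload) in entries {
        let span = Span {
            offset: blob.len(),
            size: payload.len(),
        };
        blob.extend_from_slice(payload);
        index.insert(name.to_lowercase(), span);
    }
    (blob, index)
}

/// Startup step: return only the bytes that belong to `language`.
/// Yields `None` for an unknown name or a span that does not fit the blob.
fn lookup<'a>(
    blob: &'a [u8],
    index: &HashMap<String, Span>,
    language: &str,
) -> Option<&'a [u8]> {
    let span = index.get(&language.to_lowercase())?;
    let end = span.offset.checked_add(span.size)?;
    blob.get(span.offset..end)
}
```

With a blob built by `pack`, a call such as `lookup(&blob, &index, "kotlin")` hands back just the Kotlin slice, so only that slice would need to be deserialized. The alternative discussed in this thread is to let serde borrow from the blob instead of keeping such an index by hand.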
This significantly speeds up the startup time of bat, since only a single linked SyntaxDefinition is loaded for each file. The size increase of the binary is just ~400 kB.

In order for startup time to be improved, the --language arg must be used, and it must match one of the following names:

"Plain Text", "ActionScript", "AppleScript", "Batch File", "NAnt Build File", "C#", "C", "CSS", "D", "Diff", "Erlang", "Go", "Haskell", "JSON", "Java Properties", "BibTeX", "LaTeX Log", "TeX", "Lisp", "Lua", "MATLAB", "Pascal", "R", "Regular Expression", "Rust", "SQL", "Scala", "Tcl", "XML", "YAML", "Apache Conf", "ARM Assembly", "Assembly (x86_64)", "CMakeCache", "Comma Separated Values", "Cabal", "CoffeeScript", "CpuInfo", "Dart Analysis Output", "Dart", "Dockerfile", "DotENV", "F#", "Friendly Interactive Shell (fish)", "Fortran (Fixed Form)", "Fortran (Modern)", "Fortran Namelist", "fstab", "GLSL", "GraphQL", "Groff/troff", "group", "hosts", "INI", "Jinja2", "jsonnet", "Kotlin", "Less", "LLVM", "Lean", "MemInfo", "Nim", "Ninja", "Nix", "passwd", "PowerShell", "Protocol Buffer (TEXT)", "Puppet", "Rego", "resolv", "Robot Framework", "SML", "Strace", "Stylus", "Solidity", "Vyper", "Swift", "SystemVerilog", "TOML", "Terraform", "TypeScript", "TypeScriptReact", "Verilog", "VimL", "Zig", "gnuplot", "log", "requirements.txt", "Highlight non-printables", "Private Key", "varlink"

Later commits will improve startup time for more code paths.
2403984 to a4fb754
With all preparatory work and investigations done, it is now time to re-visit this PR! 🎉 The code probably needs some more polishing, but it is definitely ready for at least a high-level review. I would say it is probably also ready for a detailed review. Turns out it is not necessary to do zero-copy deserialization. Deserialization into I have verified that the performance numbers are practically the same with this new code as they are in the PR description. Startup time in loop-through mode (see #1747) has become a tiny bit slower since we don't do lazy deserialization of
Note that I have ended up changing terminology from "independent syntax sets" to "minimal syntax sets" because the latter is both easier to write and more accurate. Looking forward to your comments on this new code!
This looks great - Thank you so much! Really looking forward to this.
I added a few (very minor) review comments.
Another question, purely for my understanding: Is it true that we can potentially get rid of (almost all of) the 400 kB overhead at some point? Once the syntaxes in minimal sets can be found by each available code path/method, it shouldn't be necessary to add them again to the large syntax set, right?
If everything goes according to plan, we will get rid of Note that the 400kB overhead is temporary. As we add more syntaxes to I am keeping the known remaining steps up to date in this comment: #951 (comment) (I know that GitHub does not send emails for edits of comments, but I still try to keep that comment up to date nevertheless) |
Left to do before this can be merged:
I have now done some additional verification, and everything seems to work as it should, so I will merge this shortly. I also updated CHANGELOG.md, but those changes are to be seen as preliminary. I think we will want to look over the Performance section when it is time to make the next release.
Ok, so now the #951 work is starting to get interesting.
This PR improves startup time in the following scenario: `--language` is used.

To keep PRs small and digestible, that is the only scenario that this PR improves startup time for. The plan is to improve startup time for all code paths in the upcoming PR.
The binary size only increases by ~400 kB as a result of this PR. That is because each `SyntaxSet` contains only one `SyntaxDefinition`, so it is not possible to optimize further (with current `syntect` data structures).

List of syntaxes that this PR (should) improve startup time for:
Some example benchmarks:
bat-pr -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt
bat -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt
bat-pr -f --language c tests/benchmarks/test-src/miniz.c
bat -f --language c tests/benchmarks/test-src/miniz.c
bat-pr -f --language rust examples/simple.rs
bat -f --language rust examples/simple.rs
NOTE: I am not updating `syntaxes.bin`, so e.g. the HTTP syntax is missing from that file, but present in `independent_syntax_sets.bin`, which can be a bit confusing unless you know about it.

NOTE: I am reserving the right to change the format of the new binary files in incompatible ways, until we have made a release. After we have made a release, we should try to be backwards compatible, of course.
I have not yet thoroughly verified this change for edge cases etc., but all regression tests pass, and I'm pretty happy with the code, so this is certainly ready for a real code review round.