Some optimizations (3) #401
base: master
Conversation
Hi, I recently wrote a new crate that helps define tagged pointers. I haven't published it yet. Would you be interested in reviewing and using it?
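The crate mentioned above isn't published, so its API is unknown; as background, here is a minimal sketch of the general tagged-pointer technique (all names here are hypothetical, not the crate's): pointer alignment frees up low bits, which can store a small tag alongside the address.

```rust
use std::marker::PhantomData;

/// Hypothetical tagged pointer: packs a 2-bit tag into the low bits of a
/// pointer to a type with alignment >= 4. Not the unpublished crate's API.
struct Tagged<T> {
    raw: usize,
    _marker: PhantomData<*mut T>,
}

impl<T> Tagged<T> {
    const MASK: usize = 0b11; // low bits freed up by alignment >= 4

    fn new(ptr: *mut T, tag: usize) -> Self {
        assert!(std::mem::align_of::<T>() >= 4, "need 2 spare low bits");
        assert!(ptr as usize & Self::MASK == 0 && tag <= Self::MASK);
        Tagged { raw: ptr as usize | tag, _marker: PhantomData }
    }

    fn tag(&self) -> usize {
        self.raw & Self::MASK
    }

    fn ptr(&self) -> *mut T {
        (self.raw & !Self::MASK) as *mut T
    }
}
```

The payoff is that a pointer plus a small enum discriminant fits in one machine word instead of two, which shrinks graph vertices.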
Force-pushed from 3b149ba to a32e718.
Force-pushed from a32e718 to bc09acc.
Speed: 400%
I discovered another representation of parent stacks: it records the parent vertex of the last …
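The comment above is truncated, but it appears to describe a parent-pointer representation; a minimal sketch under that assumption (types simplified, not the PR's actual code): the stack of enclosing delimiters becomes a persistent linked list, so many vertices can share the same tail instead of each owning a full copy.

```rust
use std::rc::Rc;

/// Assumed shape of a parent-pointer stack: each node records only its own
/// delimiter and a shared reference to its parent.
#[derive(Debug, PartialEq)]
enum Stack {
    Empty,
    Node { delimiter: char, parent: Rc<Stack> },
}

/// Pushing allocates one node; the parent chain is shared, not copied.
fn push(parent: &Rc<Stack>, delimiter: char) -> Rc<Stack> {
    Rc::new(Stack::Node { delimiter, parent: Rc::clone(parent) })
}

/// Popping just steps to the parent; other stacks sharing it are unaffected.
fn pop(stack: &Rc<Stack>) -> Option<(char, Rc<Stack>)> {
    match &**stack {
        Stack::Empty => None,
        Stack::Node { delimiter, parent } => Some((*delimiter, Rc::clone(parent))),
    }
}
```

With this layout, pushing is O(1) in both time and memory regardless of stack depth, which matters when the graph search creates many vertices with nearly identical stacks.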
@Wilfred Hi, I wrote a summary that you might be interested in. |
Force-pushed from f746078 to 813a16c.
Force-pushed from 813a16c to a0c3293.
@QuarticCat observed that explicit graph nodes for popping delimiters are unnecessary, and saw a speedup in PR #401. This reduces the number of nodes in typical graphs by ~20%, reducing runtime and memory usage.

This works because there is only one thing we can do at the end of a list: pop the delimiter. The syntax node on the other side does not give us more options; we have at most one. Popping all the delimiters as soon as possible is equivalent, and produces the same graph route.

This change has also slightly changed the output of sample_files/slow_after.rs, producing a better (more minimal) diff. This is probably luck, due to the path-dependent nature of the route solving logic, but it's a positive sign.

A huge thanks to @QuarticCat for their contributions, this is a huge speedup.

Co-authored-by: QuarticCat <QuarticCat@pm.me>
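The forced-move argument above can be sketched with simplified types (real difftastic vertices are more involved; the Vec-based stack here is only for illustration): whenever the innermost list has no unvisited children, popping its delimiter is the only legal move, so a run of forced pops can be collapsed into one step instead of one graph node each.

```rust
/// Illustrative sketch: `stack` holds enclosing delimiters, and
/// `remaining[i]` counts unvisited children of the i-th enclosing list.
/// Pop every delimiter whose list is exhausted, since no other move exists.
fn pop_all_exhausted(
    mut stack: Vec<&'static str>,
    mut remaining: Vec<usize>,
) -> (Vec<&'static str>, Vec<usize>) {
    while let (Some(_), Some(&0)) = (stack.last(), remaining.last()) {
        stack.pop(); // the pop is forced: collapse it into this step
        remaining.pop();
    }
    (stack, remaining)
}
```

Because every intermediate state along a forced run has exactly one successor, skipping straight to the end of the run cannot change which routes exist, only how many nodes represent them.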
I realized I forgot to post the latest progress (afe07ea - 80eae29). These commits have virtually no effect on speed, but memory usage dropped to 125MB, which is ~23% of the original. To further optimize memory usage, you can:
These might give you another ~5MB reduction, and that will probably also yield slightly better performance from improved locality and memory access patterns.
I guess you are not willing to adopt some of my optimizations, which is fine. But there are still some harmless yet beneficial optimizations:
BTW, I found a TODO in your current code:
I don't think this is a good idea; even a Vec is better than a HashMap here. And you still have a big problem: how to implement …

According to my profiling results, most of the time was spent on hash map operations. A hash function distributes keys randomly across slots, so in a large hash map (larger than your cache) that means near-certain cache misses on every lookup. I have a benchmark that verifies this idea and I hope it can bring you some inspiration.
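To make the locality argument concrete, here is a small sketch (names like `NodeId` are assumptions, not difftastic's types): with dense integer keys, a Vec indexed by ID walks contiguous memory, while a HashMap scatters the same keys across hashed buckets, which is what turns into cache misses once the table outgrows the cache.

```rust
use std::collections::HashMap;

/// Hypothetical dense integer key.
type NodeId = usize;

/// Dense lookup: `costs[id]` touches contiguous memory.
fn sum_vec(costs: &[u64], ids: &[NodeId]) -> u64 {
    ids.iter().map(|&id| costs[id]).sum()
}

/// Hashed lookup: equivalent result, but each access lands in a
/// hash-determined bucket, destroying spatial locality on large tables.
fn sum_map(costs: &HashMap<NodeId, u64>, ids: &[NodeId]) -> u64 {
    ids.iter().map(|id| costs[id]).sum()
}
```

Both functions compute the same answer; the difference only shows up in the memory access pattern, which is exactly what a cache-focused benchmark would measure.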
Hey @QuarticCat! Yep, I won't accept all these commits in their current form, but there are so many good ideas here that I still want to look at and apply. For example, in #395 you make the really clever observation that a Vertex only ever has a …

I'm not sure about the exact wording of the APIs, and the …

I've also been thinking about the …
This was inspired by your work that made vertices smaller :). I was thinking that the HashMap would not be stored in Vertex; instead, Vertex would only contain a SyntaxId, and I could have a separate global …

It's possible that the additional hashmap lookup would hurt performance, but overall (again based on your observations) it looks like performance is very sensitive to the size of Vertex. It depends on what the benchmarks show, of course.
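A minimal sketch of that layout, under assumed names and types (`SyntaxId` as a `u32`, a `u64` payload standing in for the per-node data): the Vertex stays tiny, and the per-node map lives in one shared side table keyed by SyntaxId rather than being embedded in every Vertex.

```rust
use std::collections::HashMap;

/// Hypothetical ID type; the real one may differ.
type SyntaxId = u32;

/// Vertex stays small: no embedded HashMap, just the ID.
struct Vertex {
    syntax: SyntaxId,
}

/// One shared side table for data previously stored inside each Vertex.
struct SideTable {
    info: HashMap<SyntaxId, u64>,
}

impl SideTable {
    /// The extra indirection mentioned above: one hashmap lookup per access.
    fn lookup(&self, v: &Vertex) -> Option<u64> {
        self.info.get(&v.syntax).copied()
    }
}
```

The trade-off is exactly as the comment says: each access pays one extra lookup, but every Vertex shrinks to the size of the ID, and the search allocates far more Vertices than it performs lookups on cold data.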
(I'm also still hopeful that I'll eventually get an A* implementation that's a performance win too. So many interesting avenues to explore!)
Bump smoke test targets
A continuation of #393 and #395.
Use the same benchmark and baseline as in #395.
Speed: 383%
Memory: 31%
Two major optimizations:
(5e5eef2, 7e4e747) Use a bitmap to represent the stack, with the constraint that the stack holds no more than 63 entries (i.e. 63 layers of delimiters, which should be rare to exceed); this limit can be easily extended.
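One plausible encoding behind the 63-entry limit (a sketch, not necessarily the PR's exact scheme): pack a stack of booleans into a single `u64`, with a sentinel 1 bit marking the top, which leaves room for at most 63 entries.

```rust
/// Bitmap stack of booleans in one u64. The highest set bit is a sentinel
/// marking the top of the stack, so capacity is 63 entries.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BitStack(u64);

impl BitStack {
    fn new() -> Self {
        BitStack(1) // sentinel only: empty stack
    }

    fn push(self, bit: bool) -> Self {
        // The sentinel must not be shifted out of the top bit.
        assert!(self.0.leading_zeros() > 0, "at most 63 entries");
        BitStack((self.0 << 1) | bit as u64)
    }

    fn pop(self) -> Option<(bool, Self)> {
        if self.0 == 1 {
            return None; // only the sentinel left
        }
        Some((self.0 & 1 != 0, BitStack(self.0 >> 1)))
    }

    fn is_empty(self) -> bool {
        self.0 == 1
    }
}
```

Because the whole stack fits in a register, it is `Copy`, two stacks compare with a single integer comparison, and hashing a vertex that contains one is cheap — a big contrast with a heap-allocated `Vec` stack.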
(10ce859 to edc5516) Prune all ExitDelimiter* nodes and edges, which account for around 15%–25% of all nodes, depending on the input. This also fixes the following diff, by the way:

Before:
After:
I think my work deserves a version bump and maybe a reddit announcement XD.