Gitoxide uses significantly more memory than git when cloning semisynthetic repos #851
…much. (#851) Previously it would take a buffer from the free-list, copy data into it, and when exceeding the capacity lose it entirely. Now the free-list is handled correctly.
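A minimal sketch of the free-list idea described above, not gitoxide's actual types; `FreeList`, `checkout`, and `put_back` are made-up names for illustration. The point is that a buffer which outgrew its original capacity is returned to the list rather than dropped:

```rust
/// Hypothetical sketch of a buffer free-list that never loses grown buffers.
struct FreeList {
    buffers: Vec<Vec<u8>>,
}

impl FreeList {
    fn new() -> Self {
        FreeList { buffers: Vec::new() }
    }

    /// Take a buffer from the list, or allocate a fresh one if the list is empty.
    fn checkout(&mut self) -> Vec<u8> {
        self.buffers.pop().unwrap_or_default()
    }

    /// Return a buffer after use. Crucially, the buffer is kept even if it
    /// grew past its original capacity, instead of being dropped.
    fn put_back(&mut self, mut buf: Vec<u8>) {
        buf.clear(); // keep the capacity, discard the contents
        self.buffers.push(buf);
    }
}

fn main() {
    let mut list = FreeList::new();
    let mut buf = list.checkout();
    buf.extend_from_slice(b"some decompressed object data");
    list.put_back(buf); // capacity is preserved for the next user
}
```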
Previously, the 64-slot LRU cache for pack deltas didn't use any memory limit, which could lead to memory exhaustion in the face of atypically large objects. Now we add a generous default limit to do *better* in such situations. It's worth noting though that even without any cache, the working set of buffers to do delta resolution takes considerable memory, despite trying to keep it minimal. Note that for bigger objects, the cache is now not used at all, which probably leads to terrible performance as not even the base object can be cached.
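As an illustration of the memory-limited cache idea (not the actual gitoxide implementation; the type name, the eviction details, and the numbers are all assumptions), a cache can track the bytes it holds, refuse to store overly large objects, and evict old entries once a byte budget is exceeded:

```rust
use std::collections::VecDeque;

/// Illustrative sketch: an LRU-style cache for decompressed pack entries that
/// tracks the total bytes it holds and refuses objects that would blow the budget.
struct MemoryCappedCache {
    entries: VecDeque<(u64, Vec<u8>)>, // (pack offset, object data), front = oldest
    used_bytes: usize,
    limit_bytes: usize, // e.g. a default in the tens of megabytes
}

impl MemoryCappedCache {
    fn new(limit_bytes: usize) -> Self {
        Self { entries: VecDeque::new(), used_bytes: 0, limit_bytes }
    }

    fn get(&mut self, offset: u64) -> Option<&[u8]> {
        // Move a hit to the back so it is evicted last (the LRU part).
        let pos = self.entries.iter().position(|(o, _)| *o == offset)?;
        let entry = self.entries.remove(pos)?;
        self.entries.push_back(entry);
        self.entries.back().map(|(_, data)| data.as_slice())
    }

    fn put(&mut self, offset: u64, data: Vec<u8>) {
        if data.len() > self.limit_bytes {
            return; // too big: caching it would only evict everything else
        }
        self.used_bytes += data.len();
        self.entries.push_back((offset, data));
        // Evict oldest entries until we are back under the limit.
        while self.used_bytes > self.limit_bytes {
            match self.entries.pop_front() {
                Some((_, evicted)) => self.used_bytes -= evicted.len(),
                None => break,
            }
        }
    }
}
```

A real implementation would likely use a proper LRU structure instead of the linear scan in `get`; the sketch only shows how a byte budget changes admission and eviction.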
Previously, when traversing a pack it could appear to hang, as interrupt checks were only performed at the chunk or base (of a delta-tree) level. Now interrupt checks are performed more often to stop all work much more quickly.
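The more frequent interrupt checks could look roughly like this; the function and flag names are invented for illustration, and checking once per entry is just one possible granularity:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Sketch of per-entry interrupt checks. Instead of only checking once per
/// chunk or per delta base, the flag is consulted for every entry so that a
/// requested interrupt stops the traversal quickly.
fn process_chunk(entries: &[u64], should_interrupt: &AtomicBool) -> Result<(), &'static str> {
    for &entry_offset in entries {
        if should_interrupt.load(Ordering::Relaxed) {
            return Err("interrupted");
        }
        // ... decode and verify the entry at `entry_offset` here ...
        let _ = entry_offset;
    }
    Ok(())
}
```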
#851) This change will put more of the delta-chain into the cache, which possibly leads to increased chances of cache hits if objects aren't queried in random order but in pack-offset order. Note that in general, it tends to be faster to not use any cache at all. This change was pruned back right away: the major difference from git, which stores every object of the chain in its cache, is that we don't share the cache among threads. This leaves a much smaller per-thread cache size, which really is a problem if the objects are large. So instead of slowing pack access down by trying it, with the default cache being unsuitable as it would evict all the time due to memory overruns, we do nothing here and rather improve performance when dealing with pathological cases during pack traversal.
…ry limits. (#851) Previously the 64-slot LRU cache didn't have any limit; now one is implemented that defaults to about 96MB.
Threads started for working on an entry in a slice can now see the number of threads left for use (and manipulate that variable), which effectively allows them to implement their own parallelization on top of the current one. This is useful if there is very imbalanced work within the slice itself.
Previously we would use `resize(new_len, 0)` to resize buffers, even though these values would then be overwritten (or the buffer isn't available). Now we use `set_len(new_len)` after calling `reserve` to do the same, but save a memset.
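A sketch of the `reserve` + `set_len` pattern described above; the helper name is made up, and the safety argument only holds because the caller overwrites the bytes before reading them, as the note says:

```rust
/// Grow `buf` to `new_len` without writing zeroes first.
///
/// The added bytes are uninitialized, so this is only sound if the caller
/// completely overwrites `buf[..new_len]` before reading it, e.g. by
/// decompressing a delta result into it right away.
fn grow_without_zeroing(buf: &mut Vec<u8>, new_len: usize) {
    if new_len > buf.len() {
        buf.reserve(new_len - buf.len());
    }
    // SAFETY: `reserve` guaranteed capacity >= new_len; the caller promises
    // to overwrite the uninitialized tail before reading it.
    unsafe { buf.set_len(new_len) };
}

fn main() {
    let mut buf = Vec::with_capacity(16);
    grow_without_zeroing(&mut buf, 8);
    buf[..8].copy_from_slice(b"overwrit"); // overwrite before any read
    assert_eq!(&buf[..], b"overwrit");
}
```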
When delta-trees are unbalanced, in pathological cases it's possible that one thread ends up with more than half of the work. In this case it's required that it manages to spawn its own threads to parallelize the work it has.
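Putting the "threads left for use" counter and the nested spawning together, here is a sketch under the assumption of a shared atomic budget and scoped threads; all names and the splitting strategy are illustrative, not gitoxide's actual implementation:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Sketch: a worker that got an unusually large slice of the delta tree can
/// consult a shared budget of unused threads and split its own work further.
fn resolve_entries(entries: &[u64], threads_left: &AtomicUsize) {
    // Only try to reserve an extra thread if splitting is actually useful.
    let can_split = entries.len() > 1
        && threads_left
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
            .is_ok();

    if can_split {
        let (left, right) = entries.split_at(entries.len() / 2);
        thread::scope(|s| {
            s.spawn(|| resolve_entries(left, threads_left));
            resolve_entries(right, threads_left);
        });
        // Give the reserved thread back once the nested work is done.
        threads_left.fetch_add(1, Ordering::SeqCst);
    } else {
        for &offset in entries {
            // ... resolve the delta entry at `offset` here ...
            let _ = offset;
        }
    }
}

fn main() {
    let entries: Vec<u64> = (0..1_000).collect();
    let threads_left = AtomicUsize::new(3); // e.g. 4 cores, one already busy
    resolve_entries(&entries, &threads_left);
}
```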
For completeness, here is an analysis of the pack, with the most impressive 1:1139 compression ratio:
Threads started for working on an entry in a slice can now see the number of threads left for use (and manipulate that variable), which effectively allows them to implement their own parallelization on top of the current one. This is useful if there is very imbalanced work within the slice itself. While at it, we also make consumer functions mutable as they exist per thread.
It's still a little rough around the edges, but thanks to improved parallelization the workload now finishes in
Peak memory scales with the amount of parallelism, so the usage is definitely higher than |
Now it's 3 times faster :)
Still pretty interesting why you have |
Three times isn't as much as I would have hoped for with a 32-core machine, but then again, maybe it is something about the hardware. I am using an M1 Pro and it did 'kick like a mule' in this workload, with the highest number I have ever seen due to overcommitting (something that shouldn't have happened, but there might be something off about the thread-counting). Maybe it's that good because of the high memory throughput of 200GB/s? Knowing the delta-apply algorithm, it's clear why it would be so intense. SHA1 performance is also important for this workload though, and typically the bottleneck. Maybe you can run some tests (once the PR is finalized today) with different thread-counts to see what happens - maybe some CPUs that bottleneck elsewhere prefer lower core-counts. At least running it with fewer cores will keep the memory usage down, as 28GB seems a bit excessive, even though there is nothing that can be done about it since it's strictly needed per-thread working memory which balloons thanks to the massive file sizes. Edit: I fixed the over-allocation issue and maybe that even made it a little slower on my machine (but better to leave over-commits in the hands of the user and the
Despite some details still to be sorted, I think from a performance point of view this is the final result. The above stems from To get memory usage down to about
I had to try harder to stay within the 5GB memory envelope and didn't manage even with
This definitely has a strong flavor of non-determinism as
And even though memory usage is still all over the place, another tweak later and the performance has more than recovered.
There is one more thing I can imagine has a positive impact on memory usage, but it's a bit of a tough sell with current Rust, where closures can't return data they own (or reference, for that matter). Another tweak later and it's now possible to avoid yet another data copy in the hope that it reduces stress on the allocator, but apparently to little effect, except that it might have gotten a little faster. Here is the
And here is the same but with
Trying once more with even more optimal memory allocation, it got faster even though overall, the peak-usage didn't change much (
The same with
@0xdeafbeef I'd be very interested in seeing what the latest |
It looks like this also caused a regression 😅. Here is how fast it was:
And here is how slow it is now:
It seems that small-object performance suffered, but I am optimistic that it is possible to get that performance back. It's probably related to how buffers are handled now.
Interestingly, it doesn't scale well with CPU count:
Thanks for sharing the measurements! It's definitely interesting to see how the sweet-spot seems to be around 12 threads, with the most efficient option being 8 threads or so. I think the CPU might be bottle-necking somewhere, as from my observations the work-stealing and synchronization around it isn't a problem at just ~200 objects per second, with each sync taking more than one piece of work from the queue. On a Ryzen, I have seen it scale to 35GB/s decoding speed which, yes, means it can decode 100GB of kernel code in just about 3.3s 😅. What's more alarming is the memory usage, which is way too high even with |
It looks like |
I finally managed to get another measurement on a 128-core Intel machine with 256GB of RAM.
Performance isn't as crazy as it probably should be, but fortunately peak memory usage isn't proportional to the core count either. The major number here is 43.2GB/s, which is three times faster than my M1 Pro, so roughly equivalent to 30 cores. Since it's a cloud machine, it might be throttled in various ways though.
Duplicates
Current behavior 😯
Regular git uses 5GB of RAM for indexation and spends 55m47s on clone:
gix uses 28GB of RAM for indexation and spends 1h31m on clone:
All tests were done on tmpfs in RAM.
Expected behavior 🤔
No response
Steps to reproduce 🕹