Gitoxide uses significantly more memory than git when cloning semisynthetic repos #851

Closed
1 task done
0xdeafbeef opened this issue May 13, 2023 · 11 comments · Fixed by #852

@0xdeafbeef

Duplicates

  • I have searched the existing issues

Current behavior 😯

Regular git uses 5GB of RAM for indexation and spends 55m47s on the clone:

/usr/bin/time -vv  git clone https://github.com/fz139/vigruzki.git vigruzki2
Cloning into 'vigruzki2'...
remote: Enumerating objects: 570224, done.
remote: Counting objects: 100% (8218/8218), done.
remote: Compressing objects: 100% (4097/4097), done.
remote: Total 570224 (delta 4128), reused 8211 (delta 4121), pack-reused 562006
Receiving objects: 100% (570224/570224), 5.74 GiB | 14.37 MiB/s, done.
Resolving deltas: 100% (240329/240329), done.
        Command being timed: "git clone https://github.com/fz139/vigruzki.git vigruzki2"
        User time (seconds): 12655.67
        System time (seconds): 3815.96
        Percent of CPU this job got: 492%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 55:47.09
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5335480
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 2987959137
        Voluntary context switches: 6497363
        Involuntary context switches: 2048852
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

gix uses 28GB of RAM for indexation and spends 1h31m on the clone:

 /usr/bin/time -vv gix clone https://github.com/fz139/vigruzki.git
 20:27:37 read pack done 6.2GB in 437.75s (14.1MB/s)
 20:27:37  indexing done 570.2k objects in 428.24s (1.3k objects/s)
 20:27:37 decompressing done 13.6GB in 428.24s (31.8MB/s)
 21:51:31     Resolving done 570.2k objects in 5033.41s (113.0 objects/s)
 21:51:31      Decoding done 7.0TB in 5033.41s (1.4GB/s)
 21:51:31 writing index file done 19.4MB in 0.03s (626.6MB/s)
 21:51:31  create index file done 570.2k objects in 5461.80s (104.0 objects/s)
 21:51:31           checkout done 5.0 files in 0.35s (14.0 files/s)
 21:51:31            writing done 245.2MB in 0.35s (705.7MB/s)
HEAD:refs/remotes/origin/HEAD (implicit)
        1b49858194dc61cb0349a14a6d81a6061dde23b7 HEAD symref-target:refs/heads/main -> refs/remotes/origin/HEAD [new]
+refs/heads/*:refs/remotes/origin/*
        1b49858194dc61cb0349a14a6d81a6061dde23b7 refs/heads/main -> refs/remotes/origin/main [new]
        Command being timed: "gix clone https://github.com/fz139/vigruzki.git"
        User time (seconds): 7947.84
        System time (seconds): 929.27
        Percent of CPU this job got: 162%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31:12
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 27360768
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 468633190
        Voluntary context switches: 3885411
        Involuntary context switches: 1039556
        Swaps: 0
        File system inputs: 56
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

All tests were done on tmpfs in RAM

git version 2.40.1

gix --version
gix-plumbing 0.25.0

Kernel: 6.2.11-300.fc38.x86_64
CPU: AMD Ryzen 9 7950X (32) @ 4.500GHz

Memory: 127923MiB

Expected behavior 🤔

No response

Steps to reproduce 🕹

  1. gix clone https://github.com/fz139/vigruzki.git
@Byron
Member

Byron commented May 14, 2023

Thanks a lot for reporting this incredibly interesting case!

My hypothesis for what's happening here is the following: there are three big files with 100MB each and nearly 110k commits which update at least one, if not all, data dumps each commit. Maybe it's a log-like append-only file format which means delta compression will have a good time building very long chains.

Indexing 540k objects will not use 28GB of RAM, and 10m objects aren't a problem either. However, gitoxide parallelizes over all 32 cores and traverses the delta-chains. Per thread, it keeps at least one base in memory and applies a delta to generate the next base. Depending on the delta-tree's topology, it may have to keep multiple bases in memory at the same time, and all of that per thread. In the best case, each thread will have a buffer for the base object, along with a buffer for the undeltified destination that will become the new base. With 32 cores that's 2 × 100MB × 32 = 6.4GB of memory. It's easy to imagine how, due to topology, this working set got larger, or we see the effects of memory fragmentation. The latter shouldn't happen as buffers are reused, but I could imagine that this is related to re-allocation as the buffers might be subtly growing.
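
To make that per-thread working set concrete, here is a minimal sketch (not gitoxide's actual code; `apply_delta` is a hypothetical stand-in) of what each thread holds while walking down a delta chain:

```rust
// Hypothetical sketch of the per-thread working set described above: one
// buffer holds the current base, another receives the result of applying a
// delta, and the two are swapped to walk down the chain. With ~100MB bases,
// that's roughly 200MB per thread before any caching.
fn resolve_chain(root: Vec<u8>, deltas: &[Vec<u8>]) -> Vec<u8> {
    let mut base = root;      // decompressed base object
    let mut out = Vec::new(); // destination for the undeltified result
    for delta in deltas {
        out.clear();
        apply_delta(&base, delta, &mut out); // hypothetical delta application
        std::mem::swap(&mut base, &mut out); // the result becomes the next base
    }
    base
}

fn apply_delta(_base: &[u8], _delta: &[u8], _out: &mut Vec<u8>) {
    // placeholder: copy/insert instructions from the delta would be applied here
}
```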

[Screenshot 2023-05-14 at 08:41:33]

Something else I noticed is that the performance was degrading over time as more and more cores stopped getting any data. This is expected to happen on imbalanced delta-trees where one root node (i.e. the first base) has a lot of deltas on top of it. That's how the first 1.5TB of objects was decompressed at about 12GB/s using all cores, but at 3TB there was only 1 core left to process the remaining 4TB at only 1.8GB/s - 2.1GB/s.

[Screenshot 2023-05-14 at 08:39:39]

When looking at the profile it's interesting to see that the delta-application and decompression of pack entries takes insignificant time. Instead, it is really, really busy computing SHA1s of the objects.

When looking at the output while only one core is left for processing, we see that there are nearly 50k objects left for one core, which means there indeed is a single base that has ~50k deltas on top of it.

[Screenshot 2023-05-14 at 08:42:05]

Unfortunately, over time, memory usage rose, and at 6.7TB of decompressed objects in, it was at 7GB. This is probably due to branching bases, so it had to keep much more than one base in memory at a time. Fortunately, shortly after, it went down to ~5GB again.

Here is the final result of the clone invocation.

❯ /usr/bin/time -lp gix clone https://github.com/fz139/vigruzki.git
 08:28:54 read pack done 6.2GB in 706.56s (8.7MB/s)
 08:28:54  indexing done 570.4k objects in 684.49s (833.0 objects/s)
 08:28:54 decompressing done 13.6GB in 684.49s (19.9MB/s)
 09:11:46     Resolving done 570.4k objects in 2572.62s (221.0 objects/s)
 09:11:46      Decoding done 7.0TB in 2572.62s (2.7GB/s)
 09:11:47 writing index file done 19.4MB in 0.03s (751.0MB/s)
 09:11:47  create index file done 570.4k objects in 3257.24s (175.0 objects/s)
 09:11:47           checkout done 5.0 files in 0.50s (10.0 files/s)
 09:11:47            writing done 245.3MB in 0.50s (493.2MB/s)
HEAD:refs/remotes/origin/HEAD (implicit)
        b3685dccd158bade1a7fe149a7d5f5e43204348f HEAD symref-target:refs/heads/main -> refs/remotes/origin/HEAD [new]
+refs/heads/*:refs/remotes/origin/*
        b3685dccd158bade1a7fe149a7d5f5e43204348f refs/heads/main -> refs/remotes/origin/main [new]
real 3280.63
user 3872.36
sys 307.70
          8123875328  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            20173971  page reclaims
              348743  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                 757  messages sent
             9082910  messages received
                   2  signals received
              164472  voluntary context switches
             8884492  involuntary context switches
      17280558917972  instructions retired
      12706987488513  cycles elapsed
         15496447232  peak memory footprint

git ( master) [?] via 🐍 took 54m40s

In terms of memory usage, we see that it peaked at 8GB of allocated memory when using 10 cores, while the full memory usage (including virtual memory) is 8GB + 7GB (the memory-mapped pack).

It's interesting that in terms of time taken, gitoxide is still competitive despite being hampered in its concurrency, even though it comes at the cost of higher memory usage. However, on my CPU the average decompression rate was nearly twice as high as on the Ryzen 9. Can it be that the sha1 implementation is worse on that platform, or that gitoxide wasn't compiled with the max settings?

Any kind of remedy would try to bring the maximum resident size down.

Workarounds

Bringing down memory usage

The measured memory usage is peak memory; I'd expect the mean to be much lower. However, one way to reduce it is to limit the number of cores used.

For example, gix -t10 clone https://github.com/fz139/vigruzki.git should bring peak usage down to ~5GB; this is what I tested, by merit of only having 8 fast cores and 2 slower ones.

Use shallow clones

Depending on the use case, not having the full history might be sufficient.

gix clone --depth 1 https://github.com/fz139/vigruzki.git would definitely not show any issue, and future fetches wouldn't either unless they are so far apart that it would once again fetch a pack with incredibly long delta chains.

Future solutions

These will still have to be implemented.

Use a different resolution algorithm

Use the 'less-memory' algorithm, which parallelizes per object and uses a couple of caches to speed up pack access, similar to what git does. It should be much faster in this particular case as it can keep the cores busy.

gix no-repo pack verify -a less-memory .git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx pretty much immediately goes out of memory, causing instant swapping, as by default it uses a 64-slot pack delta cache per thread. Thus each thread very quickly fills up 6.4GB of RAM just to operate. There is an alternative cache backend that measures memory instead of units, but it's too slow to be general purpose. Maybe the current slot-cache could be updated to track memory as well?
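
For illustration, here is a minimal sketch of what a slot cache that tracks memory in addition to slot count could look like; this is not the gix-pack implementation, and the names and eviction policy are assumptions.

```rust
use std::collections::VecDeque;

/// Hypothetical LRU slot cache that evicts until both a slot limit and a
/// total-bytes limit are respected.
struct MemBoundedLru {
    slots: VecDeque<(u64, Vec<u8>)>, // (pack offset, decompressed data)
    max_slots: usize,
    max_bytes: usize,
    used_bytes: usize,
}

impl MemBoundedLru {
    fn new(max_slots: usize, max_bytes: usize) -> Self {
        Self { slots: VecDeque::new(), max_slots, max_bytes, used_bytes: 0 }
    }

    fn get(&mut self, offset: u64) -> Option<&[u8]> {
        let pos = self.slots.iter().position(|(o, _)| *o == offset)?;
        let entry = self.slots.remove(pos)?;
        self.slots.push_front(entry); // most recently used goes to the front
        self.slots.front().map(|(_, data)| data.as_slice())
    }

    fn put(&mut self, offset: u64, data: Vec<u8>) {
        if data.len() > self.max_bytes {
            return; // objects bigger than the whole budget are never cached
        }
        self.used_bytes += data.len();
        self.slots.push_front((offset, data));
        // Evict least recently used entries until both limits hold again.
        while self.slots.len() > self.max_slots || self.used_bytes > self.max_bytes {
            match self.slots.pop_back() {
                Some((_, evicted)) => self.used_bytes -= evicted.len(),
                None => break,
            }
        }
    }
}
```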

Work-stealing

Ideally, the existing standard algorithm would be able to deal with such a case gracefully and instead find a way to feed other threads. With the current data structure, that's not possible, as each thread owns a single base in the tree, and only the owner can safely traverse it. I could imagine that each thread operating on a base checks if there are unused threads and, if so, spawns its own threads to help finish the work. If the data structure can support that, I would definitely be tempted to implement it.
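
A minimal sketch of that idea, using a shared counter of unused threads; the actual gix-pack data structures differ, and `process_children` and the chunking are purely illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// A thread that owns a base with many child deltas checks a shared counter of
/// idle threads and, if one is available, claims it to process half of its
/// children in parallel (hypothetical sketch, not gitoxide's implementation).
fn resolve_base(children: Vec<u64>, threads_left: &AtomicUsize) {
    // Only try to claim a helper if there is enough work to split.
    let claimed = children.len() > 1
        && threads_left
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
            .is_ok();
    if claimed {
        let (left, right) = children.split_at(children.len() / 2);
        thread::scope(|s| {
            s.spawn(|| process_children(left)); // helper thread
            process_children(right);            // current thread keeps working too
        });
        threads_left.fetch_add(1, Ordering::SeqCst); // hand the helper back
    } else {
        process_children(&children);
    }
}

fn process_children(_children: &[u64]) {
    // placeholder: decompress each delta, apply it to the base, hash the result
}
```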

Special handling for large objects (out of scope)

In theory, heuristics could apply special handling to large objects. This could be done by memory-mapping them right away, or by unloading them at opportune moments and reloading them before use (at the cost of having to traverse a possibly long delta chain). Note that this itself might be prohibitive unless a proper delta-cache is used, as 200GB/s of memory bandwidth would only be good for 2000 deltas/s in this scenario.
However, big-file support is probably its very own topic, and this issue can be solved by limiting the number of threads and increasing parallelism.
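
Purely as an illustration of the memory-mapping idea (out of scope as noted; the threshold and the `memmap2` dependency are assumptions, not anything gitoxide does today), large bases could live in their own anonymous mappings so their memory is returned to the OS on drop, independently of the allocator:

```rust
use memmap2::MmapMut;

// Hypothetical size cutoff; anything at or above it bypasses the allocator.
const LARGE_OBJECT_THRESHOLD: usize = 64 * 1024 * 1024;

enum BaseBuffer {
    Heap(Vec<u8>),   // small objects: regular heap allocation
    Mapped(MmapMut), // large objects: anonymous mapping, unmapped on drop
}

fn allocate_base(size: usize) -> std::io::Result<BaseBuffer> {
    if size >= LARGE_OBJECT_THRESHOLD {
        Ok(BaseBuffer::Mapped(MmapMut::map_anon(size)?))
    } else {
        Ok(BaseBuffer::Heap(vec![0u8; size]))
    }
}
```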

Byron linked a pull request May 14, 2023 that will close this issue
Byron added a commit that referenced this issue May 14, 2023
…much. (#851)

Previously it would take a buffer from the free-list, copy data into it, and
when exceeding the capacity lose it entirely.
Now the freelist is handled correctly.
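
A hypothetical sketch of the free-list behaviour this fix implies; the point is only that buffers, even ones that had to grow, are returned for reuse instead of being dropped:

```rust
/// Hypothetical free-list: buffers are handed out for reuse and always
/// returned after use, keeping their (possibly grown) capacity for the next caller.
struct FreeList {
    buffers: Vec<Vec<u8>>,
}

impl FreeList {
    fn take(&mut self) -> Vec<u8> {
        self.buffers.pop().unwrap_or_default()
    }

    fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear();            // drop contents, keep capacity
        self.buffers.push(buf); // previously a buffer that outgrew its capacity could be lost here
    }
}
```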
Byron added a commit that referenced this issue May 14, 2023
Previously, the 64 slot big LRU cache for pack deltas didn't
use any memory limit which could lead to memory exhaustion in the
face of untypical, large objects.

Now we add a generous default limit to do *better* in such situations.
It's worth noting though that even without any cache, the working
set of buffers to do delta resolution takes considerable memory, despite
trying to keep it minimal.

Note that for bigger objects, the cache is now not used at all, which probably
leads to terrible performance as not even the base object can be cached.
Byron added a commit that referenced this issue May 14, 2023
Previously when traversing a pack it could appear to hang as checks
were only performed on chunk or base (of a delta-tree) level.

Now interrupt checks are performed more often to stop all work much quicker.
Byron added a commit that referenced this issue May 14, 2023
Previously, the 64 slot big LRU cache for pack deltas didn't
use any memory limit which could lead to memory exhaustion in the
face of untypical, large objects.

Now we add a generous default limit to do *better* in such situations.
It's worth noting though that even without any cache, the working
set of buffers to do delta resolution takes considerable memory, despite
trying to keep it minimal.

Note that for bigger objects, the cache is now not used at all, which probably
leads to terrible performance as not even the base object can be cached.
Byron added a commit that referenced this issue May 14, 2023
#851)

This change will put more of the delta-chain into the cache
which possibly leads to increased chances of cache-hits if objects
aren't queried in random order, but in pack-offset order.

Note that in general, it tends to be faster to not use any cache at all.

This change was pruned back right away as the major difference to git,
which does it by storing every object of the chain in the cache, is that
we don't share the cache among threads. This leaves a much smaller per-thread
cache size which really is a problem if the objects are large.

So instead of slowing pack access down by trying it, with the default cache being
unsuitable as it would evict all the time due to memory overruns, we do nothing
here and rather improve the performance when dealing with pathological cases
during pack traversal.
Byron added a commit that referenced this issue May 14, 2023
…ry limits. (#851)

Previously the 64 slot LRU cache didn't have any limit, now one is implemented that
defaults to about 96MB.
Byron added a commit that referenced this issue May 14, 2023
Threads started for working on an entry in a slice can now see the amount
of threads left for use (and manipulate that variable) which effectively
allows them to implement their own parallelization on top of the current one.

This is useful if there is very imbalanced work within the slice itself.
Byron added a commit that referenced this issue May 14, 2023
Previously we would use `resize(new_len, 0)` to resize buffers, even though
these values would then be overwritten (or the buffer isn't available).

Now we use `set_len(new_len)` after calling `reserve` to do the same, but save
a memset.
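
A sketch of the pattern described above; the safety requirement is that every byte up to the new length is written before it is read:

```rust
// Sketch of the change described above: avoid the memset done by
// `resize(new_len, 0)` when the bytes are about to be overwritten anyway.
fn prepare_buffer(buf: &mut Vec<u8>, new_len: usize) {
    buf.clear();
    buf.reserve(new_len);
    // SAFETY: `reserve` guarantees capacity >= new_len; the caller must fill
    // all `new_len` bytes (e.g. by decompressing into the buffer) before reading them.
    unsafe { buf.set_len(new_len) };
}
```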
Byron added a commit that referenced this issue May 14, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
Byron added a commit that referenced this issue May 14, 2023
Previously we would use `resize(new_len, 0)` to resize buffers, even though
these values would then be overwritten (or the buffer isn't available).

Now we use `set_len(new_len)` after calling `reserve` to do the same, but save
a memset.
Byron added a commit that referenced this issue May 14, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
Byron added a commit that referenced this issue May 15, 2023
Threads started for working on an entry in a slice can now see the amount
of threads left for use (and manipulate that variable) which effectively
allows them to implement their own parallelization on top of the current one.

This is useful if there is very imbalanced work within the slice itself.
Byron added a commit that referenced this issue May 15, 2023
Previously we would use `resize(new_len, 0)` to resize buffers, even though
these values would then be overwritten (or the buffer isn't available).

Now we use `set_len(new_len)` after calling `reserve` to do the same, but save
a memset.
Byron added a commit that referenced this issue May 15, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
@Byron
Member

Byron commented May 15, 2023

For completeness, here is an analysis of the pack, with the most impressive 1:1139 compression ratio:

objects per delta chain length
         0: 329964
         1: 405
         2: 2057
         3: 6640
         4: 15038
         5: 24296
         6: 33867
         7: 30799
         8: 26628
         9: 21595
        10: 16215
        11: 10958
        12: 8015
        13: 6201
        14: 4951
        15: 3913
        16: 3408
        17: 3002
        18: 2599
        19: 2104
        20: 1865
        21: 1581
        22: 1380
        23: 1245
        24: 1134
        25: 993
        26: 1107
        27: 854
        28: 742
        29: 704
        30: 596
        31: 579
        32: 489
        33: 426
        34: 424
        35: 402
        36: 437
        37: 349
        38: 264
        39: 246
        40: 229
        41: 186
        42: 193
        43: 157
        44: 116
        45: 146
        46: 148
        47: 191
        48: 112
        49: 143
        50: 260
        ->: 570353

averages
        delta chain length:            4;
        decompressed entry [B]:        23891;
        compressed entry [B]:          10818;
        decompressed object size [B]:  12323900;

compression
        compressed entries size       : 6.2 GB
        decompressed entries size     : 13.6 GB
        total object size             : 7.0 TB
        pack size                     : 6.2 GB

        num trees                     : 109975
        num blobs                     : 350403
        num commits                   : 109975
        num tags                      : 0

        compression ratio             : 2.21
        delta compression ratio       : 1139.10
        delta gain                    : 515.84
        pack overhead                 : 0.030%

Created with gix no-repo pack verify ./vigruzki/.git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx -s.

Byron added a commit that referenced this issue May 15, 2023
Threads started for working on an entry in a slice can now see the amount
of threads left for use (and manipulate that variable) which effectively
allows them to implement their own parallelization on top of the current one.

This is useful if there is very imbalanced work within the slice itself.

While at it, we also make consumer functions mutable as they exist per thread.
Byron added a commit that referenced this issue May 15, 2023
Previously we would use `resize(new_len, 0)` to resize buffers, even though
these values would then be overwritten (or the buffer isn't available).

Now we use `set_len(new_len)` after calling `reserve` to do the same, but save
a memset.
Byron added a commit that referenced this issue May 15, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
@Byron
Member

Byron commented May 15, 2023

It's still a little rough around the edges, but thanks to improved parallelization the workload now finishes in 8m45s with 8.5 cores. The number isn't from a clone though, but from something equivalent to the pack-resolution stage.

 19:46:58                                                         Resolving done 570.4k objects in 507.57s (1.1k objects/s)
 19:46:58                                                          Decoding done 7.0TB in 507.57s (13.9GB/s)
real 509.29
user 4150.72
sys 311.61
         11928485888  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            27353928  page reclaims
              387963  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              296439  voluntary context switches
             4734483  involuntary context switches
      16713323515677  instructions retired
      12897346091042  cycles elapsed
         16174221056  peak memory footprint

Peak memory is relative to the amount of parallelism, so the usage is definitely higher than git's, but I guess that limiting the number of cores to 5 will still be much faster than git at similar memory usage.

Byron added a commit that referenced this issue May 15, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
@0xdeafbeef
Author

Now it's 3 times faster :)

 21:51:41     Resolving done 570.7k objects in 1131.33s (504.0 objects/s)
 21:51:41      Decoding done 7.0TB in 1131.33s (6.2GB/s)
 21:51:42 writing index file done 19.4MB in 0.02s (878.6MB/s)
 21:51:42  create index file done 570.7k objects in 1545.90s (369.0 objects/s)
 21:51:42           checkout done 5.0 files in 0.33s (15.0 files/s)
 21:51:42            writing done 243.2MB in 0.33s (737.4MB/s)
HEAD:refs/remotes/origin/HEAD (implicit)
        561964c901732259ea0dde3eefd112f4a3fcfe6b HEAD symref-target:refs/heads/main -> refs/remotes/origin/HEAD [new]
+refs/heads/*:refs/remotes/origin/*
        561964c901732259ea0dde3eefd112f4a3fcfe6b refs/heads/main -> refs/remotes/origin/main [new]
        Command being timed: "gix clone https://github.com/fz139/vigruzki.git"
        User time (seconds): 8709.54
        System time (seconds): 2091.96
        Percent of CPU this job got: 690%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 26:05.23
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 32113600
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 553262617
        Voluntary context switches: 2765411
        Involuntary context switches: 526050
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

It's still pretty interesting that you get 13.9GB/s decode speed. On which hardware did you test?
I've checked with RUSTFLAGS='-C target-cpu=native', but it has the same speed.

@Byron
Member

Byron commented May 16, 2023

Three times isn't as much as I would have hoped for with a 32-core machine, but then again, maybe it is something about the hardware. I am using an M1 Pro and it did 'kick like a mule' in this workload, with the highest number I have ever seen due to overcommitting (something that shouldn't have happened, but there might be something off about the thread-counting). Maybe it's that good because of the high memory throughput of 200GB/s? Knowing the delta-apply algorithm, it's clear why it would be so intense. SHA1 performance is also important for this workload though, and typically the bottleneck.

Maybe you can run some tests (once the PR is finalized today) with different thread-counts to see what happens - maybe some CPUs bottleneck elsewhere and prefer lower core-counts. At least running it with fewer cores will keep the memory usage down, as 28GB seems a bit excessive, even though there is nothing that can be done about it since it's strictly needed per-thread working memory, which balloons thanks to the massive file sizes.


Edit: I fixed the over-allocation issue and maybe that even made it a little slower on my machine (but better to leave over-commits in the hands of the user and the -t flag I think), but on the bright side that returns control over how much memory is used by adjusting the thread count.

 09:48:01                                                         Resolving done 570.4k objects in 522.24s (1.1k objects/s)
 09:48:01                                                          Decoding done 7.0TB in 522.24s (13.5GB/s)
real 523.84
user 4073.76
sys 282.95
          9921167360  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            22555056  page reclaims
              389731  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              188018  voluntary context switches
             3826566  involuntary context switches
      16635024776616  instructions retired
      12779135032432  cycles elapsed
         15132148096  peak memory footprint

Despite some details still to be sorted out, I think from a performance point of view this is the final result. The above stems from cargo build --release --bin gix --no-default-features --features=max,cache-efficiency-debug && /usr/bin/time -lp target/release/gix -t10 --strict no-repo pack verify -a less-time tests/fixtures/repos/vigruzki/.git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx (it's like a clone, but skips fetching the pack and writing the index), and it allows one to see how much memory is required for 10 threads just to decompress the whole pack, all 7TB of it.

To get memory usage down to about git level, one can run it with -t5 to reduce memory requirements, while still being faster than git by merit of decompressing the tree without any waste.

✦ ❯ cargo build --release --bin gix --no-default-features --features=max,cache-efficiency-debug  && /usr/bin/time -lp target/release/gix -t5 --strict no-repo pack verify -a less-time tests/fixtures/repos/vigruzki/.git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx
   Compiling gix-pack v0.35.0 (/Users/byron/dev/github.com/Byron/gitoxide/gix-pack)
   Compiling gix-odb v0.45.0 (/Users/byron/dev/github.com/Byron/gitoxide/gix-odb)
   Compiling gix v0.44.1 (/Users/byron/dev/github.com/Byron/gitoxide/gix)
   Compiling gitoxide-core v0.27.0 (/Users/byron/dev/github.com/Byron/gitoxide/gitoxide-core)
   Compiling gitoxide v0.25.0 (/Users/byron/dev/github.com/Byron/gitoxide)
    Finished release [optimized] target(s) in 1m 48s
 09:58:31 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.01s (1.3GB/s)
 09:58:31                                           collecting sorted index done 570.4k entries in 0.07s (8.3M entries/s)
 09:58:32                                                          indexing done 570.4k objects in 0.79s (725.3k objects/s)
 09:58:38 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 6.30s (979.6MB/s)
 10:12:37                                                         Resolving done 570.4k objects in 845.02s (674.0 objects/s)                                                                                                 
 10:12:37                                                          Decoding done 7.0TB in 845.02s (8.3GB/s)                                                                                                                  
real 846.12
user 3804.85
sys 246.92
          6831898624  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            20776724  page reclaims
              379765  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              144781  voluntary context switches
             1480049  involuntary context switches
      16579642418071  instructions retired
      12406286378731  cycles elapsed
         14017509056  peak memory footprint

I had to try harder to stay within the 5GB memory envelope and didn't manage even with -t3.

✦ ❯ /usr/bin/time -lp target/release/gix -t3 --strict no-repo pack verify -a less-time tests/fixtures/repos/vigruzki/.git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx
 10:17:11 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.01s (1.4GB/s)
 10:17:11                                           collecting sorted index done 570.4k entries in 0.06s (9.9M entries/s)
 10:17:12                                                          indexing done 570.4k objects in 0.96s (596.4k objects/s)
 10:17:16 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 4.37s (1.4GB/s)
 10:38:58                                                         Resolving done 570.4k objects in 1305.98s (436.0 objects/s)
 10:38:58                                                          Decoding done 7.0TB in 1305.98s (5.4GB/s)
real 1307.11
user 3646.71
sys 213.30
          7691157504  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            17130302  page reclaims
              379928  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               61664  voluntary context switches
              390798  involuntary context switches
      16515947213704  instructions retired
      12141757486965  cycles elapsed
          9742240896  peak memory footprint

This definitely has a strong flavor of non-determinism as -t3 seems to be worse than -t5.
The reason for that might be that it puts base-buffers for deltas onto a stack, and if there are lots of deltas to apply on different levels, there might be plenty of these on the stack at the same time. However, with further tuning the memory usage was reduced, possibly at a slight decrease in performance. It is a question, though, whether peak memory is also to some extent non-deterministic, based on which part of the tree is currently traversed and by how many threads.
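
A simplified sketch of that traversal shape (hypothetical types, not the gix-pack data structures): every resolved base whose subtree isn't finished yet stays on the stack, so deep and branching trees keep several large buffers alive at once.

```rust
struct Node {
    delta: Vec<u8>,      // compressed delta for this entry
    children: Vec<Node>, // entries that use this one as their base
}

fn traverse(root_base: Vec<u8>, root_children: Vec<Node>) {
    // Each stack entry pairs a fully resolved base with the children still to process.
    let mut stack = vec![(root_base, root_children)];
    while let Some((base, children)) = stack.pop() {
        for child in children {
            let data = apply_delta(&base, &child.delta); // hypothetical
            // ... hash/verify `data` here ...
            stack.push((data, child.children));
        }
        // `base` is dropped here, but every child base pushed above stays
        // alive until its own subtree is done - that is the growth described above.
    }
}

fn apply_delta(_base: &[u8], _delta: &[u8]) -> Vec<u8> {
    Vec::new() // placeholder
}
```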

 10:53:37 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.03s (748.0MB/s)
 10:53:37                                           collecting sorted index done 570.4k entries in 0.08s (7.3M entries/s)
 10:53:38                                                          indexing done 570.4k objects in 1.16s (491.1k objects/s)
 10:53:43 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 5.32s (1.2GB/s)
 11:15:41                                                         Resolving done 570.4k objects in 1322.25s (431.0 objects/s)
 11:15:41                                                          Decoding done 7.0TB in 1322.25s (5.3GB/s)
real 1323.84
user 3605.53
sys 307.85
          5979684864  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            47652772  page reclaims
              373634  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               86723  voluntary context switches
              649922  involuntary context switches
      16323271351369  instructions retired
      12258214428702  cycles elapsed
          9763098368  peak memory footprint

And even though memory usage is still all over the place, another tweak later and the performance has more than recovered.

 11:42:41 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.02s (1.1GB/s)
 11:42:41                                           collecting sorted index done 570.4k entries in 0.08s (7.3M entries/s)
 11:42:42                                                          indexing done 570.4k objects in 1.17s (485.6k objects/s)
 11:42:46 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 5.46s (1.1GB/s)
 12:03:49                                                         Resolving done 570.4k objects in 1267.26s (450.0 objects/s)
 12:03:49                                                          Decoding done 7.0TB in 1267.26s (5.5GB/s)
real 1268.83
user 3538.17
sys 240.75
          8157265920  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            29945635  page reclaims
              368716  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               32305  voluntary context switches
              226891  involuntary context switches
      15921477517661  instructions retired
      11912481385856  cycles elapsed
          9325478592  peak memory footprint

There is one more thing I can imagine having a positive impact on memory usage, but it's a bit of a tough sell with current Rust, where closures can't return data they own (or reference, for that matter).

Another tweak later and it's now possible to avoid yet another data copy in the hopes that it reduces the stress on the allocator, but apparently to small effect except that it might have gotten a little faster. Here is the -t10 version as worst-case scenario for memory:

 12:59:44 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.02s (1.1GB/s)
 12:59:44                                           collecting sorted index done 570.4k entries in 0.08s (7.2M entries/s)
 12:59:45                                                          indexing done 570.4k objects in 1.16s (491.2k objects/s)
 12:59:52 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 7.90s (781.0MB/s)
 13:08:20                                                         Resolving done 570.4k objects in 514.83s (1.1k objects/s)
 13:08:20                                                          Decoding done 7.0TB in 514.83s (13.7GB/s)
real 516.46
user 3915.12
sys 392.39
          9803333632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            53400785  page reclaims
              389493  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              227742  voluntary context switches
             5251277  involuntary context switches
      16126634767788  instructions retired
      12609510150035  cycles elapsed
         15450572480  peak memory footprint

And here is the same but with -t3, and now all options are exhausted.

 13:31:40                                                         Resolving done 570.4k objects in 1282.38s (444.0 objects/s)
 13:31:40                                                          Decoding done 7.0TB in 1282.38s (5.5GB/s)
real 1283.48
user 3543.17
sys 260.62
          5911642112  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            32669010  page reclaims
              382137  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   2  signals received
               40587  voluntary context switches
              545854  involuntary context switches
      15808696627745  instructions retired
      11917709163368  cycles elapsed
          9302280128  peak memory footprint

Trying once more with even more optimal memory allocation, it got faster even though overall, the peak-usage didn't change much (-t10).

 14:43:03 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.03s (720.7MB/s)
 14:43:03                                           collecting sorted index done 570.4k entries in 0.08s (7.0M entries/s)
 14:43:04                                                          indexing done 570.4k objects in 0.86s (665.7k objects/s)
 14:43:10 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 7.81s (790.1MB/s)
 14:51:25                                                         Resolving done 570.4k objects in 501.84s (1.1k objects/s)
 14:51:25                                                          Decoding done 7.0TB in 501.84s (14.0GB/s)
real 503.17
user 3929.11
sys 384.25
         10464952320  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            52856179  page reclaims
              392649  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              197228  voluntary context switches
             4572222  involuntary context switches
      16119780233179  instructions retired
      12598385915260  cycles elapsed
         16245852096  peak memory footprint

The same with -t3:

✦ ❯ /usr/bin/time -lp target/release/gix -t3 --strict no-repo pack verify -a less-time tests/fixtures/repos/vigruzki/.git/objects/pack/pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx
 14:52:48 Hash of index 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.idx' done 19.4MB in 0.02s (1.1GB/s)
 14:52:48                                           collecting sorted index done 570.4k entries in 0.06s (9.2M entries/s)
 14:52:49                                                          indexing done 570.4k objects in 0.99s (576.9k objects/s)
 14:52:52 Hash of pack 'pack-20b60ba716f8fca721e463d06cdf056e78b4ba02.pack' done 6.2GB in 4.45s (1.4GB/s)
 15:14:04                                                         Resolving done 570.4k objects in 1274.99s (447.0 objects/s)
 15:14:04                                                          Decoding done 7.0TB in 1274.99s (5.5GB/s)
real 1276.15
user 3528.30
sys 251.79
          8180482048  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            31607126  page reclaims
              384910  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               67427  voluntary context switches
              444719  involuntary context switches
      15801255809165  instructions retired
      11881250335914  cycles elapsed
          9669608768  peak memory footprint

Byron added a commit that referenced this issue May 16, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
Byron added a commit that referenced this issue May 16, 2023
When delta-trees are unbalanced, in pathological cases it's possible that one thread
ends up with more than half of the work. In this case it's required that it manages to
spawn its own threads to parallelize the work it has.
@Byron
Member

Byron commented May 16, 2023

@0xdeafbeef I'd be very interested in seeing what the latest main branch can do for your performance and memory consumption given all cores, and maybe also three. Memory consumption should be better all around, and performance should be through the roof, using all cores at all times. To some extent, memory consumption can be adjusted by reducing the number of threads. Thank you!

@Byron
Member

Byron commented May 16, 2023

It looks like this also caused a regression 😅. Here is how fast it was:

gitoxide (4f879bf) [$] took 1m57s
❯ cargo build --release && /usr/bin/time -lp ./target/release/gix --strict no-repo pack verify -a less-time ~/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux/.git/objects/pack/pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.idx
    Finished release [optimized] target(s) in 0.26s
 22:02:00 Hash of index 'pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.idx' done 278.9MB in 0.17s (1.7GB/s)
 22:02:01                                           collecting sorted index done 9.1M entries in 0.94s (9.7M entries/s)
 22:02:02 Hash of pack 'pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.pack' done 2.6GB in 1.48s (1.8GB/s)
 22:02:02                                                          indexing done 9.1M objects in 1.00s (9.1M objects/s)
 22:02:13                                                         Resolving done 9.1M objects in 10.43s (872.1k objects/s)
 22:02:13                                                          Decoding done 118.9GB in 10.43s (11.4GB/s)
real 12.65
user 98.79
sys 4.46
          4572413952  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              337395  page reclaims
                  11  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  87  voluntary context switches
               92037  involuntary context switches
        495860894234  instructions retired
        296769316300  cycles elapsed
          1447909120  peak memory footprint

And here is how slow it is now:

❯ /usr/bin/time -lp gix --strict no-repo pack verify -a less-time ~/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux/.git/objects/pack/pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.idx
 21:52:45 Hash of index 'pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.idx' done 278.9MB in 0.16s (1.7GB/s)
 21:52:46                                           collecting sorted index done 9.1M entries in 0.93s (9.8M entries/s)
 21:52:47 Hash of pack 'pack-3498b2b93a0b0a4c1ebb65bd130c57c485d4a28e.pack' done 2.6GB in 1.47s (1.8GB/s)
 21:52:47                                                          indexing done 9.1M objects in 0.98s (9.3M objects/s)
 21:53:07                                                         Resolving done 9.1M objects in 20.06s (453.5k objects/s)
 21:53:07                                                          Decoding done 118.9GB in 20.06s (5.9GB/s)
real 22.13
user 98.23
sys 90.59
          4545511424  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              338725  page reclaims
                  25  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  76  voluntary context switches
              108671  involuntary context switches
        568773121684  instructions retired
        544963589983  cycles elapsed
          1447892800  peak memory footprint

It seems that small-object performance suffered, but I am optimistic that it is possible to get that performance back. It's probably related to how buffers are handled now.

@0xdeafbeef
Author

Interestingly, it doesn't scale well with CPU count:

 Decoding done 7.0TB in 800.58s (8.8GB/s)

        Command being timed: "gix --strict no-repo pack verify .git/objects/pack/pack-1ba84aeb43e8825baef6cd1793884429c615fb3d.idx"
        User time (seconds): 12560.84
        System time (seconds): 4576.85
        Percent of CPU this job got: 2139%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 13:21.07
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 37309296
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 576562513
        Voluntary context switches: 851821
        Involuntary context switches: 1114095
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
 Decoding done 7.0TB in 791.27s   (8.9GB/s)

        Command being timed: "gix -t 22 --strict no-repo pack verify .git/objects/pack/pack-1ba84aeb43e8825baef6cd1793884429c615fb3d.idx"
        User time (seconds): 10666.94
        System time (seconds): 3421.53
        Percent of CPU this job got: 1779%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 13:11.74
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 26552224
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 571258953
        Voluntary context switches: 670522
        Involuntary context switches: 448624
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
     Decoding done 7.0TB in 769.04s (9.2GB/s)
        Command being timed: "gix -t 12 --strict no-repo pack verify .git/objects/pack/pack-1ba84aeb43e8825baef6cd1793884429c615fb3d.idx"
        User time (seconds): 6607.34
        System time (seconds): 1778.18
        Percent of CPU this job got: 1089%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 12:49.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21323724
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 567786754
        Voluntary context switches: 358802
        Involuntary context switches: 230123
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
 Decoding done 7.0TB in 779.73s (9.0GB/s)
        Command being timed: "gix -t 8 --strict no-repo pack verify .git/objects/pack/pack-1ba84aeb43e8825baef6cd1793884429c615fb3d.idx"
        User time (seconds): 4787.99
        System time (seconds): 1230.45
        Percent of CPU this job got: 771%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 13:00.40
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21093392
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 542195949
        Voluntary context switches: 188702
        Involuntary context switches: 125745
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
Decoding done 7.0TB in 1286.63s (5.5GB/s)
        Command being timed: "gix -t3 --strict no-repo pack verify .git/objects/pack/pack-1ba84aeb43e8825baef6cd1793884429c615fb3d.idx"
        User time (seconds): 3247.05
        System time (seconds): 595.46
        Percent of CPU this job got: 298%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 21:27.05
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 14095200
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 554682581
        Voluntary context switches: 74951
        Involuntary context switches: 72675
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@Byron
Member

Byron commented May 17, 2023

Thanks for sharing the measurements! It's definitely interesting to see how the sweet spot seems to be around 12 threads, with the most efficient option being 8 threads or so. I think the CPU might be bottlenecking somewhere, as from my observations the work-stealing and the synchronization around it isn't a problem at just ~200 objects per second, with each sync taking more than one piece of work from the queue. On a Ryzen, I have seen it scale to 35GB/s decoding speed, which, yes, means it can decode 100GB of kernel code in just about 3.3s 😅.

What's more alarming is the memory usage, which is way too high even with -t3. On my system, I'm seeing vastly different numbers there, which probably means the system allocator deals very differently with this kind of workload. Since this is an important operation, consistent quality would definitely be valuable, so I will see if using jemalloc is the right thing to do here.

@Byron
Member

Byron commented May 17, 2023

It looks like jemalloc doesn't build on all targets that gitoxide supports, so it's not something that can ship by default. In my tests, it also seemed to be a bit worse than the system allocator in terms of peak memory usage, but I would expect it to at least be consistent. In any case, it seems that there is not much else that can be done about memory usage, even though it should improve slightly once #854 is merged due to less trashing/more reuse where possible.
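
For reference, opting into jemalloc in a Rust binary is just a global-allocator swap; here is a sketch assuming the tikv-jemallocator crate (gitoxide itself does not ship this by default, as noted above):

```rust
// Cargo.toml would need something like: tikv-jemallocator = "0.5" (assumed version)
use tikv_jemallocator::Jemalloc;

// Route all heap allocations in this binary through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // allocation-heavy work, e.g. pack resolution, now uses jemalloc
}
```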

@Byron
Member

Byron commented May 19, 2023

I finally managed to get another measurement on a 128-core Intel machine with 256GB of RAM.

 13:19:01 read pack done 6.7GB in 210.47s (31.7MB/s)
 13:19:01  indexing done 571.0k objects in 197.04s (2.9k objects/s)
 13:19:01 decompressing done 14.6GB in 197.04s (74.1MB/s)
 13:21:44     Resolving done 571.0k objects in 163.10s (3.5k objects/s)
 13:21:44      Decoding done 7.0TB in 163.10s (43.2GB/s)
 13:21:44 writing index file done 19.4MB in 0.03s (636.5MB/s)
 13:21:44  create index file done 571.0k objects in 360.29s (1.6k objects/s)
 13:21:45           checkout done 5.0 files in 0.45s (11.0 files/s)
 13:21:45            writing done 244.5MB in 0.45s (539.5MB/s)
HEAD:refs/remotes/origin/HEAD (implicit)
        752a39c3c7469272dbbf8797b782a0556e55ea39 HEAD symref-target:refs/heads/main -> refs/remotes/origin/HEAD [new]
+refs/heads/*:refs/remotes/origin/*
        752a39c3c7469272dbbf8797b782a0556e55ea39 refs/heads/main -> refs/remotes/origin/main [new]
        Command being timed: "./target/release/gix clone https://github.com/fz139/vigruzki.git"
        User time (seconds): 6903.68
        System time (seconds): 615.83
        Percent of CPU this job got: 2006%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.67
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 31378692
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 28
        Minor (reclaiming a frame) page faults: 8583053
        Voluntary context switches: 2670209
        Involuntary context switches: 10535
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Performance isn't as crazy as it probably should be, but fortunately peak memory usage isn't proportional to the core count either. The major number here is 43.2GB/s, which is three times faster than my M1 Pro, so roughly equivalent to 30 cores. Since it's a cloud machine, it might be throttled in various ways though.
