Speed & binary size comparison for Clang binary #25

Open
marxin opened this issue Aug 8, 2024 · 11 comments

@marxin
Contributor

marxin commented Aug 8, 2024

Kudos to the project! I can now link the clang-20 binary, which is part of the LLVM project and is quite large (~190 MiB without debug info). The binary can be built easily with the following series of commands:

$ git clone --depth=1 git@github.com:llvm/llvm-project.git
$ cd llvm-project
$ mkdir build
$ cd build
$ cmake -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles" ../llvm
$ make -j24 clang

One needs to trim the -rpath linker flags to link the binary with Wild. The speed results on a 24-thread AMD Ryzen 9 7900X are the following (using --no-fork for Mold):

Benchmark 1: Wild (clang)
  Time (mean ± σ):     211.4 ms ±   7.0 ms    [User: 837.9 ms, System: 1275.5 ms]
  Range (min … max):   199.4 ms … 223.6 ms    14 runs
 
Benchmark 2: Mold (clang)
  Time (mean ± σ):     191.7 ms ±  15.6 ms    [User: 1721.1 ms, System: 1043.5 ms]
  Range (min … max):   175.4 ms … 230.0 ms    12 runs
 
Benchmark 3: BFD ld (clang)
  Time (mean ± σ):      2.470 s ±  0.104 s    [User: 1.521 s, System: 0.947 s]
  Range (min … max):    2.388 s …  2.686 s    10 runs
 
Benchmark 4: LLD (clang)
  Time (mean ± σ):     300.4 ms ±  11.1 ms    [User: 610.6 ms, System: 1218.5 ms]
  Range (min … max):   284.4 ms … 324.7 ms    10 runs
 
Summary
  Mold (clang) ran
    1.10 ± 0.10 times faster than Wild (clang)
    1.57 ± 0.14 times faster than LLD (clang)
   12.88 ± 1.18 times faster than BFD ld (clang)

The results are promising for the Wild linker, which is on par with Mold. The same holds for binary size (both file size and VM size, as reported by the bloaty tool):

✦ ❯ bloaty clang-20.mold
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  53.2%   107Mi  63.0%   107Mi    .text
  18.3%  37.2Mi  21.7%  37.2Mi    .rodata
  11.9%  24.1Mi   0.0%       0    .strtab
   4.8%  9.76Mi   5.7%  9.76Mi    .eh_frame
   4.0%  8.20Mi   0.0%       0    .symtab
   2.2%  4.40Mi   2.6%  4.40Mi    .dynstr
   2.1%  4.16Mi   2.4%  4.16Mi    .data.rel.ro
   0.9%  1.80Mi   1.1%  1.80Mi    .rodata.str1.1
   0.7%  1.51Mi   0.9%  1.51Mi    .rodata.str1.8
   0.6%  1.28Mi   0.7%  1.28Mi    .dynsym
   0.6%  1.17Mi   0.7%  1.17Mi    .eh_frame_hdr
   0.0%       0   0.4%   712Ki    .bss
   0.2%   435Ki   0.2%   435Ki    .hash
   0.2%   371Ki   0.2%   371Ki    .gnu.hash
   0.2%   326Ki   0.2%   326Ki    .data
   0.1%   108Ki   0.1%   108Ki    .gnu.version
   0.0%  90.4Ki   0.1%  90.4Ki    .rodata.cst16
   0.0%  43.7Ki   0.0%  43.7Ki    .got
   0.0%  41.2Ki   0.0%  41.2Ki    .rela.dyn
   0.0%  29.0Ki   0.0%  24.8Ki    [33 Others]
   0.0%  12.3Ki   0.0%  12.3Ki    .rodata.cst8
 100.0%   202Mi 100.0%   171Mi    TOTAL
✦ ❯ bloaty clang-20.wild
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  56.0%   102Mi  55.8%   102Mi    .text
  22.3%  40.8Mi  22.2%  40.8Mi    .rodata
   0.0%      40  13.7%  25.2Mi    [LOAD #2 [R]]
  11.2%  20.5Mi   0.0%       0    .strtab
   4.9%  8.97Mi   4.9%  8.97Mi    .eh_frame
   2.5%  4.61Mi   0.0%       0    .symtab
   2.4%  4.42Mi   2.4%  4.42Mi    .data
   0.6%  1.02Mi   0.6%  1.02Mi    .eh_frame_hdr
   0.0%       0   0.4%   700Ki    .bss
   0.0%  43.5Ki   0.0%  43.5Ki    .rela.dyn
   0.0%  8.12Ki   0.0%  8.12Ki    .dynstr
   0.0%  7.59Ki   0.0%  7.59Ki    .dynsym
   0.0%  5.99Ki   0.0%  5.99Ki    .got
   0.0%  4.99Ki   0.0%  4.99Ki    .init_array
   0.0%  4.53Ki   0.0%  4.53Ki    .plt
   0.0%  1.94Ki   0.0%  1.94Ki    [ELF Section Headers]
   0.0%  1.27Ki   0.0%     761    [13 Others]
   0.0%  1.11Ki   0.0%       0    .debug_str
   0.0%     656   0.0%     656    .gnu.version_r
   0.0%     652   0.0%     652    .gnu.version
   0.0%     512   0.0%     512    .dynamic
 100.0%   182Mi 100.0%   183Mi    TOTAL
✦ ❯ bloaty clang-20.bfd
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  54.7%   107Mi  63.1%   107Mi    .text
  20.5%  40.4Mi  23.6%  40.4Mi    .rodata
  11.1%  21.8Mi   0.0%       0    .strtab
   4.9%  9.74Mi   5.7%  9.74Mi    .eh_frame
   2.6%  5.15Mi   0.0%       0    .symtab
   2.2%  4.40Mi   2.6%  4.40Mi    .dynstr
   2.1%  4.16Mi   2.4%  4.16Mi    .data.rel.ro
   0.6%  1.28Mi   0.7%  1.28Mi    .dynsym
   0.6%  1.17Mi   0.7%  1.17Mi    .eh_frame_hdr
   0.0%       0   0.4%   712Ki    .bss
   0.2%   408Ki   0.2%   408Ki    .gnu.hash
   0.2%   346Ki   0.2%   346Ki    .hash
   0.2%   326Ki   0.2%   326Ki    .data
   0.1%   109Ki   0.1%   109Ki    .gnu.version
   0.0%  41.8Ki   0.0%  41.8Ki    .rela.dyn
   0.0%  11.1Ki   0.0%  4.92Ki    [26 Others]
   0.0%  7.75Ki   0.0%       0    [Unmapped]
   0.0%  7.03Ki   0.0%  7.03Ki    .rela.plt
   0.0%  4.99Ki   0.0%  4.95Ki    .init_array
   0.0%  4.70Ki   0.0%  4.70Ki    .plt
   0.0%  3.83Ki   0.0%  3.83Ki    .got
 100.0%   197Mi 100.0%   170Mi    TOTAL

That said, I think the Wild linker is doing well so far. What would be more interesting is support for debug info, where a binary like Clang reaches ~4.3 GiB and linking takes ~3 seconds with the Mold linker (BFD is desperately slow).

This issue is not really an issue; feel free to close it at any time ;)

@davidlattimore
Owner

Thanks for trying this out, as well as for the other issues you filed and the sponsorship! There are so many things to work on that it can sometimes be hard to pick the next thing to tackle. It would indeed be interesting to see a comparison with debug info, so that's good feedback.

One area where I've seen Wild perform less well is when there are lots of input objects. There's some bookkeeping that Wild does per object, and some of that bookkeeping is done on a single thread. I've spent the last week on some refactorings so that I can eventually do this bookkeeping on groups of objects rather than individual objects. I'm hopeful that this will improve performance in cases like this.
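
A minimal sketch of the grouping idea, assuming a rayon-style worker pool (the InputObject type and process_group function are illustrative, not Wild's actual API):

use rayon::prelude::*;

struct InputObject {
    symbols: Vec<String>, // stand-in for real per-object bookkeeping state
}

// Doing the bookkeeping once per group amortises the fixed per-task
// costs over many objects instead of paying them for every object.
fn process_group(group: &[InputObject]) {
    for obj in group {
        let _ = obj.symbols.len(); // ... real per-object work goes here ...
    }
}

fn process_inputs(objects: &[InputObject], group_size: usize) {
    // One parallel task per chunk of `group_size` objects (must be > 0).
    objects.par_chunks(group_size).for_each(process_group);
}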

@davidlattimore
Owner

I've finished the changes I mentioned to process files in groups. It indeed resulted in a speedup for linking a dev build of rustc, which was the case where I'd noticed that the overhead was significant.

Benchmark 1: grouped
  Time (mean ± σ):     372.8 ms ±   5.4 ms    [User: 1491.8 ms, System: 469.9 ms]
  Range (min … max):   367.4 ms … 381.1 ms    10 runs
 
Benchmark 2: ungrouped
  Time (mean ± σ):     480.7 ms ±   7.9 ms    [User: 1635.8 ms, System: 547.5 ms]
  Range (min … max):   464.2 ms … 490.1 ms    10 runs
 
Summary
  'grouped' ran
    1.29 ± 0.03 times faster than 'ungrouped'

It didn't speed up linking clang by nearly as much, but did speed it up a bit:

Benchmark 1: grouped
  Time (mean ± σ):     272.9 ms ±   5.1 ms    [User: 927.1 ms, System: 430.6 ms]
  Range (min … max):   266.3 ms … 280.8 ms    10 runs
 
Benchmark 2: ungrouped
  Time (mean ± σ):     292.0 ms ±   6.5 ms    [User: 956.2 ms, System: 436.1 ms]
  Range (min … max):   283.0 ms … 303.0 ms    10 runs
 
Summary
  'grouped' ran
    1.07 ± 0.03 times faster than 'ungrouped'

Something interesting is that the relative timings I get differ from yours. For example, on my laptop, linking clang with Wild is 25% faster than with Mold before the grouping changes I just pushed, and 32% faster after them. Whereas for you, without the grouping changes, Mold is 10% faster than Wild. I assume this is due to the fairly different machines we're testing on. I'm testing on a few-year-old Intel laptop with 4 cores (8 threads), whereas your test machine presumably has 12 cores (24 threads). So this suggests there's more I need to do in Wild to take maximum advantage of multiple cores.

There is one spot in the code where I allocate and initialise a large vec on a single thread. I've got some ideas that should allow the vec to be left uninitialised, deferring initialisation to the separate threads.
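
One way to sketch that in Rust, assuming rayon (illustrative only, not Wild's actual code):

use rayon::prelude::*;
use std::mem::MaybeUninit;

// Allocate a large vec without initialising it, then let worker threads
// initialise disjoint chunks in parallel before declaring it live.
fn build_table(len: usize, chunk_size: usize) -> Vec<u64> {
    let mut v: Vec<u64> = Vec::with_capacity(len);
    let spare: &mut [MaybeUninit<u64>] = &mut v.spare_capacity_mut()[..len];
    spare
        .par_chunks_mut(chunk_size) // chunk_size must be > 0
        .enumerate()
        .for_each(|(i, slots)| {
            for (j, slot) in slots.iter_mut().enumerate() {
                slot.write((i * chunk_size + j) as u64); // placeholder init
            }
        });
    // SAFETY: every one of the `len` slots was written above.
    unsafe { v.set_len(len) };
    v
}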

@marxin
Contributor Author

marxin commented Aug 9, 2024

Something interesting is that the relative timings I get differ from yours. For example, on my laptop, linking clang with Wild is 25% faster than with Mold before the grouping changes I just pushed, and 32% faster after them. Whereas for you, without the grouping changes, Mold is 10% faster than Wild. I assume this is due to the fairly different machines we're testing on. I'm testing on a few-year-old Intel laptop with 4 cores (8 threads), whereas your test machine presumably has 12 cores (24 threads). So this suggests there's more I need to do in Wild to take maximum advantage of multiple cores.

I can confirm your latest modifications made linking clang faster:

  Time (mean ± σ):     180.1 ms ±  14.1 ms    [User: 715.1 ms, System: 1265.6 ms]
  Range (min … max):   162.8 ms … 200.8 ms    15 runs

which is an improvement of almost 20% on my machine 🎉

Here is a head-to-head comparison of Wild and Mold for various CPU core counts (I used the --threads option for Mold and taskset for Wild):
[Screenshot: Wild vs. Mold link times across CPU core counts]

This shows that both linkers perform similarly across various numbers of CPU cores.

@marxin
Contributor Author

marxin commented Aug 9, 2024

There are so many things to work on that it can sometimes be hard to pick the next thing to tackle.

I share your feelings! Yeah, modern linkers have plenty of features, support various hardware targets, and run on multiple operating systems.

Personally, I think adding support for debug info should not be difficult (if you don't want to parse DWARF for things like the GDB index); the only necessary piece is support for tombstones for dead symbols (I guess). That would show the full-speed potential of Wild, as debug info tends to occupy a huge portion of a binary's size. Plus, just recently, Mold added yet another background trick that postpones debug info streaming to a separate process (--separate-debug-file). A second interesting option would be introducing incremental linking in the context of Rust projects. I am really curious whether it could improve the edit-compile-debug cycle ❓
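
For reference, the tombstone trick as LLD and Mold implement it (a hypothetical sketch, not Wild code) is roughly: when a relocation in a non-allocated debug section refers to a symbol in a discarded section, write a reserved value instead of an address:

// Values used by LLD-style tombstoning: -1 in most debug sections, and
// -2 in .debug_loc/.debug_ranges, where -1 already means a
// base-address-selection entry.
const TOMBSTONE: u64 = u64::MAX;
const TOMBSTONE_LOC_RANGES: u64 = u64::MAX - 1;

fn resolve_debug_reloc(section: &str, symbol_discarded: bool, address: u64) -> u64 {
    if !symbol_discarded {
        return address;
    }
    // Debuggers skip these reserved values instead of misreading
    // whatever bytes now live at the dead symbol's old address.
    if section == ".debug_loc" || section == ".debug_ranges" {
        TOMBSTONE_LOC_RANGES
    } else {
        TOMBSTONE
    }
}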

@marxin
Contributor Author

marxin commented Aug 12, 2024

Time (mean ± σ): 180.1 ms ± 14.1 ms [User: 715.1 ms, System: 1265.6 ms]

After the recent changes, Wild is doing even better:

Time (mean ± σ): 160.2 ms ± 6.2 ms [User: 816.2 ms, System: 1243.9 ms]

@davidlattimore
Owner

Nice! Unfortunately, I regressed one benchmark with the earlier grouping-of-files change, since I end up with lots of large files in one shard. I'm now working on making sharding more flexible, so that I can have a variable number of files in each shard and base the shard size on the number of symbols rather than the number of objects. A sketch of what I mean is below.
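
A sketch of symbol-count-based sharding (illustrative types, not Wild's actual code):

struct Input {
    name: String,
    num_symbols: usize,
}

// Greedily pack inputs into shards of roughly `target_symbols` symbols,
// so one shard full of huge files no longer dominates the runtime.
fn shard_by_symbols(inputs: Vec<Input>, target_symbols: usize) -> Vec<Vec<Input>> {
    let mut shards = Vec::new();
    let mut current = Vec::new();
    let mut count = 0;
    for input in inputs {
        count += input.num_symbols;
        current.push(input);
        if count >= target_symbols {
            shards.push(std::mem::take(&mut current));
            count = 0;
        }
    }
    if !current.is_empty() {
        shards.push(current);
    }
    shards
}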

It's interesting how different the split between user and system CPU time is on your system compared to mine. For me, all the linkers show more user time and less system time, whereas for you it's the other way around. E.g. this is what it looks like for me:

Benchmark 1: wild
  Time (mean ± σ):     258.6 ms ±   6.3 ms    [User: 964.9 ms, System: 454.6 ms]
  Range (min … max):   250.2 ms … 269.7 ms    11 runs
 
Benchmark 2: mold
  Time (mean ± σ):     387.7 ms ±  10.9 ms    [User: 1907.0 ms, System: 389.0 ms]
  Range (min … max):   372.1 ms … 414.0 ms    10 runs
 
Summary
  'wild' ran
    1.50 ± 0.06 times faster than 'mold'

I guess it's probably due to differences in our systems, e.g. kernel, CPU, filesystem, etc.

@marxin
Contributor Author

marxin commented Aug 12, 2024

It's related to the number of CPU cores used. More cores, more synchronization work, I guess:

4 cores = Time (mean ± σ): 252.6 ms ± 13.4 ms [User: 385.0 ms, System: 238.5 ms]
8 cores = Time (mean ± σ): 204.6 ms ± 7.5 ms [User: 409.7 ms, System: 324.3 ms]
16 cores = Time (mean ± σ): 165.3 ms ± 6.0 ms [User: 557.6 ms, System: 484.6 ms]
24 cores = Time (mean ± σ): 164.8 ms ± 5.1 ms [User: 608.8 ms, System: 616.5 ms]

If I don't use taskset, then I get the following for all my cores:
Time (mean ± σ): 160.8 ms ± 4.2 ms [User: 818.5 ms, System: 1271.7 ms]

That sounds like an over-commitment of CPU cores; is it possible for that to happen?

@davidlattimore
Owner

That sounds like an over-commitment of CPU cores; is it possible for that to happen?

Do you mean running with more active threads than the value of the --threads= parameter? It shouldn't. Here's a profile of wild running with --threads=4:

https://share.firefox.dev/4dDwg5l

The worker pool has 4 threads. The main thread makes it 5 threads, but when the main thread is doing work, at most one of the worker pool threads is working.

Or perhaps you were asking something different?
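
For reference, assuming a rayon-style pool (a sketch; Wild's actual setup may differ), capping the workers looks roughly like this:

use rayon::prelude::*;

fn main() {
    let threads = 4; // e.g. parsed from a --threads=4 flag
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .unwrap();
    // Parallel work now runs on the `threads` pool workers while the
    // main thread blocks waiting for the result.
    let total: u64 = (0u64..1_000_000).into_par_iter().sum();
    println!("{total}");
}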

@davidlattimore
Owner

One thing you might notice from the above profile is that building the symbol DB and string merging are doing a bit of work single-threaded. In both cases, I do as much work as I can from multiple threads. In particular, hashing the symbols and strings is all done from multiple threads, so the main thread only needs to populate the hashmap with prehashed values. Nevertheless, it's still a potential bottleneck. I'm going to experiment with dashmap to see if building these hashmaps from multiple threads is an overall win or loss.
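
A sketch of the prehashing idea (std-only and illustrative: it ignores hash collisions and is not Wild's actual code):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

// A hasher that passes a precomputed u64 straight through, so the map
// never rehashes keys that worker threads already hashed in parallel.
#[derive(Default)]
struct Passthrough(u64);

impl Hasher for Passthrough {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, _: &[u8]) { unreachable!("keys are prehashed") }
    fn write_u64(&mut self, n: u64) { self.0 = n; }
}

fn prehash(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn main() {
    // Phase 1 (done on worker threads in the real thing): hash the names.
    let names = ["main", "memcpy", "_start"];
    let prehashed: Vec<(u64, &str)> = names.iter().map(|n| (prehash(n), *n)).collect();

    // Phase 2 (single-threaded): populate the map keyed by the stored
    // hash, so the main thread never touches the string bytes again.
    let mut db: HashMap<u64, &str, BuildHasherDefault<Passthrough>> = HashMap::default();
    for (hash, name) in prehashed {
        db.insert(hash, name);
    }
    println!("{} symbols", db.len());
}

The dashmap alternative would instead shard the map so that every worker inserts directly, trading per-shard locking for the single-threaded populate phase.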

@marxin
Contributor Author

marxin commented Aug 13, 2024

Do you mean running with more active threads than the value of the --threads= parameter?

Oh, I didn't know Wild already supported the --threads argument. Nice improvement!

@marxin
Contributor Author

marxin commented Sep 13, 2024

There are newer numbers for builds that include debug info: #117.
