Speed & binary size comparison for Clang binary #25
Thanks for trying this out, as well as for the other issues you filed and the sponsorship! There are so many things to work on that it can sometimes be hard to pick the next thing to tackle. It would indeed be interesting to see a comparison with debug info, so that's good feedback. One area where I've seen Wild perform less well is when there are lots of input objects. There's some bookkeeping that Wild does per-object, and some of that bookkeeping is done on a single thread. I've spent the last week on some refactorings so that I can eventually do this bookkeeping on groups of objects rather than individual objects. I'm hopeful that this will improve performance in cases like this.
I've finished the changes I mentioned to process files in groups. It indeed resulted in a speedup for linking a dev build of
It didn't speed up linking
Something interesting is that the relative timings that I get differ from what you get. For example, on my laptop, linking clang with
There is one spot in the code where I allocate and initialise a large vec in a single thread. I've got some ideas that should allow it to be left uninitialised, leaving the initialisation to the separate threads.
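The idea of leaving a large vec uninitialised and letting worker threads fill it can be sketched roughly as follows. This is not Wild's code: `parallel_init`, its parameters, and the `u64` element type are hypothetical names chosen for illustration, and it uses only the standard library (`Vec::spare_capacity_mut` plus scoped threads).

```rust
use std::mem::MaybeUninit;
use std::thread;

/// Builds a `Vec<u64>` of `len` elements where element `i` equals `fill(i)`,
/// without first zeroing the buffer on the calling thread.
/// (Hypothetical helper for illustration, not Wild's implementation.)
fn parallel_init(len: usize, num_threads: usize, fill: fn(usize) -> u64) -> Vec<u64> {
    // Reserve the memory but leave it uninitialised.
    let mut values: Vec<u64> = Vec::with_capacity(len);
    let spare: &mut [MaybeUninit<u64>] = &mut values.spare_capacity_mut()[..len];

    // Ceiling division, clamped so that `chunks_mut` never sees a size of 0.
    let threads = num_threads.max(1);
    let chunk_size = ((len + threads - 1) / threads).max(1);

    thread::scope(|scope| {
        for (chunk_index, chunk) in spare.chunks_mut(chunk_size).enumerate() {
            scope.spawn(move || {
                let base = chunk_index * chunk_size;
                // Each thread initialises only its own disjoint chunk.
                for (offset, slot) in chunk.iter_mut().enumerate() {
                    slot.write(fill(base + offset));
                }
            });
        }
    });

    // SAFETY: each of the first `len` slots was written by exactly one thread,
    // and every thread was joined when the scope above ended.
    unsafe { values.set_len(len) };
    values
}
```

Calling `parallel_init(10, 4, |i| (i as u64) * 2)` would split the ten slots into chunks of three, three, three and one. Whether this beats a single-threaded initialisation depends on the element type and on how the kernel hands out the pages, so it would need measuring.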
I share your feelings! Yeah, the modern linkers have plenty of features, support various hardware targets, and run on multiple operating systems. Personally, I think adding support for debug info should not be difficult (if you don't want to parse DWARF for things like the GDB index); the only necessary thing is support for tombstones for dead symbols (I guess). That will show the full-speed potential of Wild, as debug info tends to occupy a huge portion of a binary's size. Plus, just recently, Mold added yet another background trick to postpone debug info streaming to a separate process (
After the recent changes, Wild is doing even better:
Nice! Unfortunately, I regressed one benchmark with the earlier grouping-of-files change, since I end up with lots of large files in one shard. I'm now working on making sharding more flexible, so that I can have a variable number of files in each shard and base the shard size on the number of symbols rather than the number of objects. It's interesting how the split between user and system CPU time is so different on your system than on mine. For me, all the linkers show more user time and less system time, whereas for you it's the other way around. E.g. this is what it looks like for me:
I guess it's probably due to differences between our systems, e.g. kernel, CPU, filesystem, etc.
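To make the symbol-based sharding idea above concrete, here is a minimal sketch. It assumes each input file exposes a symbol count. `InputFile`, `shard_by_symbols`, and the `symbols_per_shard` budget are hypothetical names, not Wild's actual types or heuristics.

```rust
use std::ops::Range;

/// Hypothetical per-file summary; only the symbol count matters here.
struct InputFile {
    num_symbols: usize,
}

/// Groups consecutive files into shards of roughly `symbols_per_shard`
/// symbols each. Files are never split, so a single very large file ends
/// up in a shard of its own instead of dragging many others along with it.
fn shard_by_symbols(files: &[InputFile], symbols_per_shard: usize) -> Vec<Range<usize>> {
    let mut shards = Vec::new();
    let mut start = 0;
    let mut symbols_in_shard = 0;

    for (index, file) in files.iter().enumerate() {
        // Close the current (non-empty) shard if this file would exceed the budget.
        if index > start && symbols_in_shard + file.num_symbols > symbols_per_shard {
            shards.push(start..index);
            start = index;
            symbols_in_shard = 0;
        }
        symbols_in_shard += file.num_symbols;
    }

    // Whatever is left over forms the last shard.
    if start < files.len() {
        shards.push(start..files.len());
    }
    shards
}
```

With symbol counts of 10, 10, 500, 5, 5, 5 and a budget of 100, this yields the shards `0..2`, `2..3` and `3..6`: the 500-symbol file is isolated, while the small files are batched together.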
It's related to the number of CPU cores in use. More cores mean more synchronization work, I guess: 4 cores =
If I don't use
That sounds like an over-commit of CPU cores. Can that happen?
Do you mean running with more active threads than the value of the
https://share.firefox.dev/4dDwg5l
The worker pool has 4 threads. The main thread makes it 5 threads, but when the main thread is doing work, at most one of the worker pool threads is working. Or perhaps you were asking something different?
One thing you might notice from the above profile is that building the symbol DB and string merging are doing a bit of work single-threaded. In both cases, I do as much work as I can from multiple threads. In particular, hashing the symbols and strings is all done from multiple threads, so the main thread only needs to populate the hashmap with prehashed values. Nevertheless, it's still a potential bottleneck. I'm going to experiment with dashmap to see if building these hashmaps from multiple threads is an overall win or loss.
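The "hash on many threads, insert prehashed values on one thread" pattern described above could look roughly like this with only the standard library. This is a sketch under assumptions, not Wild's symbol DB: `PreHashed`, `PassThroughHasher`, `build_symbol_map`, and `lookup` are hypothetical names, and the "first definition wins" rule is a simplification.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};
use std::thread;

/// A symbol name bundled with a hash that was computed ahead of time.
#[derive(PartialEq, Eq, Debug)]
struct PreHashed<'a> {
    hash: u64,
    name: &'a str,
}

impl Hash for PreHashed<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed only the precomputed value; the name is not rehashed.
        state.write_u64(self.hash);
    }
}

/// Hasher that simply returns the `u64` it was handed.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("PreHashed only ever calls write_u64");
    }
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

type SymbolMap<'a> = HashMap<PreHashed<'a>, usize, BuildHasherDefault<PassThroughHasher>>;

fn hash_name(name: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    hasher.finish()
}

/// Phase 1 hashes names on several threads; phase 2 fills the map on one.
fn build_symbol_map<'a>(names: &'a [&'a str], num_threads: usize) -> SymbolMap<'a> {
    let threads = num_threads.max(1);
    let chunk_size = ((names.len() + threads - 1) / threads).max(1);

    let hashed: Vec<Vec<PreHashed<'a>>> = thread::scope(|scope| {
        let handles: Vec<_> = names
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|&name| PreHashed { hash: hash_name(name), name })
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Single-threaded part: inserting is cheap because no string is rehashed.
    let mut map = SymbolMap::default();
    let mut symbol_id = 0;
    for key in hashed.into_iter().flatten() {
        map.entry(key).or_insert(symbol_id); // first definition wins
        symbol_id += 1;
    }
    map
}

fn lookup(map: &SymbolMap<'_>, name: &str) -> Option<usize> {
    map.get(&PreHashed { hash: hash_name(name), name }).copied()
}
```

The single-threaded insert loop is what remains as the potential bottleneck. A concurrent map such as dashmap would replace phase 2 with inserts from the worker threads, trading that loop for synchronization cost.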
Yeah, oh, I didn't know Wild already supported
There are newer numbers for builds that include debug info: #117.
Kudos to the project, I can now link the clang-20 binary that is part of the LLVM project and is quite large (~190 MiB without debug info). The binary can be built fairly easily with the following series of commands:

One needs to trim the -rpath linker flags to link the binary with Wild. The speed results on a 24-CPU AMD Ryzen 9 7900X are the following (using --no-fork for mold):

The results are promising for the Wild linker, which is on par with Mold. The same holds for binary size (both file size and VM size as reported by the bloaty tool):

That said, I think the Wild linker is doing well so far. More interesting would be support for debug info, where a binary like Clang is ~4.3 GiB and linking takes ~3 seconds with the Mold linker (BFD is desperately slow).
The issue is not an issue, feel free to close it at any time ;)