Wild performance with debug info support #117
If I invoke Mold with …
We discussed (via a video call) the possibility of splitting string-merge sections into buckets based on part of the hash of the string. We could then build a hashmap per bucket and also do writing per bucket. One issue is that in order to ensure deterministic behaviour, the strings in each bucket would need to be sorted prior to assigning indexes to the strings. Another potential issue is that it might not play nicely with caches, since adjacent strings from the input files will likely end up in different buckets. Even if they don't, they're not likely to get written in order.

Another possibility is to use a concurrent hashmap. We can't allocate offsets to merged strings like we do now when we insert into the merged-strings hashmap, because that would result in non-determinism. We could instead store, as the value, which input file was selected to "own" that string. Later, when another file is found to have the same string, it might or might not replace the previous "owner" of the string, depending on some deterministic selection. This has the advantage that a file can then write all the strings that it owns in the order in which they appear in the input file, which is likely to give better cache performance. If a file owns multiple adjacent strings, then it could potentially write them all with a single …
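The bucketing idea above can be sketched single-threaded. Everything here is illustrative, not Wild's actual code: the bucket count, the use of std's default hasher, and the `assign_offsets` helper are all made up for the example. The key point is that sorting each bucket before assigning offsets makes the result independent of insertion order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BUCKETS: usize = 16;

// Pick a bucket from the top bits of the string's hash.
fn bucket_of(s: &str) -> usize {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    (h.finish() as usize >> 60) % BUCKETS
}

/// Split input strings into buckets, dedupe per bucket, then sort each
/// bucket before assigning output offsets, so the layout is deterministic
/// regardless of the order in which strings were seen.
fn assign_offsets(inputs: &[&str]) -> HashMap<String, u64> {
    let mut buckets: Vec<Vec<&str>> = vec![Vec::new(); BUCKETS];
    for &s in inputs {
        buckets[bucket_of(s)].push(s);
    }
    let mut offsets = HashMap::new();
    let mut next = 0u64;
    for bucket in &mut buckets {
        bucket.sort_unstable();
        bucket.dedup();
        for s in bucket.iter() {
            offsets.insert(s.to_string(), next);
            next += s.len() as u64 + 1; // +1 for the NUL terminator
        }
    }
    offsets
}

fn main() {
    // Two different input orders must produce identical layouts.
    let a = assign_offsets(&["foo", "bar", "foo", "baz"]);
    let b = assign_offsets(&["baz", "foo", "bar", "foo"]);
    assert_eq!(a, b);
    assert_eq!(a.len(), 3);
    println!("ok");
}
```

In a real linker each bucket could be built and written by a separate thread, which is where the cache-locality concern above comes in.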
After the recent changes, we get to: …

I assume there's still plenty of opportunity for further optimisation of the writing of merged strings? I'd also be interested to try the concurrent hashmap approach that I mentioned above at some point and see how it compares. I should probably do a debug build of clang myself so that I can run this benchmark too.
I hope there's still room for improvement :)
I am really curious if it can help us. We might want to start filling up the concurrent hash map as early as possible, ideally in the …
It's pretty simple; just clone the repository and run: …
Wow, debug info really makes a pretty massive difference, doesn't it. With the same input files, linking clang on my laptop with …
That sounds like it could help quite a bit. I'll certainly hold off on making any optimisations in this area while you are, so as to avoid merge conflicts.
Oh yeah, debug info tends to be pretty large and brings a great many relocations and a huge pool of strings from the string-merge section. The described delay might be related to disk congestion, where you can't effectively utilize your CPU, right?
Feel free to do so. I've tried a couple of experiments and none of them seemed promising :) Right now, I don't have any temporary patchset that I would need to rebase.
On my laptop, this reduces the time to link clang with debug info by about 39%. Issue #117
I added a cache for merged string lookups, somewhat similar to what you described. It did indeed improve performance: link time on my laptop for a debug build of clang reduced by about 39%. I'm going to experiment with a concurrent hashmap to see if we can reduce the time further; that's a more drastic change though.
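The actual cache in Wild isn't shown in this thread, so the following is only a hypothetical illustration of what such a lookup cache could look like: since adjacent relocations often resolve the same (input section, offset) pair, remembering the last resolution can short-circuit repeated hashmap lookups. All names here (`MergedLookup`, the key layout) are invented for the sketch.

```rust
use std::collections::HashMap;

/// Hypothetical cache in front of a global merged-string map, keyed by
/// (input section id, offset within that section) -> output offset.
struct MergedLookup<'a> {
    global: &'a HashMap<(u32, u64), u64>,
    last: Option<((u32, u64), u64)>, // most recent (key, value) resolved
    hits: u64,
    misses: u64,
}

impl<'a> MergedLookup<'a> {
    fn new(global: &'a HashMap<(u32, u64), u64>) -> Self {
        Self { global, last: None, hits: 0, misses: 0 }
    }

    fn resolve(&mut self, key: (u32, u64)) -> Option<u64> {
        // Fast path: same key as the previous lookup.
        if let Some((k, v)) = self.last {
            if k == key {
                self.hits += 1;
                return Some(v);
            }
        }
        // Slow path: hit the global hashmap and remember the result.
        self.misses += 1;
        let v = *self.global.get(&key)?;
        self.last = Some((key, v));
        Some(v)
    }
}

fn main() {
    let mut global = HashMap::new();
    global.insert((1u32, 0u64), 100u64);
    global.insert((1, 4), 200);
    let mut cache = MergedLookup::new(&global);
    assert_eq!(cache.resolve((1, 0)), Some(100));
    assert_eq!(cache.resolve((1, 0)), Some(100)); // served from the cache
    assert_eq!(cache.resolve((1, 4)), Some(200));
    assert_eq!(cache.hits, 1);
    println!("ok");
}
```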
Oh, great, the cache basically aligns with my idea. I can confirm the improvement on my machine: …

The latter is the current main with the caching. I'm really curious about the concurrent hashmap.
I got things working with the concurrent hashmap. I used dashmap. It was a pretty substantial change that added a moderate amount of complexity. I would have been OK with that if it had sped things up, but it ended up moderately slowing string merging, so I'm going to give up on that approach. I've got more ideas that I'd like to try out; however, I should probably prepare my talk for GOSIM 2024 first, so it might be a while before I get back to this.
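The dashmap branch itself isn't reproduced in this thread. As a rough approximation of the pattern, here is a sketch of a concurrent merged-string map using sharded std mutexes (dashmap shards internally in a broadly similar way), combined with the deterministic "lowest input-file index wins" ownership rule discussed earlier in the thread. The shard count and API are invented for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;
use std::thread;

const SHARDS: usize = 8;

/// Sharded concurrent map from string to the index of the input file that
/// "owns" it. Because ownership is resolved with a deterministic rule
/// (lowest file index wins), the final state is the same no matter how the
/// threads interleave.
struct MergeMap {
    shards: Vec<Mutex<HashMap<String, usize>>>,
}

impl MergeMap {
    fn new() -> Self {
        Self { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    fn shard(&self, s: &str) -> &Mutex<HashMap<String, usize>> {
        let mut h = DefaultHasher::new();
        s.hash(&mut h);
        &self.shards[(h.finish() as usize) % SHARDS]
    }

    fn insert(&self, s: &str, file: usize) {
        let mut map = self.shard(s).lock().unwrap();
        map.entry(s.to_string())
            .and_modify(|owner| *owner = (*owner).min(file))
            .or_insert(file);
    }

    fn owner(&self, s: &str) -> Option<usize> {
        self.shard(s).lock().unwrap().get(s).copied()
    }
}

fn main() {
    let map = MergeMap::new();
    // Four "input files" insert the same strings concurrently.
    thread::scope(|scope| {
        for file in 0..4 {
            let map = &map;
            scope.spawn(move || {
                for s in ["alpha", "beta", "gamma"] {
                    map.insert(s, file);
                }
            });
        }
    });
    assert_eq!(map.owner("alpha"), Some(0)); // lowest file index always wins
    println!("ok");
}
```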
Quite interesting results. Would you mind sharing the git branch with the concurrent hashmap?
Sure, I've uploaded it here: https://github.com/davidlattimore/wild/compare/string-merge-concurrent-hashmap?expand=1
I've just measured the performance for a Clang compiler build (w/ debug info), and Wild after 54adc37 is about 15% slower than before the revision that added the caching for the string-merge section (0282c13). What shall we do, @davidlattimore?
When it was added, it appeared to speed up string merging, however it was buggy and once the bug(s) were fixed, it turns out that it actually makes things slower. Issue #117
Thanks for noticing that. You're right. I've removed the string-merge caching and saw a speedup, though I think the speedup I got on my laptop was closer to 5%. I guess we were getting good performance before because we were getting such a good cache hit rate, and we were getting a good cache hit rate because there were lots of cache hits that shouldn't have been cache hits. We're possibly still a little bit slower than we were before the caching was introduced. At first I thought that this might be due to my change to resolve sections separately from resolving symbols, but now I don't think that's actually the case. Looking at the output of …
I can confirm that after b009335, the Wild linker is about 12% faster on the Clang benchmark. Good job.
I did some investigation related to …

I've collected some stats for the packed mode: …
As seen, there are still some DWARF-related sections that live in the main binary, but the majority goes to the … The normal split-debuginfo mode has the following size: …
The total file size is pretty comparable.
Nice. Is there anything that needs to change in Wild in order to support this, or does it just work? I just tried with: …

and also with …
Yes, it works out of the box :) Rust's Thorin tool takes care of creating the final packed DWARF package (.dwp).
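The exact commands tried above are elided in the thread. As a hedged illustration only, one way to exercise packed split debuginfo with Cargo might look like this; `-C split-debuginfo=packed` is a real rustc flag, but the crate layout and output path shown are assumptions, not taken from the thread.

```shell
# Build with DWARF split out of the main binary and packed into a
# single .dwp file (on Linux, rustc invokes thorin for the packing).
RUSTFLAGS="-C split-debuginfo=packed" cargo build

# The packed DWARF package typically lands next to the binary, e.g.:
ls target/debug/*.dwp
```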
Once the recent support for DWARF has been added, we're falling behind the mold linker when it comes to linking something huge (like clang). Mold runs for: …

As seen, a huge part of the linking is spent in the single-threaded Merge strings phase, where we have to deal with the following huge section:

2024-09-10T08:42:31.369525Z DEBUG Link:Symbol resolution:Merge strings: wild_lib::resolution: merge_strings section=.debug_str size=242491445 sec.totally_added=6249233166 strings=2600883
That is 231MiB, where the total input size before deduplication is 6GiB.

Related to: #37.
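As a quick sanity check of those figures (byte counts copied from the log line above, unit conversions computed here):

```rust
fn main() {
    let section_size: u64 = 242_491_445;  // bytes in .debug_str after merging
    let total_added: u64 = 6_249_233_166; // bytes contributed by all inputs

    let mib = section_size as f64 / (1024.0 * 1024.0);
    let gib = total_added as f64 / (1024.0 * 1024.0 * 1024.0);
    assert!((mib - 231.0).abs() < 1.0); // ~231 MiB surviving deduplication
    assert!((gib - 5.8).abs() < 0.1);   // ~5.8 GiB of input, i.e. "6GiB" rounded

    // Deduplication keeps only a few percent of the input bytes.
    let kept = 100.0 * section_size as f64 / total_added as f64;
    println!("dedup keeps {kept:.1}% of input bytes");
}
```

So string merging discards roughly 96% of the `.debug_str` input, which is why this single-threaded phase dominates the link.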