Wild performance with debug info support #117

marxin opened this issue Sep 10, 2024 · 18 comments

@marxin
Collaborator

marxin commented Sep 10, 2024

Now that the recent support for DWARF has been added, we're falling behind the mold linker when it comes to linking something huge (like clang):

❯ bloaty ../../../../bin/clang-20
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.5%  3.31Gi   0.0%       0    .debug_info
   9.4%   427Mi   0.0%       0    .debug_loclists
   5.1%   231Mi   0.0%       0    .debug_str
   4.9%   220Mi   0.0%       0    .debug_line
   2.0%  91.3Mi  61.6%  91.3Mi    .text
   1.7%  76.4Mi   0.0%       0    .debug_rnglists
   0.9%  41.5Mi  28.0%  41.5Mi    .rodata
   0.5%  23.6Mi   0.0%       0    .debug_abbrev
   0.5%  21.7Mi   0.0%       0    .strtab
   0.2%  9.01Mi   6.1%  9.01Mi    .eh_frame
   0.1%  4.89Mi   0.0%       0    .symtab
   0.1%  4.48Mi   3.0%  4.48Mi    .data
   0.1%  3.64Mi   0.0%       0    .debug_aranges
   0.0%  1.09Mi   0.7%  1.09Mi    .eh_frame_hdr
   0.0%       0   0.5%   704Ki    .bss
   0.0%   365Ki   0.0%       0    .debug_line_str
   0.0%  43.4Ki   0.0%  43.4Ki    .rela.dyn
   0.0%  14.8Ki   0.0%  14.4Ki    [17 Others]
   0.0%  8.21Ki   0.0%  8.21Ki    .dynstr
   0.0%  7.62Ki   0.0%  7.62Ki    .dynsym
   0.0%  5.77Ki   0.0%  5.77Ki    .got
 100.0%  4.44Gi 100.0%   148Mi    TOTAL

> wild ...
┌───    2.43 Open input files
├───    1.62 Split archives
├───    3.53 Parse input files
│ ┌───    2.76 Read symbols
│ ├───    7.66 Populate symbol map
├─┴─   11.47 Build symbol DB
│ ┌─── 1719.15 Resolve symbols
│ ├───    1.25 Assign section IDs
│ ├─── 2206.91 Merge strings
│ ├───    0.04 Canonicalise undefined symbols
│ ├───    6.99 Resolve alternative symbol definitions
├─┴─ 3934.42 Symbol resolution
│ ┌───   28.55 Find required sections
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    2.53 Finalise per-object sizes
│ ├───    0.02 Apply non-addressable indexes
│ ├───    0.15 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.30 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    5.93 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─   37.67 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.66 Split output buffers by group
│ ├─┴─ 2962.53 Write data to file
│ ├───    3.33 Sort .eh_frame_hdr
├─┴─ 2965.95 Write output file
│ ┌───  180.89 Unmap output file
│ ├───   22.78 Drop layout
│ ├───    0.43 Drop symbol DB
│ ├───  243.71 Drop input data
├─┴─  447.86 Shutdown
└─ 7513.08 Link

real	0m7.684s

Mold runs for:

real	0m3.053s
user	0m0.024s
sys	0m0.008s

As shown, a huge part of the link time is spent in the single-threaded Merge strings phase, where we have to deal with the following huge section:

2024-09-10T08:42:31.369525Z DEBUG Link:Symbol resolution:Merge strings: wild_lib::resolution: merge_strings section=.debug_str size=242491445 sec.totally_added=6249233166 strings=2600883

That is 231 MiB of merged output, where the total input size before deduplication is about 6 GiB.

Related to: #37.

@marxin
Collaborator Author

marxin commented Sep 10, 2024

If I invoke Mold with -Wl,--compress-debug-sections=zstd, the linker runs for about 4.1s and the output binary shrinks to:

❯ bloaty ../../../../bin/clang-20
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  78.7%  1.61Gi   0.0%       0    .debug_info
   5.4%   112Mi   0.0%       0    .debug_loclists
   4.6%  95.8Mi  59.8%  95.8Mi    .text
   2.6%  54.5Mi   0.0%       0    .debug_str
   2.5%  53.1Mi   0.0%       0    .debug_line
   1.8%  37.9Mi  23.7%  37.9Mi    .rodata
   1.2%  25.3Mi   0.0%       0    .strtab
   1.2%  24.7Mi   0.0%       0    .debug_rnglists
   0.5%  9.72Mi   6.1%  9.72Mi    .eh_frame
   0.4%  8.50Mi   0.0%       0    .symtab
   0.3%  6.00Mi   0.0%       0    .debug_abbrev
   0.2%  4.49Mi   2.8%  4.49Mi    .dynstr
   0.2%  4.22Mi   2.6%  4.22Mi    .data.rel.ro
   0.1%  1.81Mi   1.1%  1.81Mi    .rodata.str1.1
   0.1%  1.54Mi   1.0%  1.54Mi    .rodata.str1.8
   0.1%  1.30Mi   0.8%  1.30Mi    .dynsym
   0.1%  1.24Mi   0.8%  1.24Mi    .eh_frame_hdr
   0.1%  1.06Mi   0.6%  1015Ki    [33 Others]
   0.0%   839Ki   0.0%       0    .debug_aranges
   0.0%       0   0.4%   715Ki    .bss
   0.0%   444Ki   0.3%   444Ki    .hash
 100.0%  2.04Gi 100.0%   160Mi    TOTAL

@davidlattimore
Owner

We discussed (via a video call) the possibility of splitting string merge sections into buckets based on part of the hash of the string. We could then build a hashmap per bucket and also do writing per bucket. One issue is that in order to ensure deterministic behaviour, the strings in the bucket would need to be sorted prior to assigning indexes to the strings. Another potential issue is that it might not play nicely with caches, since adjacent strings from the input files will likely end up in different buckets. Even if they don't, they're not likely to get written in order.
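The bucketing scheme described above could be sketched roughly as follows. This is a minimal illustration, not Wild's actual code; the names (`bucket_of`, `merge_bucketed`) and the use of `DefaultHasher` are made up for the example. The key points are that the bucket is chosen from the string's hash, and each bucket is sorted before offsets would be assigned, to keep the result deterministic:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const NUM_BUCKETS: usize = 16;

/// Assign a string to a bucket based on its hash, so each bucket can be
/// deduplicated (and later written) independently, e.g. on its own thread.
fn bucket_of(s: &[u8]) -> usize {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    (h.finish() as usize) % NUM_BUCKETS
}

/// Deduplicate strings per bucket, then sort each bucket so that offset
/// assignment would be deterministic regardless of input order.
fn merge_bucketed<'a>(strings: &[&'a [u8]]) -> Vec<Vec<&'a [u8]>> {
    let mut buckets: Vec<HashMap<&'a [u8], ()>> = vec![HashMap::new(); NUM_BUCKETS];
    for s in strings {
        buckets[bucket_of(s)].insert(*s, ());
    }
    buckets
        .into_iter()
        .map(|b| {
            let mut v: Vec<&'a [u8]> = b.into_keys().collect();
            v.sort_unstable(); // deterministic order within the bucket
            v
        })
        .collect()
}
```

The cache-unfriendliness concern is visible here: adjacent strings from one input file will usually hash into different buckets, so both insertion and writing would scatter across buckets.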

Another possibility is to use a concurrent hashmap. We can't allocate offsets to merged strings like we do now when we insert into the merged-strings hashmap, because that would result in non-determinism. We could instead store, as the value, which input file was selected to "own" that string. Later, when another file is found to have the same string, it might or might not replace the previous "owner" of the string, depending on some deterministic selection. This has the advantage that a file can then write all the strings that it owns in the order in which they appear in the input file, which is likely to give better cache performance. If a file owns multiple adjacent strings, then it could potentially write them all with a single memcpy.
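The deterministic "owner" selection might look something like the following sequential sketch. The real proposal would use a concurrent map, and the lowest-file-index rule here is just one possible deterministic tie-break; `assign_owners` is a hypothetical name, not a Wild function:

```rust
use std::collections::HashMap;

/// Decide which input file "owns" each merged string. When two files
/// contain the same string, the file with the lower index wins -- a
/// deterministic rule, so the result does not depend on the order in
/// which insertions happen (important once the map is concurrent).
fn assign_owners<'a>(files: &[Vec<&'a [u8]>]) -> HashMap<&'a [u8], usize> {
    let mut owners: HashMap<&'a [u8], usize> = HashMap::new();
    for (file_idx, strings) in files.iter().enumerate() {
        for s in strings {
            owners
                .entry(*s)
                .and_modify(|owner| *owner = (*owner).min(file_idx))
                .or_insert(file_idx);
        }
    }
    owners
}
```

With ownership decided, each file can emit the strings it owns in their original input order, which is what enables the single-memcpy optimisation for runs of adjacent owned strings.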

marxin added a commit to marxin/wild that referenced this issue Sep 10, 2024
marxin added a commit to marxin/wild that referenced this issue Sep 10, 2024
marxin added a commit to marxin/wild that referenced this issue Sep 10, 2024
davidlattimore pushed a commit that referenced this issue Sep 10, 2024
@marxin
Collaborator Author

marxin commented Sep 13, 2024

After the recent changes, we get to:

┌───    1.82 Open input files
├───    1.13 Split archives
├───    3.78 Parse input files
│ ┌───    2.90 Read symbols
│ ├───    4.71 Populate symbol map
├─┴─    8.80 Build symbol DB
│ ┌─── 1238.77 Resolve symbols
│ ├───    1.81 Assign section IDs
│ ├───  659.97 Merge strings
│ ├───    0.03 Canonicalise undefined symbols
│ ├───    6.73 Resolve alternative symbol definitions
├─┴─ 1907.36 Symbol resolution
│ ┌───   46.31 Find required sections
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    2.15 Finalise per-object sizes
│ ├───    0.03 Apply non-addressable indexes
│ ├───    0.17 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.35 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    5.78 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─   55.12 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    1.21 Split output buffers by group
│ ├─┴─ 2265.57 Write data to file
│ ├───    3.67 Sort .eh_frame_hdr
├─┴─ 2269.36 Write output file
│ ┌───  158.30 Unmap output file
│ ├───   19.11 Drop layout
│ ├───    0.27 Drop symbol DB
│ ├───  134.21 Drop input data
├─┴─  311.95 Shutdown
└─ 4581.64 Link

real	0m4.746s
user	1m12.046s
sys	0m13.231s

@davidlattimore
Owner

I assume there's still plenty of opportunity for further optimisation with writing of merged strings? I'd also be interested to try the concurrent hashmap approach that I mentioned above at some point and see how it compares.

I should probably do a debug build of clang myself so that I can run this benchmark too.

@marxin
Collaborator Author

marxin commented Sep 14, 2024

I assume there's still plenty of opportunity for further optimisation with writing of merged strings?

I hope there's still room for improvement :)

I'd also be interested to try the concurrent hashmap approach that I mentioned above at some point and see how it compares.

I am really curious if it can help us. We might want to start filling up the concurrent hash map as early as possible, ideally in the find_required_sections function. I was also thinking about storing a HashMap<u64, u64> in MergeStringsFileSection which would keep a mapping from an offset within a MergeStringsFileSection to the offset in the final merged string section. With that map we wouldn't have to build the StringToMerge instances in get_merged_string_output_address.
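A minimal sketch of that idea (the struct fields and method name here are hypothetical, not Wild's actual types): resolving a relocation against a merged string becomes a single integer-keyed lookup instead of re-hashing the string bytes.

```rust
use std::collections::HashMap;

/// Per-input-section cache: maps the offset of a string within this
/// input string-merge section to its offset in the final merged
/// output section.
struct MergeStringsFileSection {
    offset_map: HashMap<u64, u64>, // input offset -> output offset
}

impl MergeStringsFileSection {
    /// Resolve an input-section offset to the merged-section offset,
    /// if the offset names the start of a known string.
    fn output_offset(&self, input_offset: u64) -> Option<u64> {
        self.offset_map.get(&input_offset).copied()
    }
}
```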

I should probably do a debug build of clang myself so that I can run this benchmark too.

It's pretty simple, just clone the repository and run:

mkdir build
cd build
cmake ../llvm -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_PROJECTS=clang
make clang -j123

@davidlattimore
Owner

Wow, debug info really makes a pretty massive difference, doesn't it? With the same input files, linking clang on my laptop with --strip-debug takes 300 ms. Without --strip-debug it takes 16 to 19 seconds. I even saw some runs take as long as 27 seconds. I'm not sure why the variation is so large. Profiles seem to show that during writing there are periods of up to a second when none of the threads appear to be doing anything. Peak RSS seems to be around 24G, which is a lot, but still well within the 42G of RAM in my laptop.

I was also thinking about storing a HashMap<u64, u64> in MergeStringsFileSection which would keep a mapping from an offset within a MergeStringsFileSection to the offset in the final merged string section. With that map we wouldn't have to build the StringToMerge instances in get_merged_string_output_address.

That sounds like it could help quite a bit. Certainly get_merged_string_output_address is showing up as a pretty substantial proportion of the runtime.

I'll hold off on making any optimisations in this area while you are, so as to avoid merge conflicts.

@marxin
Collaborator Author

marxin commented Sep 15, 2024

Wow, debug info really makes a pretty massive difference, doesn't it? With the same input files, linking clang on my laptop with --strip-debug takes 300 ms. Without --strip-debug it takes 16 to 19 seconds.

Oh yeah, debug info tends to be pretty large and brings a great many relocations and a huge pool of strings for the string-merge section. The described delay might be related to disk congestion, where you can't effectively utilize your CPU, right?

That sounds like it could help quite a bit. Certainly get_merged_string_output_address is showing up as a pretty substantial proportion of the runtime.

I'll hold off on making any optimisations in this area while you are, so as to avoid merge conflicts.

Feel free to do so. I've tried a couple of experiments and none of them seemed promising :) Right now, I don't have any temporary patchset that I would need to rebase.

davidlattimore added a commit that referenced this issue Sep 16, 2024
On my laptop, this reduces the time to link clang with debug info by about 39%.

Issue #117
davidlattimore added a commit that referenced this issue Sep 16, 2024
On my laptop, this reduces the time to link clang with debug info by about 39%.

Issue #117
davidlattimore added a commit that referenced this issue Sep 16, 2024
On my laptop, this reduces the time to link clang with debug info by about 39%.

Issue #117
@davidlattimore
Owner

I added a cache for merged string lookups - somewhat similar to what you described. It did indeed improve performance. Link time on my laptop for a debug build of clang reduced by about 39%. I'm going to experiment with a concurrent hashmap to see if we can reduce the time further - that's a more drastic change though.

@marxin
Collaborator Author

marxin commented Sep 16, 2024

Oh, great, the cache basically aligns with my idea. I can confirm the improvement on my machine:

❯ hyperfine 'g++ @cmd -B ~/Programming/wild' 'g++ @cmd -B /tmp/'
Benchmark 1: g++ @cmd -B ~/Programming/wild
  Time (mean ± σ):      4.626 s ±  0.242 s    [User: 72.387 s, System: 12.711 s]
  Range (min … max):    4.288 s …  4.988 s    10 runs
 
Benchmark 2: g++ @cmd -B /tmp/
  Time (mean ± σ):      3.650 s ±  0.118 s    [User: 36.700 s, System: 26.740 s]
  Range (min … max):    3.445 s …  3.811 s    10 runs
 
Summary
  g++ @cmd -B /tmp/ ran
    1.27 ± 0.08 times faster than g++ @cmd -B ~/Programming/wild

The latter one is the current main with the caching. I'm really curious about the concurrent hashmap.

@davidlattimore
Owner

I got things working with the concurrent hashmap. I used dashmap. It was a pretty substantial change that added a moderate amount of complexity. I would have been OK with that if it had sped things up, but it ended up slowing string merging moderately, so I'm going to give up on that approach. I've got more ideas that I'd like to try out, however I should probably prepare my talk for GOSIM 2024 first, so it might be a while before I get back to this.

@marxin
Collaborator Author

marxin commented Sep 20, 2024

Quite interesting results. Would you mind sharing the git branch with the concurrent hashmap?

@davidlattimore
Owner

@marxin
Collaborator Author

marxin commented Sep 23, 2024

I've just measured the performance for Clang compiler build (w/ debug info), and Wild after 54adc37 is about 15% slower than before the revision which added the caching for string merging section (0282c13).

What shall we do, @davidlattimore?

davidlattimore added a commit that referenced this issue Sep 23, 2024
When it was added, it appeared to speed up string merging, however it was buggy
and once the bug(s) were fixed, it turns out that it actually makes things
slower.

Issue #117
davidlattimore added a commit that referenced this issue Sep 24, 2024
When it was added, it appeared to speed up string merging, however it was buggy
and once the bug(s) were fixed, it turns out that it actually makes things
slower.

Issue #117
@davidlattimore
Owner

Thanks for noticing that. You're right. I've removed the string-merge caching and saw a speedup, though I think the speedup I got on my laptop was closer to 5%. I guess we were getting good performance before because the cache hit rate was so high, and the hit rate was so high because there were lots of cache hits that shouldn't have been cache hits.

We're possibly still a little bit slower than we were before the caching was introduced. At first I thought that maybe this was my change to resolve sections separately from resolving symbols, but now I don't think this is actually the case.

Looking at the output of samply, I'm seeing a period of time when all the threads are idle except the one thread that's deleting the old output file. It wasn't like this before. I think what's happening is that symbol resolution tries to spawn N tasks, where N is the size of our thread pool. One of those tasks can't start because a thread is being used for file deletion. I'm unsure how best to fix that. A few ideas are:

  • Use a new thread for file deletion rather than using a thread from the rayon threadpool.
  • Spawn N-1 tasks when doing symbol resolution
  • Instead of deleting the file, rename it, then delete the renamed file at the same time as we create the new file. This is after the last time when we spawn N tasks (in the "Find required sections" phase).
  • Rename the file then delete it from a detached subprocess that could be launched just before we shutdown.
  • Propose and add a new API to rayon so that we don't need to do the spawn-N-tasks thing. I'm thinking of some kind of growable work queue, though unlike rayon::spawn, operating on homogeneous data, so that the work items don't each need a separate heap allocation.
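For what it's worth, the first idea on the list could be as simple as the following sketch (not Wild's code; the function name is made up). A plain std::thread is spawned outside the rayon pool, so all N pool threads stay available for symbol resolution, and the handle can be joined at shutdown:

```rust
use std::path::PathBuf;
use std::thread;

/// Delete the old output file on a dedicated OS thread rather than
/// borrowing a worker from the rayon thread pool. The caller keeps the
/// JoinHandle and joins it before (or while) shutting down.
fn delete_on_background_thread(path: PathBuf) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Ignore errors: the file may simply not exist on a first link.
        let _ = std::fs::remove_file(&path);
    })
}
```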

@marxin
Collaborator Author

marxin commented Sep 26, 2024

I can confirm that after b009335, the Wild linker is about 12% faster on the Clang benchmark. Good job.

marxin pushed a commit to marxin/wild that referenced this issue Sep 27, 2024
When it was added, it appeared to speed up string merging, however it was buggy
and once the bug(s) were fixed, it turns out that it actually makes things
slower.

Issue davidlattimore#117
@marxin
Collaborator Author

marxin commented Nov 3, 2024

I did some investigation into split-debuginfo=packed and the results on Linux are promising:

  • for each object file a .dwo file is created, and it bypasses the linker
  • these files are merged into a .dwp (DWARF package file) by a Rust utility: https://github.com/rust-lang/thorin
  • Thorin does the expensive string merging operation ;)

I've collected some stats for ruff binary:

packed:

bloaty target/debug/ruff
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  30.8%  38.8Mi  70.5%  38.8Mi    .text
  18.4%  23.1Mi   0.0%       0    .debug_line
  12.3%  15.5Mi   0.0%       0    .strtab
  11.6%  14.6Mi   0.0%       0    .debug_ranges
...
 100.0%   125Mi 100.0%  55.0Mi    TOTAL
❯ bloaty target/debug/ruff.dwp
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  47.3%  83.7Mi   NAN%       0    .debug_info.dwo
  43.1%  76.2Mi   NAN%       0    .debug_str.dwo
   7.6%  13.4Mi   NAN%       0    .debug_str_offsets.dwo
...
 100.0%   176Mi 100.0%       0    TOTAL

As shown, some DWARF-related sections still live in the main binary, but the majority goes to the .dwo object files. In total, about 25% of the DWARF info remains in the main binary.

The normal split-debuginfo mode has the following size:

❯ bloaty target/debug/ruff
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  36.4%   110Mi   0.0%       0    .debug_info
  24.0%  72.7Mi   0.0%       0    .debug_str
  12.8%  38.8Mi  70.5%  38.8Mi    .text
   7.7%  23.2Mi   0.0%       0    .debug_line
...
 100.0%   302Mi 100.0%  55.0Mi    TOTAL

The total file size is pretty comparable.

@davidlattimore
Owner

Nice. Is there anything needing to change in Wild in order to support this, or does it just work? I just tried with:

[profile.dev]
split-debuginfo = "packed"

and also with unpacked. Both appeared to work. So I guess the compiler just decides what stuff to put in the .o files vs .dwo files and the linker doesn't need to do anything special?

@marxin
Collaborator Author

marxin commented Nov 4, 2024

Yes, it works out of the box :) Rust's thorin tool takes care of creating the final packed DWARF package (.dwp).
