Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow incremental rebuilds of Nexus tests #1122

Open
Tracked by #1155
davepacheco opened this issue May 26, 2022 · 2 comments
Open
Tracked by #1155

slow incremental rebuilds of Nexus tests #1122

davepacheco opened this issue May 26, 2022 · 2 comments
Labels
build Building the control plane

Comments

@davepacheco
Copy link
Collaborator

davepacheco commented May 26, 2022

I found that my incremental rebuilds of Nexus tests seemed to be getting long (1m40s just now). I'm going to put some data in this issue but otherwise I'm not actively working on this.

@davepacheco
Copy link
Collaborator Author

I've got a profiling script, raw data, and flame graph in this gist repo.

I found that two ld steps were taking a while: one was about 17s and another was just over a minute. I dug into the longer one a bit.

I set LD_ALTEXEC to point to a custom ld-profile.sh script in the above-mentioned repo. This script saves its own arguments and environment (using pargs), then runs the real ld under DTrace to profile it.

Here are the output files from one run that took just over a minute: pid 2556:

$ ls -l 2556.*
-rw-r--r--   1 dap      staff      60159 May 26 13:18 2556.args
-rw-r--r--   1 dap      staff       3989 May 26 13:18 2556.env
-rw-r--r--   1 dap      staff     370680 May 26 13:19 2556.stacks
-rw-r--r--   1 dap      staff      53110 May 26 13:19 2556.stderr
-rw-r--r--   1 dap      staff          0 May 26 13:18 2556.stdout

$ cat 2556.stderr
dtrace: description 'profile-97' matched 1 probe
dtrace: pid 2560 has exited

real    1m2.108s
user    0m0.395s
sys     0m0.225s

(It's weird that the file is 53110 bytes and that's the entire contents. It has a ton of NUL bytes right before the time output. I haven't dug into this.)

I used Brendan's tools to make a flame graph:

$ scp iv:ld-tracing/altexec-output/2556.stacks ld-stacks-2556.out
2556.stacks                                                                                     100%  362KB   1.1MB/s   00:00    
$ ./stackcollapse.pl < ld-stacks-2556.out > ld-stacks-2556-collapsed.out
$ ./flamegraph.pl < ld-stacks-2556-collapsed.out > ld-stacks-2556.svg
$ 

As a sanity check: there appear to be 4889 samples in this output:

$ awk '{sum += $NF} END { print sum }' ld-stacks-2556-collapsed.out 
4889

Sampling at 97 Hz a single-threaded process that took 62s, if it were on-CPU the whole time, we'd expect 6,014 samples. That suggests only 81% of time was spent on-CPU in userland in ld. I'm surprised by this but haven't checked against prstat yet.

The flame graph is also in this repo. It looks pretty simple, but I'm not familiar enough with ld to say much useful. I see we're spending about 33% of time in ld64_place_section and another 25% of time in ld_strmerge_sec.

Note too that the command line includes 650 arguments (!) including 257 object files and 313 rlib files. The rlib files appear to correspond with Rust dependencies.

@davepacheco
Copy link
Collaborator Author

@richlowe suggested trying LD_OPTIONS=-znocompstrtab. That seems to cut the ld invocation from about 62s to about 35s. Thanks! That's a nice improvement.

@davepacheco davepacheco changed the title slow incremental rebuilds of Nexus integration tests slow incremental rebuilds of Nexus tests May 27, 2022
@smklein smklein added the build Building the control plane label Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Building the control plane
Projects
None yet
Development

No branches or pull requests

2 participants