
ccall tests take a very long time to run on Buildkite, but only when running under rr #42242

Closed
DilumAluthge opened this issue Sep 13, 2021 · 19 comments
Labels
bug (Indicates an unexpected problem or unintended behavior) · ci (Continuous integration)

Comments

@DilumAluthge
Member

DilumAluthge commented Sep 13, 2021

| Description | Running under rr? | ccall tests | Logs | rr trace files |
| --- | --- | --- | --- | --- |
| amdci5, manual | No | 0m 40s | | |
| amdci5, manual | Yes | 0m 52s | | |
| amdci5, Buildkite | No | 1m 2s | View | |
| amdci5, Buildkite | Yes | 28m 34s | View | Download (includes other test sets in addition to ccall) |
| amdci5, Buildkite | Yes | 22m 4s | View | Download (only includes the ccall test set) |
| Buildbot | Yes | 2m 34s | View | |

Notes:

- In all rows, we are running on Linux x86_64 with a 64-bit build of Julia.
- In all rows, the ccall tests pass.
- `manual` means that we are not using any kind of sandboxing.
@DilumAluthge added the `ci` (Continuous integration) and `bug` (Indicates an unexpected problem or unintended behavior) labels on Sep 13, 2021
@staticfloat
Member

staticfloat commented Sep 14, 2021

Looking at a slow ccall test suite run in isolation [0] (trace available here [1]), I get the following output from `rr ps`:

$ ~/src/rr/bin/rr ps .
PID     PPID    EXIT    CMD
315     --      0       julia-b18d471484/bin/julia -e Base.runtests(["ccall"]; ncores = Sys.CPU_THREADS)
339     315     0       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/../share/julia/test/runtests.jl ccall
367     339     0       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 --startup-file=no -e A = Ref{Cint}(42); finalizer(cglobal((:c_exit_finalizer, "libccalltest"), Cvoid), A)
391     339     1       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 --color=no --output-o=/tmp/jl_GWZHvR -e 
        Base.reinit_stdio();
        f() = ccall((:dne, :does_not_exist), Cvoid, ());
        f()

I don't see anything particularly interesting in the `ps` output. Replaying the trace does indeed run slowly. Using a basic `rr dump` analyzer I wrote [2], I found the following for a "slow" trace:

[ Info: Parsing 8min.dump....
[ Info: Parsed 97168 events
[ Info: Top twenty `rr` events by frequency:
`SCHED' => 55035
`SYSCALL: pwrite64' => 9178
`SYSCALLBUF_FLUSH' => 6799
`SYSCALLBUF_RESET' => 6799
`SYSCALL: mmap' => 4276
`SYSCALL: futex' => 2282
`SYSCALL: munmap' => 2224
`SYSCALL: brk' => 2224
`SYSCALL: mincore' => 1916
`SYSCALL: sched_yield' => 1250

And the following for a "fast" trace:

[ Info: Parsing 1min.dump....
[ Info: Parsed 38131 events
[ Info: Top twenty `rr` events by frequency:
`SYSCALL: pwrite64' => 9162
`SYSCALLBUF_FLUSH' => 5804
`SYSCALLBUF_RESET' => 5804
`SYSCALL: mmap' => 3494
`SYSCALL: brk' => 2218
`SYSCALL: munmap' => 2112
`SYSCALL: futex' => 2102
`SYSCALL: mincore' => 1908
`SYSCALL: sched_yield' => 1242
`SYSCALL: epoll_wait' => 576

So it looks like something is causing the "slow" version to have a huge number of SCHED events. I don't know what that means, but the increase in the number of events between the two traces is almost entirely due to the SCHED events.

[0] https://buildkite.com/julialang/julia-master/builds/3631#bd3067d3-e10e-4f46-bb55-8d7d25604ffc
[1] https://buildkite.com/organizations/julialang/pipelines/julia-master/builds/3631/jobs/bd3067d3-e10e-4f46-bb55-8d7d25604ffc/artifacts/347e748a-69f3-4b37-9c0a-ee92625abc07
[2] https://gist.github.com/staticfloat/a123f71ac19c2b4885adbb8aff2f3328
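
For reference, here is a minimal sketch of the kind of per-event frequency count described above. It is not the actual analyzer from [2]; it assumes each record in the `rr dump` text output carries an ``event:`NAME'`` field, and the file name matches the dump used above.

```julia
# Hypothetical sketch of an `rr dump` frequency counter (not the gist in [2]).
# Assumes events are printed as ... event:`SCHED' ... or event:`SYSCALL: mmap' ...
function count_events(path::AbstractString)
    counts = Dict{String,Int}()
    for line in eachline(path)
        m = match(r"event:`([^']+)'", line)
        m === nothing && continue
        name = m.captures[1]
        counts[name] = get(counts, name, 0) + 1
    end
    return counts
end

counts = count_events("8min.dump")
@info "Parsed $(sum(values(counts))) events"
for (name, n) in Iterators.take(sort!(collect(counts); by = last, rev = true), 20)
    println("`$name' => $n")
end
```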

@vchuravy
Member

What is `kernel.perf_event_paranoid` on the buildbots? On AMDCI it should be set to `kernel.perf_event_paranoid = 1`.

@staticfloat
Member

This is all being measured on AMDCI machines, and it is indeed set to 1.

@staticfloat
Member

@Keno your guess about scheduling problems seems to be on the right track. Do you know what a huge number of SCHED events means?

In case you want to look at this yourself, you can check out ~/slow_ccall_debugging on amdci5; it has trace directories for 1-minute and 8-minute runs of the ccall test suite (the fast version was captured manually, the slow one was taken from CI).

@Keno
Member

Keno commented Sep 14, 2021

SCHED means it's running in userspace. With slowdowns this large, what's usually happening is that there's a spinloop somewhere that busy-waits for completion of an event it expects to happen in parallel on another thread, but because no other threads are running it spins for a really long time. See e.g. f7e5f3e.
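
To make that failure mode concrete, here is an illustrative sketch (not code from the test suite; it assumes `JULIA_NUM_THREADS > 1`). Under rr only one thread executes at a time, so a loop that busy-waits for work expected to complete "in parallel" makes no progress until rr preempts it, and every preemption shows up as a SCHED event.

```julia
# Illustrative only: a busy-wait that is cheap on real hardware but pathological
# under rr's single-core serialization. Assumes JULIA_NUM_THREADS > 1.
done = Threads.Atomic{Bool}(false)

Threads.@spawn begin
    # Work the spinning thread expects to complete "in parallel".
    sleep(0.01)
    done[] = true
end

while !done[]
    # Pure spin: under rr this burns a whole timeslice (one SCHED event per
    # preemption) before the other thread ever gets to run. Yielding, or
    # blocking on a condition, would let the scheduler switch immediately.
end
```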

@staticfloat
Member

The bizarre thing is that it doesn't seem to happen when not running on CI. I haven't been able to identify what might be different between the two environments. :(

@Keno
Member

Keno commented Sep 14, 2021

Is there a filter running that blocks `SYS_rrcall_check_presence`, which is one of the high-numbered syscalls?

@staticfloat
Member

I don't think so; as far as I know, we don't filter syscalls in the sandbox. And I've run it inside the same rootfs on the same machine (even with a nested sandbox, like we do on CI) and it stays "fast".

@vtjnash
Member

vtjnash commented Sep 14, 2021

Is Buildkite still misconfigured to run all of the tests with a massive number of threads?

@staticfloat
Member

Buildkite is configured to set JULIA_NUM_THREADS=16, which is much lower than the number of cores in the machine.

@DilumAluthge
Member Author

For rr we can consider lowering the number further, since (if I understand correctly) rr doesn't actually run multithreaded, but instead needs to simulate the multithreading. (Is that correct?)

So on the rr jobs only, we could do JULIA_NUM_THREADS=8 or even JULIA_NUM_THREADS=4.

@DilumAluthge
Member Author

DilumAluthge commented Sep 14, 2021

We could also decrease the maximum possible value of ncores when running the tests with rr.

Right now, when running the tests with rr, we never allow the value of ncores to be greater than 8.

If we want, we could further decrease that upper bound.
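
As a concrete (hypothetical) sketch, based on the `Base.runtests` invocation visible in the `rr ps` output above, tightening the cap would just mean clamping the `ncores` keyword harder:

```julia
# Hypothetical: cap the number of test worker processes under rr below the
# current limit of 8 (the value 4 here is purely illustrative).
rr_worker_cap = 4
Base.runtests(["ccall"]; ncores = min(Sys.CPU_THREADS, rr_worker_cap))
```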

@staticfloat
Member

Okay; so I have confirmed that setting JULIA_NUM_THREADS allows me to reproduce the slowdown, even outside of the sandbox! Huzzah!

The bad news is that, as far as I can tell, setting JULIA_NUM_THREADS shouldn't affect this at all, since the workload is single-threaded.

@DilumAluthge
Member Author

DilumAluthge commented Sep 15, 2021

> Okay; so I have confirmed that setting JULIA_NUM_THREADS allows me to reproduce the slowdown, even outside of the sandbox! Huzzah!

This is good news!

@DilumAluthge
Member Author

The weird part is that we also set JULIA_NUM_THREADS=16 on the non-rr Buildkite jobs (tester_linux64 without rr and tester_linux32 without rr), but we don't see the slowdown on those jobs.

@vtjnash
Member

vtjnash commented Sep 15, 2021

It was single threaded, until you forced it to launch 16x the number of threads intended and turned it into a parallelism benchmark instead.

@staticfloat
Member

> It was single threaded, until you forced it to launch 16x the number of threads intended and turned it into a parallelism benchmark instead.

What is the recommended way to ensure that Julia never launches more than N threads? I believe we added this environment variable because if we didn't, some parts of the test suite would launch as many workers as there are CPUs, which is clearly suboptimal as well.
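
For context on the multiplication being debated here, a rough back-of-the-envelope sketch (assuming, as is normal for environment variables, that each test worker process inherits `JULIA_NUM_THREADS`; the worker count is illustrative):

```julia
# Illustrative arithmetic only, not measured from CI: every worker process
# spawned for the test suite starts JULIA_NUM_THREADS threads of its own,
# so the totals multiply under rr's single-core serialization.
threads_per_worker = 16   # JULIA_NUM_THREADS=16 as set on Buildkite
nworkers = 8              # assumed worker count, e.g. the ncores cap for rr jobs
total_threads = threads_per_worker * nworkers
println("~$total_threads Julia threads competing for one rr-serialized core")
```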

@vtjnash
Member

vtjnash commented Sep 15, 2021

So, in order to ensure it doesn't sometimes launch more than 128 threads, you force it to always launch 256 instead? I don't get this logic.

@vtjnash
Member

vtjnash commented Sep 23, 2021

Closing, as this was fixed in the Buildkite configuration.
