
ccall tests take a very long time to run on Buildkite, but only when running under rr #42242

Closed
DilumAluthge opened this issue Sep 13, 2021 · 19 comments
Labels
bug (Indicates an unexpected problem or unintended behavior) · ci (Continuous integration)

Comments

@DilumAluthge
Member

DilumAluthge commented Sep 13, 2021

| Description | Running under rr? | ccall tests | Logs | rr trace files |
| --- | --- | --- | --- | --- |
| amdci5, manual | No | 0m 40s | | |
| amdci5, manual | Yes | 0m 52s | | |
| amdci5, Buildkite | No | 1m 2s | View | |
| amdci5, Buildkite | Yes | 28m 34s | View | Download (includes other test sets in addition to ccall) |
| amdci5, Buildkite | Yes | 22m 4s | View | Download (only includes the ccall test set) |
| Buildbot | Yes | 2m 34s | View | |

Notes:

- In all rows, we are running on Linux x86_64 with a 64-bit build of Julia.
- In all rows, the ccall tests pass.
- `manual` means that we are not using any kind of sandboxing.
@DilumAluthge added the `ci` (Continuous integration) and `bug` (Indicates an unexpected problem or unintended behavior) labels on Sep 13, 2021
@staticfloat
Member

staticfloat commented Sep 14, 2021

Looking at a slow ccall test suite run in isolation [0] (trace available here [1]), I get the following output from `rr ps`:

$ ~/src/rr/bin/rr ps .
PID     PPID    EXIT    CMD
315     --      0       julia-b18d471484/bin/julia -e Base.runtests(["ccall"]; ncores = Sys.CPU_THREADS)
339     315     0       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/../share/julia/test/runtests.jl ccall
367     339     0       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 --startup-file=no -e A = Ref{Cint}(42); finalizer(cglobal((:c_exit_finalizer, "libccalltest"), Cvoid), A)
391     339     1       /cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/bin/julia -Cnative -J/cache/build/amdci7-3/julialang/julia-master/julia-b18d471484/lib/julia/sys.so -g1 --color=no --output-o=/tmp/jl_GWZHvR -e 
        Base.reinit_stdio();
        f() = ccall((:dne, :does_not_exist), Cvoid, ());
        f()

I don't see anything particularly interesting in the `ps` output. Replaying the trace does indeed run slowly. Using a basic `rr dump` analyzer I wrote [2], I found the following for a "slow" trace:

[ Info: Parsing 8min.dump....
[ Info: Parsed 97168 events
[ Info: Top twenty `rr` events by frequency:
`SCHED' => 55035
`SYSCALL: pwrite64' => 9178
`SYSCALLBUF_FLUSH' => 6799
`SYSCALLBUF_RESET' => 6799
`SYSCALL: mmap' => 4276
`SYSCALL: futex' => 2282
`SYSCALL: munmap' => 2224
`SYSCALL: brk' => 2224
`SYSCALL: mincore' => 1916
`SYSCALL: sched_yield' => 1250

And the following for a "fast" trace:

[ Info: Parsing 1min.dump....
[ Info: Parsed 38131 events
[ Info: Top twenty `rr` events by frequency:
`SYSCALL: pwrite64' => 9162
`SYSCALLBUF_FLUSH' => 5804
`SYSCALLBUF_RESET' => 5804
`SYSCALL: mmap' => 3494
`SYSCALL: brk' => 2218
`SYSCALL: munmap' => 2112
`SYSCALL: futex' => 2102
`SYSCALL: mincore' => 1908
`SYSCALL: sched_yield' => 1242
`SYSCALL: epoll_wait' => 576

So it looks like something is causing the "slow" version to have a huge number of SCHED events. I don't know what that means, but the increase in the number of events between the two traces is almost entirely due to the SCHED events.

[0] https://buildkite.com/julialang/julia-master/builds/3631#bd3067d3-e10e-4f46-bb55-8d7d25604ffc
[1] https://buildkite.com/organizations/julialang/pipelines/julia-master/builds/3631/jobs/bd3067d3-e10e-4f46-bb55-8d7d25604ffc/artifacts/347e748a-69f3-4b37-9c0a-ee92625abc07
[2] https://gist.github.com/staticfloat/a123f71ac19c2b4885adbb8aff2f3328
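
For reference, here is a minimal sketch of the kind of per-event frequency count described above. It is not the actual analyzer from [2]; it assumes each record in the `rr dump` text output carries an ``event:`NAME'`` field, and the file name matches the dump used above.

```julia
# Hypothetical sketch of an `rr dump` frequency counter (not the gist in [2]).
# Assumes events are printed as ... event:`SCHED' ... or event:`SYSCALL: mmap' ...
function count_events(path::AbstractString)
    counts = Dict{String,Int}()
    for line in eachline(path)
        m = match(r"event:`([^']+)'", line)
        m === nothing && continue
        name = m.captures[1]
        counts[name] = get(counts, name, 0) + 1
    end
    return counts
end

counts = count_events("8min.dump")
@info "Parsed $(sum(values(counts))) events"
for (name, n) in Iterators.take(sort!(collect(counts); by = last, rev = true), 20)
    println("`$name' => $n")
end
```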

@vchuravy
Member

What is `kernel.perf_event_paranoid` on the buildbots? On AMDCI it should be set to `kernel.perf_event_paranoid = 1`.

@staticfloat
Member

This is all being measured on AMDCI machines, and it is indeed set to 1.

@staticfloat
Member

@Keno your guess about scheduling problems seems to be on the right track. Do you know what a huge number of SCHED events means?

In case you want to look at this yourself, you can check out ~/slow_ccall_debugging on amdci5; it has trace directories for 1-minute and 8-minute runs of the ccall test suite (the fast version was captured manually, the slow one was taken from CI).

@Keno
Member

Keno commented Sep 14, 2021

SCHED means it's running in userspace. With slowdowns this large, what's usually happening is that there's a spinloop somewhere that busy-waits for completion of an event it expects to happen in parallel on another thread, but because no other threads are running it spins for a really long time. See e.g. f7e5f3e.
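
To make that failure mode concrete, here is an illustrative sketch (not code from the test suite; it assumes `JULIA_NUM_THREADS > 1`). Under rr only one thread executes at a time, so a loop that busy-waits for work expected to complete "in parallel" makes no progress until rr preempts it, and every preemption shows up as a SCHED event.

```julia
# Illustrative only: a busy-wait that is cheap on real hardware but pathological
# under rr's single-core serialization. Assumes JULIA_NUM_THREADS > 1.
done = Threads.Atomic{Bool}(false)

Threads.@spawn begin
    # Work the spinning thread expects to complete "in parallel".
    sleep(0.01)
    done[] = true
end

while !done[]
    # Pure spin: under rr this burns a whole timeslice (one SCHED event per
    # preemption) before the other thread ever gets to run. Yielding, or
    # blocking on a condition, would let the scheduler switch immediately.
end
```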

@staticfloat
Member

The bizarre thing is that it doesn't seem to happen when not running on CI. I haven't been able to identify what might be different between the two environments. :(

@Keno
Member

Keno commented Sep 14, 2021

Is there a filter running that blocks `SYS_rrcall_check_presence`, which is one of the high-numbered syscalls?

@staticfloat
Member

I don't think so; as far as I know, we don't filter syscalls in the sandbox. And I've run it inside the same rootfs on the same machine (even with a nested sandbox, like we do on CI) and it stays "fast".

@vtjnash
Member

vtjnash commented Sep 14, 2021

Is Buildkite still misconfigured to run all of the tests with a massive number of threads?

@staticfloat
Member

Buildkite is configured to set JULIA_NUM_THREADS=16, which is much lower than the number of cores in the machine.

@DilumAluthge
Member Author

For rr we can consider lowering the number further, since (if I understand correctly) rr doesn't actually run multithreaded, but instead needs to simulate the multithreading. (Is that correct?)

So on the rr jobs only, we could do JULIA_NUM_THREADS=8 or even JULIA_NUM_THREADS=4.

@DilumAluthge
Member Author

DilumAluthge commented Sep 14, 2021

We could also decrease the maximum possible value of ncores when running the tests with rr.

Right now, when running the tests with rr, we never allow the value of ncores to be greater than 8.

If we want, we could further decrease that upper bound.
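
As a concrete (hypothetical) sketch, based on the `Base.runtests` invocation visible in the `rr ps` output above, tightening the cap would just mean clamping the `ncores` keyword harder:

```julia
# Hypothetical: cap the number of test worker processes under rr below the
# current limit of 8 (the value 4 here is purely illustrative).
rr_worker_cap = 4
Base.runtests(["ccall"]; ncores = min(Sys.CPU_THREADS, rr_worker_cap))
```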

@staticfloat
Member

Okay; so I have confirmed that setting JULIA_NUM_THREADS allows me to reproduce the slowdown, even outside of the sandbox! Huzzah!

The bad news is that, as far as I can tell, setting JULIA_NUM_THREADS shouldn't affect this at all, since the workload is single-threaded.

@DilumAluthge
Member Author

DilumAluthge commented Sep 15, 2021

> Okay; so I have confirmed that setting JULIA_NUM_THREADS allows me to reproduce the slowdown, even outside of the sandbox! Huzzah!

This is good news!

@DilumAluthge
Member Author

The weird part is that we also set JULIA_NUM_THREADS=16 on the non-rr Buildkite jobs (tester_linux64 without rr and tester_linux32 without rr), but we don't see the slowdown on those jobs.

@vtjnash
Member

vtjnash commented Sep 15, 2021

It was single threaded, until you forced it to launch 16x the number of threads intended and turned it into a parallelism benchmark instead.

@staticfloat
Member

> It was single threaded, until you forced it to launch 16x the number of threads intended and turned it into a parallelism benchmark instead.

What is the recommended way to ensure that Julia never launches more than N threads? I believe we added this environment variable because if we didn't, some parts of the test suite would launch as many workers as there are CPUs, which is clearly suboptimal as well.
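
For context on the multiplication being debated here, a rough back-of-the-envelope sketch (assuming, as is normal for environment variables, that each test worker process inherits `JULIA_NUM_THREADS`; the worker count is illustrative):

```julia
# Illustrative arithmetic only, not measured from CI: every worker process
# spawned for the test suite starts JULIA_NUM_THREADS threads of its own,
# so the totals multiply under rr's single-core serialization.
threads_per_worker = 16   # JULIA_NUM_THREADS=16 as set on Buildkite
nworkers = 8              # assumed worker count, e.g. the ncores cap for rr jobs
total_threads = threads_per_worker * nworkers
println("~$total_threads Julia threads competing for one rr-serialized core")
```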

@vtjnash
Member

vtjnash commented Sep 15, 2021

So, in order to ensure it doesn't sometimes launch more than 128 threads, you force it to always launch 256 instead? I don't get this logic.

@vtjnash
Member

vtjnash commented Sep 23, 2021

Closing, as this was fixed in the Buildkite configuration.
