-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI (Buildkite): Run the Distributed test suite with multithreading enabled #42479
CI (Buildkite): Run the Distributed test suite with multithreading enabled #42479
Conversation
@vchuravy Would it be feasible to make the Distributed test suite thread-safe? It's the only remaining test set that fails when multi-threading is enabled. |
The specific test failure is:
|
Making Distributed.jl thread-safe is an ongoing project and there are multiple open PRs trying to address parts of it. It will take a while to get there... I wonder if this is #42339 |
6d45646
to
a199ef4
Compare
5a242cc
to
76d6c36
Compare
With the following patch, the Distributed tests pass for me locally when multithreading is enabled. I've added that patch to this PR; let's see if the Distributed tests pass with multithreading on CI. diff --git a/stdlib/Distributed/test/distributed_exec.jl b/stdlib/Distributed/test/distributed_exec.jl
index 32b6b703f3..4063114df3 100644
--- a/stdlib/Distributed/test/distributed_exec.jl
+++ b/stdlib/Distributed/test/distributed_exec.jl
@@ -143,6 +143,8 @@ function test_futures_dgc(id)
@test fetch(f) == id
@test f.v !== nothing
yield(); # flush gc msgs
+ sleep(6);
+ yield();
@test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, fid) == false
@@ -153,6 +155,8 @@ function test_futures_dgc(id)
@test f.v === nothing
finalize(f)
yield(); # flush gc msgs
+ sleep(6);
+ yield();
@test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, fid) == false
end
@@ -243,6 +247,8 @@ function test_remoteref_dgc(id)
@test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, rrid) == true
finalize(rr)
yield(); # flush gc msgs
+ sleep(6);
+ yield();
@test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, rrid) == false
end
test_remoteref_dgc(id_me) |
Passes locally in multithreaded (-t16) on this branch #42339 Main worker with these threads is involved in that testing, so seems like there's some race on that cleanup |
If you skim Distributed.jl, you can spot a lot of data races (which may or may not be relevant in practice). I wonder if it makes sense to use the experimental pipeline? |
Yeah making Distributed.jl thread-safe is an ongoing project and an aspiration |
560c7fd
to
4c8e6c2
Compare
Why does this need to be backported? |
I've been backporting all of the CI PRs to 1.6 and 1.7, because it's the easiest and fastest way to get Buildkite set up across 1.6, 1.7, and master. And if we backport all the PRs, everything backports automatically and cleanly, and there's no need for manual intervention. Eventually, we will stop backporting CI PRs, and we will start maintaining separate Buildkite configurations on each of the release branches. But I'd like to delay that as much as possible. But, most importantly, we're not going to start running the Distributed test suite with multithreading on 1.6 or 1.7. We'll only do so starting with 1.8. The relevant bit from the Buildkite config: if [[ "$${BUILDKITE_PIPELINE_SLUG:?}" == "julia-release-1-dot-6" ]]; then
# On Julia 1.6, the Distributed test suite is expected to fail when multithreading is enabled.
$${JULIA_BINARY:?} -e 'Base.runtests(["all", "--skip", "Distributed"]; ncores = Sys.CPU_THREADS)'
elif [[ "$${BUILDKITE_PIPELINE_SLUG:?}" == "julia-release-1-dot-7" ]]; then
# On Julia 1.7, the Distributed test suite is expected to fail when multithreading is enabled.
$${JULIA_BINARY:?} -e 'Base.runtests(["all", "--skip", "Distributed"]; ncores = Sys.CPU_THREADS)'
else
$${JULIA_BINARY:?} -e 'Base.runtests(["all"]; ncores = Sys.CPU_THREADS)'
fi |
4c8e6c2
to
6a0dd48
Compare
6a0dd48
to
7d09a0e
Compare
So, making the entire Distributed.jl stdlib thread-safe is definitely a long-term project. This PR is much narrower. The goal of this PR is only to run the Distributed test suite in a setting in which It looks like the only change necessary to make this work was #42499. The Distributed test suite is now passing on linux64 with I say we merge this and see how it goes. If it ends up causing problems, we can revisit it. Note: we only run the Distributed test suite with Note however that we will always run the Distributed test suite when |
No description provided.