Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI (Buildkite): Run the Distributed test suite with multithreading enabled #42479

Merged
merged 1 commit into from
Oct 15, 2021

Conversation

DilumAluthge
Copy link
Member

No description provided.

@DilumAluthge DilumAluthge added ci Continuous integration backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Oct 3, 2021
@DilumAluthge
Copy link
Member Author

DilumAluthge commented Oct 3, 2021

@vchuravy Would it be feasible to make the Distributed test suite thread-safe? It's the only remaining test set that fails when multi-threading is enabled.

@DilumAluthge
Copy link
Member Author

DilumAluthge commented Oct 3, 2021

The specific test failure is:

Distributed                         (1) |        started at 2021-10-03T03:41:05.590
Test Failed at /cache/build/amdci7-2/julialang/julia-master/julia-bccbf7b7cd/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:146
  Expression: remotecall_fetch((k->begin
                yield()
                haskey(Distributed.PGRP.refs, k)
            end), id, fid) == false
   Evaluated: true == false
ERROR: LoadError: There was an error during testing
in expression starting at /cache/build/amdci7-2/julialang/julia-master/julia-bccbf7b7cd/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:160
Distributed                         (1) |         failed at 2021-10-03T03:41:22.950

@vchuravy
Copy link
Member

vchuravy commented Oct 4, 2021

Making Distributed.jl thread-safe is an ongoing project and there are multiple open PRs trying to address parts of it. It will take a while to get there...

I wonder if this is #42339

@DilumAluthge DilumAluthge force-pushed the dpa/buildkite-distributed-tests-multithreading branch from 6d45646 to a199ef4 Compare October 4, 2021 20:36
@DilumAluthge DilumAluthge changed the title CI (Buildkite): Run the Distributed test suite with multi-threading enabled CI (Buildkite): Run the Distributed test suite with multithreading enabled Oct 4, 2021
@DilumAluthge DilumAluthge force-pushed the dpa/buildkite-distributed-tests-multithreading branch 2 times, most recently from 5a242cc to 76d6c36 Compare October 4, 2021 21:09
@DilumAluthge
Copy link
Member Author

With the following patch, the Distributed tests pass for me locally when multithreading is enabled. I've added that patch to this PR; let's see if the Distributed tests pass with multithreading on CI.

diff --git a/stdlib/Distributed/test/distributed_exec.jl b/stdlib/Distributed/test/distributed_exec.jl
index 32b6b703f3..4063114df3 100644
--- a/stdlib/Distributed/test/distributed_exec.jl
+++ b/stdlib/Distributed/test/distributed_exec.jl
@@ -143,6 +143,8 @@ function test_futures_dgc(id)
     @test fetch(f) == id
     @test f.v !== nothing
     yield(); # flush gc msgs
+    sleep(6);
+    yield();
     @test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, fid) == false


@@ -153,6 +155,8 @@ function test_futures_dgc(id)
     @test f.v === nothing
     finalize(f)
     yield(); # flush gc msgs
+    sleep(6);
+    yield();
     @test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, fid) == false
 end

@@ -243,6 +247,8 @@ function test_remoteref_dgc(id)
     @test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, rrid) == true
     finalize(rr)
     yield(); # flush gc msgs
+    sleep(6);
+    yield();
     @test remotecall_fetch(k->(yield();haskey(Distributed.PGRP.refs, k)), id, rrid) == false
 end
 test_remoteref_dgc(id_me)

@krynju
Copy link
Contributor

krynju commented Oct 4, 2021

Passes locally in multithreaded (-t16) on this branch #42339
Not on master though

Main worker with these threads is involved in that testing, so seems like there's some race on that cleanup

@tkf
Copy link
Member

tkf commented Oct 4, 2021

If you skim Distributed.jl, you can spot a lot of data races (which may or may not be relevant in practice). I wonder if it makes sense to use the experimental pipeline?

@vchuravy
Copy link
Member

vchuravy commented Oct 4, 2021

Yeah making Distributed.jl thread-safe is an ongoing project and an aspiration

@DilumAluthge
Copy link
Member Author

I've made a separate PR with the "loop the test for a couple seconds" suggestion from @vtjnash, with the code from @tkf.

#42499

@vchuravy
Copy link
Member

vchuravy commented Oct 5, 2021

Why does this need to be backported?

@DilumAluthge
Copy link
Member Author

I've been backporting all of the CI PRs to 1.6 and 1.7, because it's the easiest and fastest way to get Buildkite set up across 1.6, 1.7, and master. And if we backport all the PRs, everything backports automatically and cleanly, and there's no need for manual intervention.

Eventually, we will stop backporting CI PRs, and we will start maintaining separate Buildkite configurations on each of the release branches. But I'd like to delay that as much as possible.

But, most importantly, we're not going to start running the Distributed test suite with multithreading on 1.6 or 1.7. We'll only do so starting with 1.8.

The relevant bit from the Buildkite config:

if [[ "$${BUILDKITE_PIPELINE_SLUG:?}"   == "julia-release-1-dot-6" ]]; then
  # On Julia 1.6, the Distributed test suite is expected to fail when multithreading is enabled.
  $${JULIA_BINARY:?} -e 'Base.runtests(["all", "--skip", "Distributed"]; ncores = Sys.CPU_THREADS)'
elif [[ "$${BUILDKITE_PIPELINE_SLUG:?}" == "julia-release-1-dot-7" ]]; then
  # On Julia 1.7, the Distributed test suite is expected to fail when multithreading is enabled.
  $${JULIA_BINARY:?} -e 'Base.runtests(["all", "--skip", "Distributed"]; ncores = Sys.CPU_THREADS)'
else
  $${JULIA_BINARY:?} -e 'Base.runtests(["all"]; ncores = Sys.CPU_THREADS)'
fi

@DilumAluthge DilumAluthge removed the request for review from a team October 5, 2021 21:11
@DilumAluthge DilumAluthge marked this pull request as draft October 5, 2021 21:42
@DilumAluthge DilumAluthge marked this pull request as ready for review October 7, 2021 20:14
@DilumAluthge DilumAluthge force-pushed the dpa/buildkite-distributed-tests-multithreading branch from 4c8e6c2 to 6a0dd48 Compare October 13, 2021 03:38
@DilumAluthge DilumAluthge marked this pull request as draft October 13, 2021 05:57
@DilumAluthge DilumAluthge marked this pull request as ready for review October 14, 2021 22:33
@DilumAluthge DilumAluthge force-pushed the dpa/buildkite-distributed-tests-multithreading branch from 6a0dd48 to 7d09a0e Compare October 14, 2021 22:33
@DilumAluthge
Copy link
Member Author

DilumAluthge commented Oct 15, 2021

So, making the entire Distributed.jl stdlib thread-safe is definitely a long-term project.

This PR is much narrower. The goal of this PR is only to run the Distributed test suite in a setting in which Threads.nthreads() is greater than one.

It looks like the only change necessary to make this work was #42499. The Distributed test suite is now passing on linux64 with export JULIA_NUM_THREADS=16.

I say we merge this and see how it goes. If it ends up causing problems, we can revisit it.


Note: we only run the Distributed test suite with export JULIA_NUM_THREADS=16 on Julia 1.8+. On Julia 1.6 and Julia 1.7, we will skip the Distributed test suite when export JULIA_NUM_THREADS=16.

Note however that we will always run the Distributed test suite when export JULIA_NUM_THREADS=1, regardless of the Julia version.

@DilumAluthge DilumAluthge merged commit 72ec349 into master Oct 15, 2021
@DilumAluthge DilumAluthge deleted the dpa/buildkite-distributed-tests-multithreading branch October 15, 2021 03:04
KristofferC pushed a commit that referenced this pull request Oct 18, 2021
KristofferC pushed a commit that referenced this pull request Oct 19, 2021
KristofferC pushed a commit that referenced this pull request Nov 11, 2021
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Nov 13, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants