Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Openmpi bugs #31

Closed
wants to merge 63 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
23b59d3
Keeping only the suspected faulty test (by commenting out rest (!))
alexandrebouchard Mar 6, 2023
40001d0
Removing Turing from test to isolate only one error at the time
alexandrebouchard Mar 6, 2023
ff5aac9
Check behaviour on different threadlevels
alexandrebouchard Mar 6, 2023
d00ff53
Back to same threadlevel as not changing crashing behaviour
alexandrebouchard Mar 6, 2023
69bb270
More conservative tag ub?
alexandrebouchard Mar 6, 2023
d94252f
Revert "More conservative tag ub?"
alexandrebouchard Mar 6, 2023
bcf016d
Trying on OpenMPI instead of mpich
alexandrebouchard Mar 6, 2023
379df4b
Going back to mpich, trying threadlevel = :multiple
alexandrebouchard Mar 7, 2023
2cd3571
Back to funneled as :multiple still crashes
alexandrebouchard Mar 7, 2023
c8a0261
Further simplification
alexandrebouchard Mar 7, 2023
6217650
Candidate fix
alexandrebouchard Mar 8, 2023
1f16936
Reintroducing other tests
alexandrebouchard Mar 8, 2023
0ca198f
purge tests from Turing, use only DynamicPPL
miguelbiron Mar 8, 2023
ee276fc
Another free
alexandrebouchard Mar 8, 2023
491c26b
Trying to run tests with more rounds
alexandrebouchard Mar 8, 2023
fab7934
system MPI test
miguelbiron Mar 9, 2023
2ed0aae
fix typos
miguelbiron Mar 9, 2023
c271261
remove step that does not make sense outside MPI.jl
miguelbiron Mar 9, 2023
e3f569a
setup MPIPreferences in the test env
miguelbiron Mar 9, 2023
77824ea
possible fix to Pkg missing
miguelbiron Mar 9, 2023
c69b76a
possible fix to Pkg missing
miguelbiron Mar 9, 2023
523f9c1
add missing env
miguelbiron Mar 9, 2023
949b628
another try
miguelbiron Mar 9, 2023
895a45f
test now has its own Project.toml
miguelbiron Mar 9, 2023
13c9cd2
force Pkg.instantiate in ChildProcess
miguelbiron Mar 9, 2023
1772d82
dont use --project in ChildProcess
miguelbiron Mar 9, 2023
bdf71ce
ChildProcess inherits the exact same active project
miguelbiron Mar 9, 2023
2bcd825
new approach
miguelbiron Mar 9, 2023
eadbfcd
Changing approach for Isend/isend: explicit Waitall for all requests
alexandrebouchard Mar 9, 2023
57d0921
Is this a Turing multi-threading issue?
alexandrebouchard Mar 9, 2023
63aad97
Testing single-thread for all
alexandrebouchard Mar 9, 2023
7a8de7e
Adding a temporary "dry run" to see if problem cause by some compilation
alexandrebouchard Mar 9, 2023
5812127
Commenting out Turing and Blang, to see behaviour on toy_mvn and swap…
alexandrebouchard Mar 9, 2023
270333f
openmpi_jll test
miguelbiron Mar 9, 2023
2b49008
openmpi_jll test
miguelbiron Mar 9, 2023
9c6497f
openmpi_jll test
miguelbiron Mar 9, 2023
d68dc6c
openmpi_jll test
miguelbiron Mar 9, 2023
38de7cf
more messages from Child processes
miguelbiron Mar 9, 2023
8537307
extra arg to OpenMPI
miguelbiron Mar 9, 2023
c2b732b
rm mpi version query
miguelbiron Mar 9, 2023
0e82a84
use old julia_cmd + rm manifest as in MPI.jl CI
miguelbiron Mar 9, 2023
82a4210
add --oversubscribe for OpenMPI + rethrow exception
miguelbiron Mar 9, 2023
700c712
simplify logging
miguelbiron Mar 9, 2023
ebcb3a6
force instantiate + precompile
miguelbiron Mar 9, 2023
ad69b34
Add mpi_args to mpi_test
alexandrebouchard Mar 10, 2023
6bde8fe
Temporary: trying to speed up some key tests
alexandrebouchard Mar 10, 2023
7b5e7d4
Fix
alexandrebouchard Mar 10, 2023
fc34b85
Fix the fix + reintroducing the system-MPI tests
alexandrebouchard Mar 10, 2023
5ba839e
Add back libmpich-dev to resume investigation on ghostbug
alexandrebouchard Mar 10, 2023
2b37c0f
Trying to simplify CI setup needed to reproduce ghostbug
alexandrebouchard Mar 10, 2023
0847901
Fix last commit
alexandrebouchard Mar 10, 2023
f75a280
toy_mvn not enough to manifest ghostbug, trying Turing
alexandrebouchard Mar 10, 2023
93033fc
test mpich+openmpi using brew
miguelbiron Mar 10, 2023
d46d429
fix wrong abi detection for mpich
miguelbiron Mar 10, 2023
7a11074
move xtra args to MPI struct + remove prints + failsafe for empty cur…
miguelbiron Mar 10, 2023
d89911a
mpiexec args for childprocess
miguelbiron Mar 10, 2023
aa233e8
re-introduce all CI tests + fix bug in building mpi cmd
miguelbiron Mar 11, 2023
42993fd
add support for using without Project.toml
miguelbiron Mar 11, 2023
42bdd75
add comment explaining why we wait on Isend
miguelbiron Mar 11, 2023
c5cb70a
mpiexec_args is a Cmd now
miguelbiron Mar 11, 2023
9e31b49
re-instate all tests
miguelbiron Mar 11, 2023
f93548d
Do we have the same bug with 1 thread
alexandrebouchard Mar 12, 2023
495b4eb
Explicitly disable GC in all pigeons runs
alexandrebouchard Mar 12, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
153 changes: 150 additions & 3 deletions .github/workflows/CI.yml
Original file line number Diff line number Diff line change
@@ -1,17 +1,22 @@
name: CI

on:
push:
branches:
- main
tags: ['*']
pull_request:

concurrency:
# Skip intermediate builds: always.
# Cancel intermediate builds: only if it is a pull request build.
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: ${{ startsWith(github.ref, 'refs/pull/') }}

jobs:
test:

# default test
test-MPICH-jll:
name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
runs-on: ${{ matrix.os }}
strategy:
Expand All @@ -27,7 +32,7 @@ jobs:
arch:
- x64
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- uses: actions/setup-java@v3
with:
distribution: 'temurin'
Expand All @@ -43,13 +48,155 @@ jobs:
- uses: codecov/codecov-action@v2
with:
files: lcov.info


# test OpenMPI by requesting it with MPIPreferences
# adapted from MPI.jl
test-OpenMPI-jll:
name: Julia OpenMPI_jll - ${{ github.event_name }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
version:
- '1.8'
os:
- ubuntu-latest
arch:
- x64

fail-fast: false
env:
JULIA_MPI_TEST_BINARY: OpenMPI_jll
JULIA_MPI_TEST_ABI: OpenMPI
steps:
- name: Checkout
uses: actions/checkout@v3

- uses: julia-actions/setup-julia@latest
with:
version: ${{ matrix.version }}
arch: ${{ matrix.arch }}
- uses: julia-actions/cache@v1

- name: use OpenMPI_jll
shell: julia --color=yes --project=test {0}
run: |
using Pkg
Pkg.instantiate()
using MPIPreferences
MPIPreferences.use_jll_binary("OpenMPI_jll")
rm("test/Manifest.toml")

- uses: julia-actions/julia-runtest@latest

# test system MPI using Brew in macOS
# adapted from MPI.jl
test-system-MPI-brew:
name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.mpi }} - ${{ github.event_name }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
version:
- '1.8'
os:
- macos-latest
mpi:
- mpich
- openmpi
env:
JULIA_MPI_TEST_BINARY: system
ZES_ENABLE_SYSMAN: 1 # https://github.com/open-mpi/ompi/issues/10142
steps:
- uses: actions/checkout@v3

- name: Install MPI via homebrew
run: brew install $MPI
env:
MPI: ${{ matrix.mpi }}

- uses: actions/setup-java@v3
with:
distribution: 'temurin'
java-version: '11'

- uses: julia-actions/setup-julia@v1
with:
version: ${{ matrix.version }}
arch: x64

- uses: julia-actions/cache@v1

- name: use system MPI
shell: julia --color=yes --project=test {0}
run: |
using Pkg
Pkg.instantiate()
using MPIPreferences
MPIPreferences.use_system_binary()
run(`sed -i.bu 's/unknown/MPICH/' test/LocalPreferences.toml`) # fix wrong abi detection for mpich
rm("test/Manifest.toml")

- uses: julia-actions/julia-runtest@v1

# # test system MPI using apt in ubuntu
# # adapted from MPI.jl
# # TODO: commented out because apt has older versions of MPICH and OMPI that
# # segfault with multithreading. Re-introduce them when apt pkgs are upgraded
# test-system-MPI-apt:
# name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.mpi }} - ${{ github.event_name }}
# runs-on: ${{ matrix.os }}
# strategy:
# fail-fast: false
# matrix:
# version:
# - '1.8'
# os:
# - ubuntu-latest
# mpi:
# - libmpich-dev
# env:
# JULIA_MPI_TEST_BINARY: system
# steps:
# - uses: actions/checkout@v3

# - name: Install MPI via apt
# run: |
# sudo apt-get update
# sudo apt-get install $MPI
# env:
# MPI: ${{ matrix.mpi }}

# - uses: actions/setup-java@v3
# with:
# distribution: 'temurin'
# java-version: '11'

# - uses: julia-actions/setup-julia@v1
# with:
# version: ${{ matrix.version }}
# arch: x64

# - uses: julia-actions/cache@v1

# - name: use system MPI
# shell: julia --color=yes --project=test {0}
# run: |
# using Pkg
# Pkg.instantiate()
# using MPIPreferences
# MPIPreferences.use_system_binary()
# rm("test/Manifest.toml")

# - uses: julia-actions/julia-runtest@v1

docs:
name: Documentation
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- uses: actions/setup-java@v3
with:
distribution: 'temurin'
Expand Down
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@
.vscode/settings.json
build
.interfaces.md

*.log
*.err
machines.txt
results
.includes_bu.jl
Expand Down
6 changes: 0 additions & 6 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -52,9 +52,3 @@ SpecialFunctions = "2"
SplittableRandoms = "0.1"
StatsBase = "0.33"
julia = "1.6"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
2 changes: 1 addition & 1 deletion src/Pigeons.jl
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ import MPI: Comm, Allreduce, Comm_rank,
Comm_dup, Request, Waitall,
RequestSet, mpiexec, Allreduce,
Allgather, Comm_split, isend, recv,
bcast, tag_ub
bcast, tag_ub


using Base: Forward
Expand Down
20 changes: 15 additions & 5 deletions src/mpi_utils/Entangler.jl
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ mutable struct Entangler
println("Entangler initialized 1 process (without MPI); $(Threads.nthreads())")
end
else
Init(threadlevel = :funneled)
init_mpi()
comm = Comm_dup(parent_communicator)
transmit_counter_bound = ceil(Int, tag_ub() / n_global_indices - 2)
my_process_index = Comm_rank(comm) + 1
Expand Down Expand Up @@ -101,10 +101,12 @@ mpi_active() =
if silence_mpi[]
false
else
Init(threadlevel = :funneled)
init_mpi()
Comm_size(COMM_WORLD) > 1
end

init_mpi() = Init(threadlevel = :funneled)

"""
$SIGNATURES

Expand Down Expand Up @@ -155,6 +157,8 @@ function transmit!(e::Entangler, source_data::AbstractVector{T}, to_global_indic
# indicators of whether each local index is to be received over MPI
e.current_received_bits .= true
at_least_one_mpi = false

requests = RequestSet() # non-blocking requests that will be waited on

# send (or copy if local)
for local_index in 1:myload
Expand All @@ -172,13 +176,15 @@ function transmit!(e::Entangler, source_data::AbstractVector{T}, to_global_indic
source_view = Ref{T}(source_datum)
mpi_rank = process_index - 1
# asynchronously (non-blocking) send over MPI:
Isend(source_view, e.communicator, dest = mpi_rank, tag = tag(e, transmit_index, global_index))
# note: we wait for the Isend request to avoid the application
# terminating in the last iteration without completing its request.
request = Isend(source_view, e.communicator, dest = mpi_rank, tag = tag(e, transmit_index, global_index))
push!(requests, request)
end
end

# receive
if at_least_one_mpi
requests = RequestSet()
my_globals = my_global_indices(e.load)
for local_index in 1:myload
if e.current_received_bits[local_index]
Expand Down Expand Up @@ -238,6 +244,8 @@ function reduce_deterministically(operation, source_data::AbstractVector{T}, e::
# outer loop is over the levels of a binary tree over the global indices
iteration = 1

requests = RequestSet()

while n_remaining_to_reduce > 1
transmit_index = next_transmit_index!(e)
current_local = my_first_remaining_local
Expand All @@ -251,7 +259,8 @@ function reduce_deterministically(operation, source_data::AbstractVector{T}, e::
dest_global_index = current_global - spacing
dest_process = find_process(e.load, dest_global_index)
dest_rank = dest_process - 1
isend(work_array[current_local], e.communicator; dest = dest_rank, tag = tag(e, transmit_index, iteration))
request = isend(work_array[current_local], e.communicator; dest = dest_rank, tag = tag(e, transmit_index, iteration))
push!(requests, request)
current_local += spacing
did_send = true
elseif current_global + spacing ≤ e.load.n_global_indices
Expand All @@ -274,6 +283,7 @@ function reduce_deterministically(operation, source_data::AbstractVector{T}, e::

if did_send
my_first_remaining_local += spacing
Waitall(requests)
end
n_global_indices_remaining_before = ceil(Int, n_global_indices_remaining_before/2)
spacing = spacing * 2
Expand Down
2 changes: 2 additions & 0 deletions src/pt/pigeons.jl
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ This will also call [`report()`](@ref), [`write_checkpoint()`](@ref),
and [`run_checks()`](@ref) between rounds.
"""
function pigeons(pt::PT)
@warn "TEMPORARY - GC DISABLED FOR TESTING"
GC.enable(false)
preflight_checks(pt)
while next_round!(pt) # NB: while-loop instead of for-loop to support resuming from checkpoint
reduced_recorders = run_one_round!(pt)
Expand Down
35 changes: 23 additions & 12 deletions src/submission/ChildProcess.jl
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ $FIELDS
(if of type `String`) needed by the child
process.
"""
dependencies::Vector{Any} = []
dependencies::Vector = []
# eventually, detect & save which
# modules should be loaded? E.g. could use
# https://stackoverflow.com/questions/25575406/list-of-loaded-imported-packages-in-julia
Expand All @@ -34,15 +34,22 @@ $FIELDS
third-party target distribution which somehow
does not support multi-threading.
"""
n_local_mpi_processes = 1
n_local_mpi_processes::Int = 1

"""
If wait is false, the process runs asynchronously.
When wait is false, the process' I/O streams are directed to devnull.
"""
wait = true
wait::Bool = true

"""
Extra arguments passed to mpiexec.
"""
mpiexec_args::Cmd = ``
end



"""
$SIGNATURES

Expand All @@ -62,21 +69,26 @@ function pigeons(pt_arguments, new_process::ChildProcess)
run(julia_cmd, wait = new_process.wait)
else
mpiexec() do exe
mpi_cmd = `$exe -n $(new_process.n_local_mpi_processes)`
cmd = `$mpi_cmd $julia_cmd`
mpi_cmd = `$exe $(new_process.mpiexec_args) -n $(new_process.n_local_mpi_processes)`
cmd = `$mpi_cmd $julia_cmd`
run(cmd, wait = new_process.wait)
end
end
return Result{PT}(exec_folder)
end

function launch_cmd(pt_arguments, exec_folder, dependencies, n_threads::Int, silence_mpi::Bool)
julia_bin = Base.julia_cmd()
script_path = launch_script(pt_arguments, exec_folder, dependencies, silence_mpi)
return `$julia_bin
--project
--threads=$n_threads
$script_path`
script_path = launch_script(pt_arguments, exec_folder, dependencies, silence_mpi)
jl_cmd = Base.julia_cmd()
project_file = Base.current_project()
if !isnothing(project_file)
# forcing instantiate the project to make sure dependencies exist
# also, precompile to avoid issues with coordinating access to compile cache
project_dir = dirname(project_file)
jl_cmd = `$jl_cmd --project=$project_dir`
run(`$jl_cmd -e "using Pkg; Pkg.instantiate(); Pkg.precompile()"`)
end
return `$jl_cmd --threads=$n_threads $script_path`
end

function launch_script(pt_arguments, exec_folder, dependencies, silence_mpi)
Expand Down Expand Up @@ -129,7 +141,6 @@ function launch_code(

Pigeons.deserialize_immutables(raw"$path_to_serialized_immutables")
pt_arguments = deserialize(raw"$path_to_serialized_pt_arguments")

pt = PT(pt_arguments, exec_folder = raw"$exec_folder")
pigeons(pt)
"""
Expand Down
Loading