Fix advice on how to run code on multiple GPUs #1506
The old way should not cause segfaults (but we should probably use |
Interesting. It reliably segfaults for me on a workload I'm testing with ITensorGPU.jl with the old way, switching to This is probably some weird thing about ITensorGPU.jl then. The only real difference I can find between these various forms is this:
julia> @sync begin
           @async device!(0) do
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async device!(1) do
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(0)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(0)
versus
julia> @sync begin
           @async begin
               device!(0)
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async begin
               device!(1)
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(1)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(1)
I would argue that both behaviours are worryingly incorrect, but maybe the way the incorrect behaviour manifests in the second case interacts with ITensorGPU.jl in a more dangerous way? I'm not sure. |
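For readers without a multi-GPU machine, the general hazard with nested tasks can be sketched in plain Julia. This is only an illustration under stated assumptions: it uses Base's `task_local_storage` with a hypothetical key `:dev`, involves no CUDA.jl code, and does not claim to reproduce how CUDA.jl stores its device state. It shows that a value set in task-local storage by an outer task is not visible in a task spawned from it.

```julia
# Plain-Julia sketch: `:dev` is a hypothetical key, not CUDA.jl's real state.
results = Dict{Symbol,Any}()

@sync begin
    @async begin
        # Set a task-local value in the outer task and read it back.
        task_local_storage(:dev, 1)
        results[:outer] = get(task_local_storage(), :dev, nothing)

        @sync @async begin
            # The nested task starts with its own, empty task-local storage,
            # so the value set by the parent is not found here.
            results[:inner] = get(task_local_storage(), :dev, nothing)
        end
    end
end

@show results[:outer] results[:inner]
```

In this sketch the outer task reads back `1` while the nested task gets `nothing`, which is why any per-task setting has to be re-established (or passed explicitly) inside each nested task.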
Maybe this is an argument in favour of switching to ContextVariablesX.jl while we wait for JuliaLang/julia#35833?
julia> using ContextVariablesX
julia> @contextvar dev = 0;
julia> @sync begin
           @async with_context(dev => 0) do
               outer_dev0 = dev[]
               @async begin
                   sleep(0.5)
                   inner_dev0 = dev[]
                   @show outer_dev0 inner_dev0
               end
           end
           @async with_context(dev => 1) do
               outer_dev1 = dev[]
               @async begin
                   sleep(1)
                   inner_dev1 = dev[]
                   @show outer_dev1 inner_dev1
               end
           end
       end
outer_dev0 = 0
inner_dev0 = 0
outer_dev1 = 1
inner_dev1 = 1
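A similar inheritance pattern can be sketched with `Base.ScopedValues`, assuming a Julia version that ships it (1.11 or newer). This is not part of the proposal above and says nothing about what CUDA.jl does internally; `dev` is a hypothetical stand-in for a device index, and `seen` is just a helper for recording what each task observes.

```julia
using Base.ScopedValues   # assumption: Julia 1.11 or newer

dev = ScopedValue(0)      # hypothetical stand-in for a device index
seen = Dict{Symbol,Int}()

@sync begin
    @async with(dev => 1) do
        seen[:outer] = dev[]
        # A task created inside the dynamic scope inherits the scoped value.
        @sync @async begin
            sleep(0.1)
            seen[:inner] = dev[]
        end
    end
end

@show seen[:outer] seen[:inner]
```

Here both the outer and the nested task observe `1`, matching the behaviour shown with `with_context` above.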
|
Codecov Report
@@            Coverage Diff             @@
##           master    #1506      +/-   ##
==========================================
- Coverage   73.35%   72.68%   -0.68%
==========================================
  Files         131      131
  Lines        9825     9825
==========================================
- Hits         7207     7141      -66
- Misses       2618     2684      +66
|
Undefined reference exceptions from CUDA.jl? That sounds like a bug, please file an issue. About this issue, setting the device as the first operation in a task should be identical to the ContextVariables would be better, but these task-local look-ups are pretty performance sensitive, so I don't want to penalize them. |
They're not coming from CUDA; they're coming from ITensorGPU.jl. I'll file an issue there. |
The old version caused segfaults on my multi-GPU setup but this one works fine.