Describe the bug
I'm performing matrix multiplication with multiple workers on a single GPU, using Distributed. I limit CUDA's memory usage via the environment variables described in the documentation, but oddly, these variables only seem to take effect with large matrices (1024x1024). The code below appears to ignore the memory restriction, uses all available GPU memory, and eventually OOMs, due to what I believe is a race condition. Changing the 128 to 1024 in the code below causes each process to restrict itself to roughly 2x the memory limit (around 10%), which prevents the OOM on my machine.
To reproduce
using Distributed
env = [
"JULIA_CUDA_HARD_MEMORY_LIMIT"=>"5%",
"JULIA_CUDA_MEMORY_POOL"=>"none"
]
n_workers = 6
addprocs(n_workers, env=env)

@everywhere begin
    using CUDA

    function matrix_multiply_on_gpu(worker_id)
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)
        C = A * B
        return sum(C)
    end
end

for i in 1:100_000
    pmap(matrix_multiply_on_gpu, 1:n_workers)
end
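While the loop runs, the memory growth is visible from any worker by querying the device's free memory. Here is a minimal sketch I use to watch it (the report_memory helper is mine, not part of the reproducer; note CUDA.available_memory() reports device-wide free memory, not a per-process figure):

@everywhere function report_memory(worker_id)
    # Device-wide free/total memory as seen by this worker (not per-process).
    free = CUDA.available_memory()
    total = CUDA.total_memory()
    return (worker_id, 100 * (1 - free / total))
end

for (id, pct) in pmap(report_memory, 1:n_workers)
    println("worker $id sees the device at $(round(pct; digits=1))% used")
end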
Manifest.toml
Using an environment with only CUDA.jl#master installed; I will update with a Manifest.toml if needed.
Expected behavior
I expect each process to limit itself to 5% of the GPU memory, or at least to some fixed maximum amount.
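For scale: each 128x128 Float32 matrix is only 64 KiB, so even a 5% cap should comfortably hold thousands of these buffers at once. A quick back-of-the-envelope check (my sketch, not part of the reproducer):

limit_bytes = 0.05 * CUDA.total_memory()    # the 5% hard limit, in bytes
buf_bytes = 128 * 128 * sizeof(Float32)     # 65536 bytes = 64 KiB per matrix
println("buffers that fit under the limit: ", floor(Int, limit_bytes / buf_bytes))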
Version info
Details on Julia:
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
LD_LIBRARY_PATH =
Additional context
This is for a research project where I want to distribute a workload across many processes on multiple machines, with each process using a small slice of one of the GPUs.
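The eventual setup would look roughly like this (hostnames and worker counts are placeholders, not the actual cluster; env is documented to be forwarded to the remote workers):

using Distributed
env = ["JULIA_CUDA_HARD_MEMORY_LIMIT" => "5%", "JULIA_CUDA_MEMORY_POOL" => "none"]
# One tuple per machine: (hostname, number of workers to start there).
addprocs([("gpu-node-1", 6), ("gpu-node-2", 6)], env=env)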