
Out of Memory when working with Distributed for Small Matrices #2548

Open
jarbus opened this issue Nov 7, 2024 · 0 comments
Labels
bug Something isn't working

Comments

jarbus commented Nov 7, 2024

Describe the bug

I'm performing matrix multiplication on a GPU across multiple workers using Distributed. I limit CUDA's memory usage via the environment variables described in the documentation, but oddly, these variables only seem to take effect with large 1024x1024 matrices. With 128x128 matrices, the code below appears to ignore the memory restrictions, consumes all available GPU memory, and eventually OOMs, which I believe is due to a race condition. Changing the 128 to 1024 in the code below causes each process to restrict itself to roughly 2x the memory limit (around 10%), which prevents the OOM on my machine.
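For scale, here is my own back-of-envelope arithmetic (my numbers, not anything from the docs, assuming the 16 GiB card below): the per-task working set is tiny compared to the 5% cap, so the limit should not be the constraint here.

```julia
# Back-of-envelope sizing (hypothetical figures for a ~16 GiB card):
limit_bytes  = 0.05 * 16 * 2^30             # 5% hard limit ≈ 0.8 GiB
matrix_bytes = 128 * 128 * sizeof(Float32)  # one 128x128 Float32 matrix = 64 KiB
per_task     = 3 * matrix_bytes             # A, B, and C alive at once ≈ 192 KiB
println((limit_bytes, matrix_bytes, per_task))
```

So each task needs on the order of hundreds of KiB, four orders of magnitude under the cap, which is why the growth to a full OOM looks like a reclamation/accounting issue rather than genuine demand.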

To reproduce

using Distributed

# Per-worker CUDA settings: disable the memory pool and cap GPU memory at 5%.
env = [
    "JULIA_CUDA_HARD_MEMORY_LIMIT" => "5%",
    "JULIA_CUDA_MEMORY_POOL" => "none",
]

n_workers = 6
addprocs(n_workers; env=env)

@everywhere begin
    using CUDA

    # Multiply two small random matrices on the GPU and reduce to a scalar.
    function matrix_multiply_on_gpu(worker_id)
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)
        C = A * B
        return sum(C)
    end
end

for i in 1:100_000
    pmap(matrix_multiply_on_gpu, 1:n_workers)
end
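A mitigation I've been sketching (my own assumption, not something the documentation prescribes): with the pool disabled, dead CuArrays are only freed when Julia's GC happens to run, so explicitly collecting and reclaiming on every worker between batches might keep usage bounded.

```julia
# Hypothetical workaround sketch: periodically force collection on every
# worker. GC.gc() finalizes dead CuArrays; CUDA.reclaim() returns cached
# GPU memory to the driver. This assumes the OOM stems from lazy reclamation.
for i in 1:100_000
    pmap(matrix_multiply_on_gpu, 1:n_workers)
    if i % 100 == 0
        @everywhere begin
            GC.gc()
            CUDA.reclaim()
        end
    end
end
```

The interval of 100 iterations is arbitrary; the trade-off is GC overhead versus how far memory can drift between collections.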
Manifest.toml

Using an environment with only CUDA.jl#master installed; will update with a Manifest.toml if needed.

Expected behavior

I expect each process to limit itself to 5% of the GPU memory, or at least to some maximum amount.

Version info

Details on Julia:


Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  LD_LIBRARY_PATH = 

Details on CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A4000 (sm_86, 1.528 GiB / 15.992 GiB available)

Additional context

This is for a research project, where I want to distribute a workload across many processes on multiple machines, each process utilizing a small amount of one of the GPUs.

@jarbus jarbus added the bug Something isn't working label Nov 7, 2024