OPENBLAS error in cuda_4.3.3.sif #820
Comments
Full error output is below: r: 2 *** caught segfault *** Traceback:
@xc308 Can you try to limit by setting See also https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp
and search for similar issues at https://github.com/OpenMathLib/OpenBLAS/issues.
@benz0li Hi, I use R not python. I run my Rscript using apptainer
https://apptainer.org/docs/user/main/environment_and_metadata.html
Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?
Yeah, this is what I was just reading, and I think I managed to solve the problem. apptainer exec --nv --env OPENBLAS_NUM_THREADS=1 ../cuda_4.3.3.sif Rscript hello.R Now the code has been running for almost 1 hour with no error so far.
I use cuda_4.3.3.sif. I'm not sure what the R version is, but it should be the most recent one.
I also tried setting OPENBLAS_NUM_THREADS to 5 and 10, but both gave the same error. Do you know why only OPENBLAS_NUM_THREADS=1 works? And what will be the impact of setting it to 1?
You are using R v4.3.3, then. But what packages are you loading with
I load
See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.
Ah, thank you very much for this useful information! I used torch_get_num_interop_threads() and torch_get_num_threads() and obtained 72 inter-op threads and 36 intra-op threads. However, I don't entirely understand this given my slurm parameter settings: When inter op has grabbed all the threads I requested (72 threads), why would the intra op still have 36 threads?
Each CPU has 2 threads on my HPC by the way.
In addition, I'm wondering whether the problem lies with the 72 inter-op threads. Suppose my algorithm has 50 steps: the first 25 steps are large matrix multiplications done only on CPU, while the remaining 25 steps are offloaded to GPU. I'm not sure whether the OPENBLAS error occurs because inter op has grabbed all the available threads (72) I requested, so there are no threads left for OPENBLAS to run the routine matrix multiplications in parallel over different CPUs. If this understanding is correct, then instead of forcing OPENBLAS to work on a single CPU by setting the env variable OPENBLAS_NUM_THREADS=1, which will be very slow for the first 25 steps of large matrix multiplications done on CPU, I could reduce the number of inter-op threads, as there are not many tasks to be parallelized (ntasks-per-node=1), and spare more CPUs for OPENBLAS. Please kindly advise. Thank you very much in advance!
I decreased the number of interop threads to 2 (default is 72) and intra op threads to 18 (default is 36), and set the env variable OPENBLAS_NUM_THREADS = 2, but still got the same error.
I also set OMP_NUM_THREADS=2 on top of OPENBLAS_NUM_THREADS = 2, with interop threads = 2 and intra threads = 18, but still got the same error.
I set interop threads = 2, intra threads = 2, OMP_NUM_THREADS=2, OPENBLAS_NUM_THREADS = 2, but got the same error. FYI.
@cboettig Could you take a look at this?
I also tried this library(RhpcBLASctl)
@xc308 can you try this on rocker/rstudio or similar image from the versioned stack for comparison? I'm unclear why you are using the cuda images here. The cuda images should indeed have support for NVBLAS (you have to opt into it, and it is not extensively tested), if you do want to leverage GPU. But unless I'm missing something it seems you are just using CPU with openblas, which should work out of the box and the standard Can you show the output of sessionInfo() as well? Also, please test if openblas is working for you on some standard linear algebra before we worry about the I recommend these examples (which also indicate how to opt in for NVBLAS if you want GPU-accelerated linear algebra; note that it is not always faster, as it depends on both your hardware and the overhead of copying data onto the GPU).
"But unless I'm missing something it seems you are just using CPU with openblas, " No, my algorithm has 50 steps: the first 25 steps are done on CPU, but the remaining 25 steps are offloaded to GPU, so I do need the cuda image here. "Can you show the output of sessionInfo() as well?" Check Current BLAS Library Matrix products: default locale: time zone: Etc/UTC attached base packages: other attached packages: loaded via a namespace (and not attached): "please test if openblas is working for you on some standard linear algebra " I did test openblas; if a GPU node is required, the BLAS threads will automatically be 36 (the same as the intra-op threads). In that case, I have to set the env var OPENBLAS_NUM_THREADS to 1; any other number gives the same error as reported above. "if you want GPU-accelerated linear algebra" Since the first 25 steps of the algorithm involve a few loops, it is not ideal to offload them to GPU; it is better to leave them on CPU. That's why I'm thinking of increasing the BLAS threads to try to speed up this part of the calculation.
@xc308 thanks. I understand you are running a complex algorithm with many steps and it is not working as expected. When trying to debug code, it is helpful to try and reproduce the problem with a minimal example rather than attempt to debug a complex algorithm with many steps and interleaved CPU & GPU dispatch. Please see the simple matrix multiplication examples in the tests I linked above, and see if they are working as expected. If they are not, we can try and debug. If they are working as expected for you on both standard and cuda images, then we will need to further isolate the issue, as it is not specifically an issue with openblas configuration. If that is the case, then please proceed to identify a minimal reproducible example that we can run to generate the behavior you are seeing. Hope this helps.
Container image name
rocker/cuda:4.3.3
Container image digest
No response
What operating system are you seeing the problem on?
Linux
System information
Linux bask-pg-login01.cluster.baskerville.ac.uk 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
[fwzp1184@bask-pg-login01 XC_Work]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 2
Core(s) per socket: 36
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
Stepping: 6
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 55296K
NUMA node0 CPU(s): 0-35,72-107
NUMA node1 CPU(s): 36-71,108-143
[fwzp1184@bask-pg-login01 XC_Work]$ cat /proc/meminfo
MemTotal: 527954288 kB
MemFree: 502250764 kB
MemAvailable: 499490632 kB
Buffers: 5284 kB
Cached: 5043180 kB
SwapCached: 21844 kB
Active: 4296360 kB
Inactive: 9620012 kB
Active(anon): 3348860 kB
Inactive(anon): 9000292 kB
Active(file): 947500 kB
Inactive(file): 619720 kB
Unevictable: 4207544 kB
Mlocked: 4207544 kB
SwapTotal: 33554428 kB
SwapFree: 32450556 kB
Dirty: 188 kB
Writeback: 0 kB
AnonPages: 13041088 kB
Mapped: 3214484 kB
Shmem: 3476772 kB
KReclaimable: 1094028 kB
Slab: 2450656 kB
SReclaimable: 1094028 kB
SUnreclaim: 1356628 kB
KernelStack: 62560 kB
PageTables: 193820 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 297531572 kB
Committed_AS: 14419992 kB
VmallocTotal: 13743895347199 kB
VmallocUsed: 3079888 kB
VmallocChunk: 0 kB
Percpu: 372672 kB
HardwareCorrupted: 0 kB
AnonHugePages: 7432192 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 4634816 kB
DirectMap2M: 144955392 kB
DirectMap1G: 389021696 kB
Bug description
I recently encountered a strange error when I submitted my job to HPC, saying,
"OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
With a larger NUM_THREADS value or, set the environment variable OPENBLAS_NUM_THREADS to a sufficiently small number. This error typically occurs when the software that relies on OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more CPU cores than what OpenBLAS was configured to handle."
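The last sentence of the message names two possible triggers, and the second one can be checked against the numbers in this report. A minimal sketch, assuming the compute node has the same topology as the login node where lscpu was run (2 sockets, 36 cores per socket, 2 threads per core) and taking the limit of 128 from the message itself; the function name `exceeds_openblas_limit` is hypothetical:

```shell
#!/bin/sh
# Values copied from the lscpu output in this report (login node; the compute
# node is assumed, not confirmed, to match).
sockets=2
cores_per_socket=36
threads_per_core=2

# Build-time thread limit quoted in the OpenBLAS error message.
openblas_limit=128

# Hypothetical helper: prints "yes" if the CPU count ($1) is above the limit ($2).
exceeds_openblas_limit() {
    if [ "$1" -gt "$2" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

logical_cpus=$((sockets * cores_per_socket * threads_per_core))
echo "logical CPUs: $logical_cpus"
echo "above OpenBLAS limit: $(exceeds_openblas_limit "$logical_cpus" "$openblas_limit")"
```

With these numbers the node exposes 144 logical CPUs, which is more than the 128 the library was built for. That alone does not establish the cause of the failure; it only shows that the second condition in the message applies to this hardware.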
I never encountered this error when using simulation data of size 200*5, where the precision matrices are 200*5 by 200*5. But I get it when I use actual data, where the matrices are around 3800*5 by 3800*5.
My code offloads large matrix multiplications to 1 GPU node and returns only the neg-log likelihood scalar, which is computed entirely on GPU, back to the CPU for the subsequent optimization.
After encountering such an error, I followed the instructions of the error and set the environment variable at the beginning of my R scripts. I have tried to set
Sys.setenv(OPENBLAS_NUM_THREADS = "126")
Sys.setenv(OPENBLAS_NUM_THREADS = "1")
but both gave me exactly the same error as the one quoted above.
When I tried Sys.getenv("OPENBLAS_NUM_THREADS"), I got an empty result, [1] "".
So, I'm wondering whether the OPENBLAS library enclosed in cuda_4.3.3.sif ever honours the environment variable OPENBLAS_NUM_THREADS. My impression is that OPENBLAS does not change its thread count no matter how small I set the environment variable.
In the terminal, I typed echo $OPENBLAS_NUM_THREADS and got 120.
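One thing worth ruling out is when the variable becomes visible to the process. OpenBLAS is generally understood to read OPENBLAS_NUM_THREADS when the library is initialised, so a Sys.setenv() call made after R has already loaded its BLAS may come too late; that is an assumption about this image, not something verified here. The sketch below only demonstrates the general shell behaviour that a child process sees a variable if it is exported before launch; the helper name `child_view` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: start a child shell and print what it sees.
child_view() {
    sh -c 'echo "${OPENBLAS_NUM_THREADS:-<unset>}"'
}

# Start from a clean state.
unset OPENBLAS_NUM_THREADS

# Plain assignment without export: the child does not inherit it.
OPENBLAS_NUM_THREADS=1
echo "not exported: $(child_view)"

# Exported before launch: the child inherits it.
export OPENBLAS_NUM_THREADS
echo "exported:     $(child_view)"
```

The same reasoning suggests checking the value from inside the container at the moment Rscript starts, rather than in the login terminal.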
In my slurm job description, I set job allocation parameters as below:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36
And the Rscript run command is:
apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R
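For comparison, the variant reported in the comments to run without the error passes the variable through apptainer's --env flag rather than setting it inside the R script. A minimal dry-run sketch that only assembles and prints that command line; the function name `build_run_cmd` is hypothetical, and a thread count of 1 is the only value reported to work in this thread:

```shell
#!/bin/sh
# Hypothetical helper: print the apptainer command for a given thread count.
# It does not run anything, so it works without apptainer installed.
build_run_cmd() {
    threads=$1
    image=$2
    script=$3
    printf 'apptainer exec --nv --env OPENBLAS_NUM_THREADS=%s %s Rscript %s\n' \
        "$threads" "$image" "$script"
}

build_run_cmd 1 ../cuda_4.3.3.sif 064a_Optm_GPU_Lon_Strip_1.R
```

The printed line can replace the run command in the job script while the Slurm settings stay unchanged.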
How to reproduce this bug?