[ASR] hybrid models ignore OMP_NUM_THREADS/torch.set_num_threads() setting #8141
Very odd. The hybrid code doesn't explicitly set num threads anywhere, and Transcribe() is never called during training. What about num_workers on the data loader? That's the only place I can think of that spawns worker threads.
There are quite a few locations, for example NeMo/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/gpu_rnnt.py, lines 85 to 89 (at 76a712a).
Here is what the call becomes:
$ srun --mem=48G -c32 --container-image=./nemo_head.sqfs --pty bash
root@x:/workspace/nemo# nproc
32
root@x:/workspace/nemo# export OMP_NUM_THREADS=8
root@x:/workspace/nemo# nproc
8
root@x:/workspace/nemo# python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing as mp, numba, torch
>>> mp.cpu_count()
256
>>> torch.get_num_threads()
8
>>> print(torch.__config__.parallel_info())
ATen/Parallel:
at::get_num_threads() : 8
at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 8
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 8
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
OMP_NUM_THREADS : 8
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
>>> numba.get_num_threads()
32
>>> torch.get_num_threads()
32
>>> print(torch.__config__.parallel_info())
ATen/Parallel:
at::get_num_threads() : 32
at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 32
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 32
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
OMP_NUM_THREADS : 8
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
>>> numba.set_num_threads(2)
>>> numba.get_num_threads()
2
>>> torch.get_num_threads()
32
>>> torch.__version__
'2.1.0a0+29c30b1'
>>> numba.__version__
'0.57.1+1.gc785c8f1f'
>>> exit()
root@x:/workspace/nemo# exit
exit
I have opened an issue on numba: numba/numba#9387.
Until numba fixes the issue, one possible solution is to first read torch's num_threads and set it back after numba.set_num_threads().
This isn't specific to hybrid models; I don't know why it shows up only there. In either case, I'll send a PR for a temporary fix. Basically, before I call get or set on the numba threads, I should cache the PyTorch thread count and then set it explicitly after the numba thread call?
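For illustration, a minimal sketch of that idea (the helper name is mine, not the actual NeMo change): cache torch's thread count before touching numba's thread API, then restore it afterwards, since even numba.get_num_threads() resets the ATen setting in the transcript above.

```python
# Hypothetical helper (illustrative only, not the actual NeMo patch): preserve
# torch's thread count across a numba thread-count call, which can otherwise
# reset the ATen/OpenMP setting as shown in the transcript above.
import numba
import torch


def set_numba_threads_preserving_torch(n: int) -> None:
    cached = torch.get_num_threads()  # cache PyTorch's current thread count
    numba.set_num_threads(n)          # may re-initialize the OpenMP threading layer
    torch.set_num_threads(cached)     # restore PyTorch's setting explicitly
```

The same wrapping would apply around any numba.get_num_threads() call.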
Temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
I have opened a PR; if there are other locations and you prefer to fix it yourself, I can close it.
Perfect, let's get yours passing the tests and merged.
Adding this here just for completeness; found it while researching something else. The Lightning 2.1.0 changelog mentions:
Hmm, actually we're seeing some random segfaults in NeMo ASR, and it might be due to the numba fix PR (I saw the segfault first on that PR but thought it was a random fluke). We might have to revert that PR for the time being and depend on numba to fix it properly.
Hmmm, this is strange. I haven't seen anything like this. During training or inference?

We use pyxis/enroot, which by default injects […]. The good thing is that issues with oversubscription (due to numba resetting the value to […]) […].

As the Numba maintainers pointed out, one approach might be to ensure […]. Yet another approach could be to go the route of PyTorch/Lightning and declare both environment variables (if not already set) during NeMo startup (e.g. in exp_manager).
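A rough sketch of that last option, assuming it would live somewhere in NeMo startup such as exp_manager (placement, function name, and defaults are my assumption): only set the variables when the user or launcher hasn't already done so.

```python
# Hypothetical startup snippet (not existing NeMo code): default the threading
# env vars without overriding anything the user or the launcher already set.
# Note these must be set before OpenMP/MKL are initialized to take effect.
import os


def apply_default_thread_env(num_threads: int = 1) -> None:
    os.environ.setdefault("OMP_NUM_THREADS", str(num_threads))
    os.environ.setdefault("MKL_NUM_THREADS", str(num_threads))
```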
…#8145) temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
…#8145) temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com> Signed-off-by: Sasha Meister <ameister@nvidia.com>
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
@titu1994 have you noticed any more segfaults?
No, thankfully none on the 1.23 branch or main. The results seemed very random during that one-to-two-week period.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Tested on a SLURM 23.11 cluster with Pyxis/Enroot 0.16.1/3.4.1 running NGC containers nemo:23.03, nemo:23.08, ea-bignlp/ga-participants/nemofw-training:23.11, and a custom container built from the main branch.

When running single-node multi-GPU training in exclusive mode or without CPU binding, but with OMP_NUM_THREADS set to nCPUs/tasks or 1, the hybrid models (conformer_hybrid or fastconformer_hybrid) ignore this setting and start too many threads, which leads to CPU oversubscription. This results in a performance drop of ~2x. See also https://pytorch.org/docs/stable/notes/multiprocessing.html#avoid-cpu-oversubscription.

More specifically, this happens both when running as a SLURM sbatch job and as a docker command.
In SLURM the workaround is to use CPU binding and explicitly set the CPU count per task (directive #SBATCH --cpus-per-task). For docker I haven't found a workaround.

Compared to fastconformer_ctc, fastconformer_transducer, and conformer_ctc, the issue is present only in the case of hybrid models (conformer or fastconformer). On a DGX-A100 with 256 vCPUs and 8 GPUs this can be seen from the vCPU count (as reported by the bash command nproc run as part of the <train_command>) and the main processes' thread counts (as reported by btop run on the host) for the following models. Note that in all cases the Python calls os.cpu_count() and multiprocessing.cpu_count(), run from within the <train_command> but prior to actual training, return 256, and torch.__config__.parallel_info() returns the correct values (matching nproc).

conformer_ctc
conformer_hybrid
fastconformer_ctc
fastconformer_transducer
fastconformer_hybrid
Unfortunately I wasn't able to find the exact location where this happens. I may be completely wrong, but one possible cause could be that it is related to decoding, or some place where torch.set_num_threads() gets reset based on the value of *.cpu_count().
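One way to test that last hypothesis would be a temporary debugging shim (purely hypothetical, not part of NeMo): wrap torch.set_num_threads so every call site is logged during a training run.

```python
# Hypothetical debugging aid: log every call to torch.set_num_threads() with a
# short stack trace, to help locate where the thread count gets reset.
import traceback

import torch

_original_set_num_threads = torch.set_num_threads


def _logged_set_num_threads(n: int) -> None:
    print(f"torch.set_num_threads({n}) called from:")
    traceback.print_stack(limit=5)  # show the immediate call site
    _original_set_num_threads(n)


torch.set_num_threads = _logged_set_num_threads
```

Note this would only catch resets that go through torch.set_num_threads(); changes made directly at the OpenMP level (as numba does) would not show up here.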