[ASR] hybrid models ignore OMP_NUM_THREADS/torch.set_num_threads() setting #8141
Very odd. The hybrid code doesn't explicitly set num threads anywhere, and Transcribe() is never called during training. What about num_workers on the data loader? That's the only place I can think of that spawns worker threads.
There are quite a few locations, for example NeMo/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/gpu_rnnt.py, lines 85 to 89 (at 76a712a).
Here is what the call becomes:
$ srun --mem=48G -c32 --container-image=./nemo_head.sqfs --pty bash
root@x:/workspace/nemo# nproc
32
root@x:/workspace/nemo# export OMP_NUM_THREADS=8
root@x:/workspace/nemo# nproc
8
root@x:/workspace/nemo# python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing as mp, numba, torch
>>> mp.cpu_count()
256
>>> torch.get_num_threads()
8
>>> print(torch.__config__.parallel_info())
ATen/Parallel:
at::get_num_threads() : 8
at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 8
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 8
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
OMP_NUM_THREADS : 8
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
>>> numba.get_num_threads()
32
>>> torch.get_num_threads()
32
>>> print(torch.__config__.parallel_info())
ATen/Parallel:
at::get_num_threads() : 32
at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 32
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 32
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
OMP_NUM_THREADS : 8
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
>>> numba.set_num_threads(2)
>>> numba.get_num_threads()
2
>>> torch.get_num_threads()
32
>>> torch.__version__
'2.1.0a0+29c30b1'
>>> numba.__version__
'0.57.1+1.gc785c8f1f'
>>> exit()
root@x:/workspace/nemo# exit
exit
I have opened an issue on numba: numba/numba#9387.
Until numba fixes the issue, one possible solution is to first read torch's num_threads and set it back after numba.set_num_threads().
This isn't specific to hybrid models; I don't know why it shows up only there. In either case, I'll send a PR for a temporary fix. Basically, before I call get or set on the numba threads, I should cache the PyTorch thread count and then set it explicitly after the numba thread call?
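For illustration, a minimal sketch of that idea (the helper name is mine, not the actual NeMo change): cache torch's thread count before touching numba's thread API, then restore it afterwards, since even numba.get_num_threads() resets the ATen setting in the transcript above.

```python
# Hypothetical helper (illustrative only, not the actual NeMo patch): preserve
# torch's thread count across a numba thread-count call, which can otherwise
# reset the ATen/OpenMP setting as shown in the transcript above.
import numba
import torch


def set_numba_threads_preserving_torch(n: int) -> None:
    cached = torch.get_num_threads()  # cache PyTorch's current thread count
    numba.set_num_threads(n)          # may re-initialize the OpenMP threading layer
    torch.set_num_threads(cached)     # restore PyTorch's setting explicitly
```

The same wrapping would apply around any numba.get_num_threads() call.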
Temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
I have opened a PR; if there are other locations and you prefer to fix it yourself, I can close it.
Perfect, let's get yours passing the tests and merged.
Adding this here just for completeness; found it while researching something else. The Lightning 2.1.0 changelog mentions:
Hmm, actually we're seeing some random segfaults in NeMo ASR, and it might be due to the numba fix PR (I saw the segfault first on that PR but thought it was a random fluke). We might have to revert that PR for the time being and depend on numba to fix it properly.
Hmmm, this is strange. I haven't seen anything like this. During training or inference?

We use pyxis/enroot, which by default injects […]. The good thing is that issues with oversubscription (due to numba resetting the value to […]) […].

As the Numba maintainers pointed out, one approach might be to ensure […]. Yet another approach could be to go the route of PyTorch/Lightning and declare both environment variables (if not already set) during NeMo startup (e.g. in exp_manager).
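A rough sketch of that last option, assuming it would live somewhere in NeMo startup such as exp_manager (placement, function name, and defaults are my assumption): only set the variables when the user or launcher hasn't already done so.

```python
# Hypothetical startup snippet (not existing NeMo code): default the threading
# env vars without overriding anything the user or the launcher already set.
# Note these must be set before OpenMP/MKL are initialized to take effect.
import os


def apply_default_thread_env(num_threads: int = 1) -> None:
    os.environ.setdefault("OMP_NUM_THREADS", str(num_threads))
    os.environ.setdefault("MKL_NUM_THREADS", str(num_threads))
```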
…#8145) temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
…#8145) temporary fix until numba/numba#9387 gets resolved. Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com> Signed-off-by: Sasha Meister <ameister@nvidia.com>
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
@titu1994 have you noticed any more segfaults?
No, thankfully none on the 1.23 branch or main. The results seemed very random during that one-to-two-week period.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Tested on a SLURM 23.11 cluster with Pyxis/Enroot 0.16.1/3.4.1 running NGC containers nemo:23.03, nemo:23.08, ea-bignlp/ga-participants/nemofw-training:23.11, and a custom container built from the main branch.

When running single-node multi-GPU training in exclusive mode or without CPU binding, but with OMP_NUM_THREADS set to nCPUs/tasks or 1, the hybrid models (conformer_hybrid or fastconformer_hybrid) ignore this setting and start too many threads, which leads to CPU oversubscription. This results in a performance drop of ~2x. See also https://pytorch.org/docs/stable/notes/multiprocessing.html#avoid-cpu-oversubscription.

More specifically, this happens both when running as a SLURM sbatch job and as a docker command.
In SLURM the workaround is to use CPU binding and explicitly set the CPU count per task (directive #SBATCH --cpus-per-task). For docker I haven't found a workaround.

Compared to fastconformer_ctc, fastconformer_transducer, and conformer_ctc, the issue is present only in the case of hybrid models (conformer or fastconformer). On a DGX-A100 with 256 vCPUs and 8 GPUs this can be seen from the vCPU count (as reported by the bash command nproc run as part of the <train_command>) and the main processes' thread counts (as reported by btop run on the host) for the following models. Note that in all cases the Python calls os.cpu_count() and multiprocessing.cpu_count(), run from within the <train_command> but prior to actual training, return 256, and torch.__config__.parallel_info() returns the correct values (matching nproc).

conformer_ctc
conformer_hybrid
fastconformer_ctc
fastconformer_transducer
fastconformer_hybrid
Unfortunately I wasn't able to find the exact location where this happens. I may be completely wrong, but one possible cause could be that it is related to decoding, or some place where torch.set_num_threads() gets reset based on the value of *.cpu_count().
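One way to test that last hypothesis would be a temporary debugging shim (purely hypothetical, not part of NeMo): wrap torch.set_num_threads so every call site is logged during a training run.

```python
# Hypothetical debugging aid: log every call to torch.set_num_threads() with a
# short stack trace, to help locate where the thread count gets reset.
import traceback

import torch

_original_set_num_threads = torch.set_num_threads


def _logged_set_num_threads(n: int) -> None:
    print(f"torch.set_num_threads({n}) called from:")
    traceback.print_stack(limit=5)  # show the immediate call site
    _original_set_num_threads(n)


torch.set_num_threads = _logged_set_num_threads
```

Note this would only catch resets that go through torch.set_num_threads(); changes made directly at the OpenMP level (as numba does) would not show up here.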