[ASR] hybrid models ignore OMP_NUM_THREADS/torch.set_num_threads() setting #8141

Closed
itzsimpl opened this issue Jan 9, 2024 · 14 comments
Labels
bug Something isn't working stale

Comments

@itzsimpl
Contributor

itzsimpl commented Jan 9, 2024

Tested on a SLURM 23.11 cluster with Pyxis/Enroot 0.16.1/3.4.1 running NGC containers nemo:23.03, nemo:23.08, ea-bignlp/ga-participants/nemofw-training:23.11, and a custom container built from main branch.

When running single-node multi-GPU training in exclusive mode or without CPU binding, but with OMP_NUM_THREADS set to nCPUs/tasks or 1, the hybrid models (conformer_hybrid or fastconformer_hybrid) ignore this setting and start too many threads, which leads to CPU oversubscription. This results in a performance drop of ~2x. See also https://pytorch.org/docs/stable/notes/multiprocessing.html#avoid-cpu-oversubscription.

More specifically, this happens both when running as a SLURM sbatch job

#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun <train_command>

or as a docker command

docker run --rm \
    --net=host --uts=host --ipc=host --security-opt=seccomp=unconfined \
    --ulimit=stack=67108864 --ulimit=memlock=-1 \
    <train_command>

In SLURM the workaround is to use cpu binding and explicitly set the CPU count per task (directive #SBATCH --cpus-per-task). For docker I haven't found a workaround.

Compared to fastconformer_ctc, fastconformer_transducer, and conformer_ctc, the issue is present only in the hybrid models (conformer or fastconformer). On a DGX-A100 with 256 vCPUs and 8 GPUs this can be seen in the following vCPU counts (as reported by the bash command nproc, run as part of the <train_command>) and the thread counts of the main processes (as reported by btop, run on the host). Note that in all cases the Python calls os.cpu_count() and multiprocessing.cpu_count(), run from within the <train_command> but prior to actual training, return 256, while torch.__config__.parallel_info() returns the correct values (the same as nproc).

conformer_ctc

  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS env not set [vCPU=192, nthreads: 153]
  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS=24 env set [vCPU=24, nthreads: 49]
  • exclusive mode, cpus-per-task set on srun, OMP_NUM_THREADS=32 env set [vCPU=32, nthreads: 57]

conformer_hybrid

  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS env not set [vCPU=192, nthreads: 217]
  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS=24 env set [vCPU=24, nthreads: 217]
  • exclusive mode, cpus-per-task set on srun, OMP_NUM_THREADS=32 env set [vCPU=32, nthreads: 57]

fastconformer_ctc

  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS env not set [vCPU=192, nthreads: 153]
  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS=24 env set [vCPU=24, nthreads: 49]
  • exclusive mode, cpus-per-task set on srun, OMP_NUM_THREADS=32 env set [vCPU=32, nthreads: 57]

fastconformer_transducer

  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS env not set [vCPU=192, nthreads: 153]
  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS=24 env set [vCPU=24, nthreads: 49]
  • exclusive mode, cpus-per-task set on srun, OMP_NUM_THREADS=32 env set [vCPU=32, nthreads: 57]

fastconformer_hybrid

  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS env not set [vCPU=192, nthreads: 217]
  • exclusive mode, cpus-per-task not set, OMP_NUM_THREADS=24 env set [vCPU=24, nthreads: 217]
  • exclusive mode, cpus-per-task set on srun, OMP_NUM_THREADS=32 env set [vCPU=32, nthreads: 57]
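
The counts above came from nproc and btop. For anyone who wants to collect comparable numbers from inside the training process, here is a minimal sketch using only the standard library. The function name cpu_report and the dictionary keys are made up for illustration; the affinity and /proc lookups are Linux-specific and are skipped where unavailable.

```python
import multiprocessing
import os


def cpu_report():
    """Collect the CPU counts and thread-related settings visible to this process."""
    report = {
        # Both of these report the machine-wide count and ignore CPU binding.
        "os.cpu_count": os.cpu_count(),
        "multiprocessing.cpu_count": multiprocessing.cpu_count(),
        # Environment variables as seen by this process (None if unset).
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "NUMBA_NUM_THREADS": os.environ.get("NUMBA_NUM_THREADS"),
    }
    # The affinity mask reflects CPU binding, but the call is not available
    # on every platform, so guard it.
    if hasattr(os, "sched_getaffinity"):
        report["affinity"] = len(os.sched_getaffinity(0))
    # On Linux, /proc/self/status has a "Threads:" line with the number of
    # OS threads in this process, including native (non-Python) threads.
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    report["os_threads"] = int(line.split()[1])
    except OSError:
        pass
    return report


if __name__ == "__main__":
    for key, value in cpu_report().items():
        print(f"{key}: {value}")
```

Calling this once before training starts and once after the first few steps should show whether the OS thread count grows beyond what OMP_NUM_THREADS would suggest.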

Unfortunately I wasn't able to find the exact location where this happens. I may be completely wrong, but one possible cause could be that it is related to decoding, or someplace where torch.set_num_threads() gets reset based on the value of *.cpu_count().

@itzsimpl itzsimpl added the bug Something isn't working label Jan 9, 2024
@titu1994
Collaborator

titu1994 commented Jan 9, 2024

Very odd.

Hybrid code doesn't explicitly set num threads anywhere. Transcribe() is also never called during training.

What about num_workers on the data loader? That's the only place I can think of that spawns worker threads.

@itzsimpl
Contributor Author

itzsimpl commented Jan 9, 2024

There are quite a few locations where the number of threads is updated (not torch's explicitly, but Numba's), and many of them check multiprocessing.cpu_count(), but the real source of the issue here is Numba. More precisely, in the case of fastconformer_hybrid it is this section:

if num_threads > 0:
    numba.set_num_threads(min(multiprocessing.cpu_count(), num_threads))
    self.num_threads_ = numba.get_num_threads()
else:
    self.num_threads_ = numba.get_num_threads()

Here the call becomes numba.set_num_threads(1), which resets the torch num_threads. It turns out that Numba resets the torch num_threads even when just reading the Numba num_threads. This can easily be checked with the following snippet:

$ srun --mem=48G -c32 --container-image=./nemo_head.sqfs --pty bash
root@x:/workspace/nemo# nproc
32
root@x:/workspace/nemo# export OMP_NUM_THREADS=8
root@x:/workspace/nemo# nproc
8
root@x:/workspace/nemo# python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing as mp, numba, torch
>>> mp.cpu_count()
256

>>> torch.get_num_threads()
8

>>> print(torch.__config__.parallel_info())
ATen/Parallel:
        at::get_num_threads() : 8
        at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
        omp_get_max_threads() : 8
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
        mkl_get_max_threads() : 8
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
        OMP_NUM_THREADS : 8
        MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP

>>> numba.get_num_threads()
32

>>> torch.get_num_threads()
32

>>> print(torch.__config__.parallel_info())
ATen/Parallel:
        at::get_num_threads() : 32
        at::get_num_interop_threads() : 128
OpenMP 201511 (a.k.a. OpenMP 4.5)
        omp_get_max_threads() : 32
Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
        mkl_get_max_threads() : 32
Intel(R) MKL-DNN v2.7.3 (Git Hash N/A)
std::thread::hardware_concurrency() : 256
Environment variables:
        OMP_NUM_THREADS : 8
        MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP

>>> numba.set_num_threads(2)
>>> numba.get_num_threads()
2

>>> torch.get_num_threads()
32

>>> torch.__version__
'2.1.0a0+29c30b1'

>>> numba.__version__
'0.57.1+1.gc785c8f1f'

>>> exit()
root@x:/workspace/nemo# exit
exit

I have opened an issue on numba numba/numba#9387.

@itzsimpl
Contributor Author

itzsimpl commented Jan 9, 2024

Until numba fixes the issue, one possible solution is to first read the torch num_threads and then restore that value after calling numba.set_num_threads().
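
A minimal sketch of that save-and-restore idea, written as a context manager. The name preserve_threads is made up for illustration and this is not necessarily how any actual fix is implemented. The getter and setter are passed in as callables so that the snippet runs without torch or numba installed; in practice they would be torch.get_num_threads and torch.set_num_threads.

```python
from contextlib import contextmanager


@contextmanager
def preserve_threads(get_threads, set_threads):
    """Remember a thread-count setting and put it back after the block runs."""
    saved = get_threads()
    try:
        yield saved
    finally:
        # Restore even if the wrapped code raised.
        set_threads(saved)


# Intended use (assumes torch and numba are installed; not imported here):
#
#   with preserve_threads(torch.get_num_threads, torch.set_num_threads):
#       numba.set_num_threads(n)
#
# Stand-in demonstration with a plain dict playing the role of the torch setting:
state = {"threads": 8}

with preserve_threads(lambda: state["threads"],
                      lambda n: state.update(threads=n)):
    state["threads"] = 32  # simulates the side effect of the numba call

print(state["threads"])
```

Any later numba call outside the wrapped block could still change the value, so every call site that touches numba threads would need the same treatment.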

@titu1994
Collaborator

titu1994 commented Jan 9, 2024

This isn't specific to hybrid models, I dunno why it shows up only there. In either case, I'll send a PR for temporary fix.

Basically, before I call get or set on the numba threads, I should cache the PyTorch thread count and then set it explicitly after the numba call?

itzsimpl added a commit to itzsimpl/NeMo that referenced this issue Jan 9, 2024
Temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
itzsimpl added a commit to itzsimpl/NeMo that referenced this issue Jan 9, 2024
@itzsimpl
Contributor Author

itzsimpl commented Jan 9, 2024

I have opened a PR. If there are other locations and you prefer to fix this yourself, I can close it.

@titu1994
Collaborator

titu1994 commented Jan 9, 2024

Perfect, let's get yours passing the tests and merged

itzsimpl added a commit to itzsimpl/NeMo that referenced this issue Jan 10, 2024
temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
titu1994 pushed a commit that referenced this issue Jan 11, 2024
temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
@itzsimpl
Contributor Author

Adding this here just for completeness; found it while researching something else. The Lightning 2.1.0 changelog mentions:

  • If not set by the user, Lightning will set OMP_NUM_THREADS to num_cpus / num_processes when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks (#18677)

@titu1994
Collaborator

Hmm, actually we're seeing some segfaults randomly in NeMo ASR, and it might be due to the numba fix PR (I saw the segfault first on that PR but thought it was a random fluke).

We might have to revert that PR for the time being and depend on numba to fix it properly.

@itzsimpl
Contributor Author

Hmmm, this is strange. I haven't seen anything like this. During training or inference?

We use pyxis/enroot, which by default injects OMP_NUM_THREADS=1, following PyTorch >1.9.0, which does the same if the environment variable is not set. Lightning 2.1.0, on the other hand, will set the variable (if not previously declared) as mentioned earlier. I have run quick experiments with OMP_NUM_THREADS=num_cpus / num_processes, but at least on first impression the difference seems negligible.

The good thing is that oversubscription issues (due to numba resetting the value to min(NUMBA_NUM_THREADS, os.sched_getaffinity(), os.cpu_count())) arise only when no CPU binding is used. This holds at least on most Linux platforms, but os.sched_getaffinity() is not available on Windows and macOS. There's a comment in the torch dataloader about this: https://github.com/pytorch/pytorch/blob/763ddb396df4bc14791fbf9149d46d5713a699df/torch/utils/data/dataloader.py#L501-L512

As the Numba maintainers pointed out, one approach might be to ensure NUMBA_NUM_THREADS is set to match OMP_NUM_THREADS, as this is the value to which the global OMP num_threads (and accordingly also the torch num_threads) will be reset on the first numba call. With pyxis/enroot the injection is simple, but it is not general for all use cases (e.g. docker, ...).

Yet another approach could be to go the route of PyTorch/Lightning and declare both environment variables (if not already set) during NeMo startup (e.g. in exp_manager).
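
A rough sketch of what such a startup hook could look like. The function name default_thread_env and its parameters are hypothetical and nothing like it exists in NeMo as far as this thread shows. It uses the num_cpus / num_processes default from the Lightning changelog entry quoted earlier, leaves any value the user already set untouched, and makes NUMBA_NUM_THREADS follow OMP_NUM_THREADS when only the latter is set. As far as I understand, it would have to run before numba is imported for the numba variable to take effect.

```python
import os


def default_thread_env(num_processes, env=None, num_cpus=None):
    """Set OMP_NUM_THREADS and NUMBA_NUM_THREADS unless they are already set.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the resulting (OMP_NUM_THREADS, NUMBA_NUM_THREADS) pair.
    """
    env = os.environ if env is None else env
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    # Split the CPUs evenly across processes, but never go below one thread.
    threads = str(max(1, num_cpus // max(1, num_processes)))
    # setdefault keeps whatever the user exported beforehand.
    env.setdefault("OMP_NUM_THREADS", threads)
    # Keep numba in line with the OMP value unless it was set explicitly.
    env.setdefault("NUMBA_NUM_THREADS", env["OMP_NUM_THREADS"])
    return env["OMP_NUM_THREADS"], env["NUMBA_NUM_THREADS"]


if __name__ == "__main__":
    # Example with a throwaway dict: 256 vCPUs shared by 8 processes.
    print(default_thread_env(8, env={}, num_cpus=256))
```

Note that os.cpu_count() ignores CPU binding, so with binding in place it would be more accurate to pass the affinity-based count as num_cpus.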

minitu pushed a commit to minitu/NeMo that referenced this issue Jan 19, 2024
…#8145)

temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this issue Feb 15, 2024
…#8145)

temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Contributor

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Feb 17, 2024
@itzsimpl
Contributor Author

@titu1994 have you noticed any more segfaults?

@titu1994
Collaborator

No, thankfully none in the 1.23 branch and main. It seems to have been a very random occurrence during that one-to-two-week period.

@github-actions github-actions bot removed the stale label Feb 19, 2024
Contributor

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 21, 2024
Contributor

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Mar 29, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this issue Jun 25, 2024
…#8145)

temporary fix until numba/numba#9387 gets resolved.

Signed-off-by: Iztok Lebar Bajec <itzsimpl@gmail.com>