Feature Normalization in the ASR preprocessor is too slow. #8948
Comments
Could you put it in a separate PR? Let's keep PRs small / medium sized and focused so we can debug them if they cause issues later.
Also, don't assume that a GPU will always be there (or that CUDA graphs are supported). We can have fast paths for the cases where a GPU is available or cuda-python 12.3 or above is installed via conda / pip, but those aren't guaranteed.
With GPU code, it is good practice to avoid synchronizing with the CPU in general. If we do that, we can get CUDA graphs "for free". No part of that builds in CUDA graphs as a requirement for execution.
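A minimal sketch of how such an opportunistic fast path could be gated (the helper name is hypothetical; cuda-python ships the `cuda` module as an optional dependency installed via conda / pip):

```python
import importlib.util

import torch


def can_use_cuda_graph_fast_path(device: torch.device) -> bool:
    """Hypothetical gate for an opportunistic CUDA-graph fast path.

    The fast path is never a hard requirement: it needs a CUDA device and the
    optional cuda-python bindings, and callers should fall back to the plain
    eager path whenever either is missing.
    """
    if device.type != "cuda" or not torch.cuda.is_available():
        return False
    # cuda-python provides the `cuda` module; treat its absence as "no fast path".
    return importlib.util.find_spec("cuda") is not None
```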
…hen doing per_feature normalization (#8964)

* Do feature normalization in parallel, rather than via a for loop. At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue #8948.
* Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization. With this change, we can now do stream capture to a cuda graph on the preprocessor. This is bound to increase performance significantly. Even at batch size 16, the GPU is idle about 50% of the time because these kernels finish so fast.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix crash in CPU mode.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Another issue I discovered while working on #8673.
Right now, when running Parakeet CTC 1.1B at batch size 32, the preprocessor takes about 10 milliseconds (this depends on the lengths of the inputs, of course, but this was with LibriSpeech test-other). About 90% of that time is taken by feature normalization, and the GPU is hardly active during it.
You can reproduce by running the following:
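A minimal sketch of such a run, assuming the Parakeet CTC 1.1B pretrained name and placeholder audio paths (the exact invocation originally used may have differed); profile it under Nsight Systems, e.g. with `nsys profile`:

```python
# Hedged reproduction sketch: the checkpoint name and audio paths below are
# assumptions, not taken verbatim from the report. Run this under Nsight
# Systems (nsys profile python repro.py) and inspect the preprocessor range
# in the resulting timeline.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = model.cuda().eval()

# Roughly a batch worth of utterances (LibriSpeech test-other was used here).
audio_files = [f"/path/to/test_other/utt_{i:03d}.wav" for i in range(32)]
hypotheses = model.transcribe(audio_files, batch_size=32)
```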
The culprit is this code, which runs a sequential for loop over the batch:
NeMo/nemo/collections/asr/parts/preprocessing/features.py, lines 65 to 73 at commit 468d5b6.
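The pattern in that span looks roughly like the following paraphrase (not the verbatim lines; the epsilon constant name and value are assumptions):

```python
import torch

CONSTANT = 1e-5  # assumed epsilon; the real value lives in features.py


def normalize_per_feature_loop(x: torch.Tensor, seq_len: torch.Tensor) -> torch.Tensor:
    # Paraphrase of the current per_feature branch: mean/std are computed one
    # utterance at a time, and every slice with seq_len[i] pulls that scalar
    # from GPU to CPU, synchronizing the stream each time.
    x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
    x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
    for i in range(x.shape[0]):
        x_mean[i, :] = x[i, :, : seq_len[i]].mean(dim=1)
        x_std[i, :] = x[i, :, : seq_len[i]].std(dim=1)
    x_std += CONSTANT  # guard against division by zero
    return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
```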
First of all, it copies a single scalar value of seq_len to the CPU three times per loop iteration. If you have to do a GPU-to-CPU copy, please do it just once via `seq_len = seq_len.cpu()`. But the kernels being launched are also likely to go faster if we do a single batched operation instead. Note that the slowness of this sequential for loop only grows as you run at larger batch sizes, and an extra 10 ms of latency during streaming inference is not really acceptable. The person who wrote the code initially probably wasn't sure how to do mean and std reductions over a ragged tensor, but we can do that straightforwardly with torch.where() and a mask built from seq_len, without ever communicating with the CPU (which is important for CUDA stream capture to a graph).

I will either fix this as part of #8673 or make a separate PR if the change is larger and riskier than I initially think.
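A minimal sketch of that batched version, assuming x has shape (batch, features, time) and seq_len has shape (batch,) as in the loop above (an illustration, not the exact code from the eventual fix):

```python
import torch


def normalize_per_feature_batched(x: torch.Tensor, seq_len: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, features, time); seq_len: (batch,), kept on the same device as x.
    # Build a (batch, 1, time) mask of valid frames without touching the CPU.
    time_steps = torch.arange(x.shape[2], device=x.device)
    mask = (time_steps.unsqueeze(0) < seq_len.unsqueeze(1)).unsqueeze(1)

    # Masked mean over the valid frames of each utterance.
    valid = seq_len.view(-1, 1, 1).to(x.dtype)
    x_masked = torch.where(mask, x, torch.zeros_like(x))
    mean = x_masked.sum(dim=2, keepdim=True) / valid

    # Unbiased masked variance (divide by N - 1 to match torch.std in the loop version).
    centered = torch.where(mask, x - mean, torch.zeros_like(x))
    std = (centered.pow(2).sum(dim=2, keepdim=True) / (valid - 1)).sqrt() + eps

    return (x - mean) / std
```

Because nothing here reads seq_len back to the host, the whole preprocessor remains capturable into a CUDA graph.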
You can download the .nsys-rep file I took a screenshot of here:
report6.nsys-rep.gz