
Feature Normalization in the ASR preprocessor is too slow. #8948

Closed
galv opened this issue Apr 16, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

galv (Collaborator) commented Apr 16, 2024

Another issue I discovered while working on #8673.

Right now, when running Parakeet CTC 1.1B at batch size 32, the preprocessor takes about 10 milliseconds (this depends on the lengths of the inputs, of course, but this was with LibriSpeech test-other). 90% of that time is spent on feature normalization, and the GPU is barely active during it.

[Nsight Systems timeline screenshot: the preprocessor region is dominated by feature normalization, with the GPU mostly idle; the report file is attached below.]

You can reproduce this by running:

nsys profile -c cudaProfilerApi $(which python) examples/asr/speech_to_text_eval.py  pretrained_name=nvidia/parakeet-ctc-1.1b dataset_manifest=/home/dgalvez/scratch/data/test_other_sorted_downward.json  \
batch_size=$batch_size  output_filename=test_other_decoded.jsonl  amp=$amp  amp_dtype=bfloat16  use_cer=false num_workers=1

The culprit is this code, which runs a sequential for loop over the batch:

for i in range(x.shape[0]):
    if x[i, :, : seq_len[i]].shape[1] == 1:
        raise ValueError(
            "normalize_batch with `per_feature` normalize_type received a tensor of length 1. This will result "
            "in torch.std() returning nan. Make sure your audio length has enough samples for a single "
            "feature (ex. at least `hop_length` for Mel Spectrograms)."
        )
    x_mean[i, :] = x[i, :, : seq_len[i]].mean(dim=1)
    x_std[i, :] = x[i, :, : seq_len[i]].std(dim=1)

First of all, it copies a single scalar value of seq_len from GPU to CPU three times per loop iteration. If a GPU-to-CPU copy is truly needed, do it just once via seq_len = seq_len.cpu(). But the kernels being launched would also run faster as a single batched operation, and the slowness of this sequential loop only grows with batch size. An extra 10 ms of latency during streaming inference is not really acceptable.

Whoever wrote the code initially probably wasn't sure how to do mean and std reductions over a ragged tensor, but it can be done straightforwardly with torch.where() and a mask built from seq_len, without ever communicating with the CPU (which is important for CUDA stream capture to a graph). A sketch of that approach is below.
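Something along these lines (illustrative only, not a final implementation; the function name, the eps default, and the choice to leave padded frames untouched are placeholder decisions):

import torch

def normalize_batch_per_feature(x: torch.Tensor, seq_len: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, features, time), zero-padded beyond seq_len.
    # seq_len: (batch,) valid lengths, left on the same device as x -- no .cpu(), no .item().
    batch, _, time = x.shape
    mask = torch.arange(time, device=x.device)[None, None, :] < seq_len[:, None, None]  # (batch, 1, time)
    valid = seq_len[:, None, None].to(x.dtype)

    x_masked = torch.where(mask, x, torch.zeros_like(x))
    x_mean = x_masked.sum(dim=2, keepdim=True) / valid                      # per-feature mean over valid frames
    sq_dev = torch.where(mask, (x - x_mean) ** 2, torch.zeros_like(x))
    x_std = torch.sqrt(sq_dev.sum(dim=2, keepdim=True) / (valid - 1))       # unbiased, matches torch.std(dim=1)

    # Note: seq_len == 1 still needs the explicit length check from the loop version,
    # since (valid - 1) would be zero here.
    return torch.where(mask, (x - x_mean) / (x_std + eps), x)

Everything here is a handful of batched kernel launches with no GPU-to-CPU traffic, so it is also stream-capturable.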

I will either fix this as part of #8673 or make a separate PR, if the change is larger and riskier than I initially think.

You can download the .nsys-rep file I took a screenshot of here:
report6.nsys-rep.gz

@galv galv added the bug Something isn't working label Apr 16, 2024
titu1994 (Collaborator) commented:

Could you put it in a separate PR? Let's keep PRs small/medium sized and focused so we can debug them if they cause issues later.

titu1994 (Collaborator) commented Apr 16, 2024

Also, don't assume that a GPU will always be there (or that CUDA graphs are supported). We can have fast paths for the cases where a GPU is available or cuda-python 12.3 and above is installed via conda/pip, but those aren't guaranteed.
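For illustration only (this is not NeMo's actual capability check, and the "cuda-python" distribution name used below is an assumption), the kind of guard meant here might look like:

import importlib.metadata
import torch

def cuda_graphs_usable() -> bool:
    # Take the fast path only when a GPU is present and cuda-python >= 12.3 is installed.
    if not torch.cuda.is_available():
        return False
    try:
        version = importlib.metadata.version("cuda-python")
    except importlib.metadata.PackageNotFoundError:
        return False
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (12, 3)

# Callers would keep the existing eager path as the fallback and only enable
# graph capture when cuda_graphs_usable() returns True.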

galv (Collaborator, Author) commented Apr 16, 2024

[don't make the assumption] that cuda graphs are supported

It is good practice in GPU code to avoid synchronizing with the CPU in general. If we do that, we get CUDA graphs "for free". No part of this makes CUDA graphs a requirement for execution.
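To make that concrete, a toy sketch of stream capture (the normalization stand-in and tensor shapes are made up; the point is only that sync-free device code is capturable as-is):

import torch

def device_only_normalize(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Stand-in for the preprocessor body: purely device-side math, no .item()/.cpu().
    denom = mask.sum(dim=2, keepdim=True)
    x_mean = (x * mask).sum(dim=2, keepdim=True) / denom
    return torch.where(mask.bool(), x - x_mean, x)

x = torch.randn(8, 80, 400, device="cuda")
mask = (torch.arange(400, device="cuda")[None, None, :] < 300).float()

# Warm up on a side stream first, as required before stream capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = device_only_normalize(x, mask)
torch.cuda.current_stream().wait_stream(s)

# Capture into a graph; any hidden GPU-to-CPU sync inside would break the capture.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = device_only_normalize(x, mask)

# Replay: copy new data into x in place, then g.replay() recomputes y with minimal launch overhead.
g.replay()

The requirement is only "no CPU synchronization inside the captured region"; the same code still runs eagerly on CPU or on setups where graphs are unavailable.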

galv added a commit to galv/NeMo that referenced this issue Apr 18, 2024
At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue NVIDIA#8948.
galv added a commit to galv/NeMo that referenced this issue Apr 29, 2024
At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue NVIDIA#8948.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
suiyoubi pushed a commit that referenced this issue May 2, 2024
…hen doing per_feature normalization (#8964)

* Do feature normalization in parallel, rather than via a for loop.

At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue #8948.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Remove all instances of cudaStreamSynchronize() in the featurizer when
doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix crash in CPU mode.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this issue Jun 25, 2024
…raph when doing per_feature normalization (NVIDIA#8964)
