Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

Merged
merged 5 commits into from
Apr 30, 2024

Conversation

galv
Copy link
Collaborator

@galv galv commented Apr 18, 2024

Commit 1:

Do feature normalization in parallel, rather than via a for loop.

At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue #8948.

Commit 2:

Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

No unit test. I simply ran:

python examples/asr/speech_to_text_eval.py  pretrained_name=nvidia/parakeet-ctc-1.1b  dataset_manifest=/home/dgalvez/scratch/data/test_other.json  batch_size=16  output_filename=test_other_decoded.jsonl  amp=true  amp_dtype=bfloat16  use_cer=false num_workers=1

Before my changes:

Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:25<00:00,  7.17it/s]
[NeMo I 2024-04-17 14:50:33 transcribe_speech:398] Finished transcribing from manifest file: /home/dgalvez/scratch/data/test_other.json
[NeMo I 2024-04-17 14:50:33 transcribe_speech:403] Writing transcriptions into file: test_other_decoded.jsonl
[NeMo I 2024-04-17 14:50:33 transcribe_speech:421] Finished writing predictions to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:50:33 transcribe_speech:439] Writing prediction and error rate of each sample to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:50:33 transcribe_speech:440] {'samples': 2939, 'tokens': 52343, 'wer': 0.037712779168179125, 'ins_rate': 0.004203045297365455, 'del_rate': 0.0022734654108476776, 'sub_rate': 0.031236268459965993}
[NeMo I 2024-04-17 14:50:33 speech_to_text_eval:131] Finished transcribing speech dataset. Computing ASR metrics..
[NeMo I 2024-04-17 14:50:33 speech_to_text_eval:210] Dataset WER/CER 3.77%/1.41%

After my changes:

Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:25<00:00,  7.18it/s]
[NeMo I 2024-04-17 14:57:37 transcribe_speech:398] Finished transcribing from manifest file: /home/dgalvez/scratch/data/test_other.json
[NeMo I 2024-04-17 14:57:38 transcribe_speech:403] Writing transcriptions into file: test_other_decoded.jsonl
[NeMo I 2024-04-17 14:57:38 transcribe_speech:421] Finished writing predictions to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:57:38 transcribe_speech:439] Writing prediction and error rate of each sample to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:57:38 transcribe_speech:440] {'samples': 2939, 'tokens': 52343, 'wer': 0.03761725541142082, 'ins_rate': 0.004222150048717116, 'del_rate': 0.0022352559081443555, 'sub_rate': 0.03115984945455935}
[NeMo I 2024-04-17 14:57:38 speech_to_text_eval:131] Finished transcribing speech dataset. Computing ASR metrics..
[NeMo I 2024-04-17 14:57:38 speech_to_text_eval:210] Dataset WER/CER 3.76%/1.41%

You can see that the WER went down by 0.01%, implying a small change took place. I don't think it is a concern, but worth noting nonetheless.

@github-actions github-actions bot added the ASR label Apr 18, 2024
pzelasko
pzelasko previously approved these changes Apr 25, 2024
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Profiling Canary training showed that the old implementation is taking about 15% of the total forward step time, while the new one is barely noticeable. It's likely this will speed up ASR training across all architectures.

@pzelasko pzelasko mentioned this pull request Apr 25, 2024
8 tasks
@pzelasko
Copy link
Collaborator

The CPU-only tests are failing on:

2024-04-25T23:47:10.4949722Z nemo/collections/asr/parts/preprocessing/features.py:466: in forward
2024-04-25T23:47:10.4950460Z     x, _, _ = normalize_batch(x, seq_len, normalize_type=self.normalize)
2024-04-25T23:47:10.4951858Z nemo/collections/asr/parts/preprocessing/features.py:67: in normalize_batch
2024-04-25T23:47:10.4952730Z     if not torch.cuda.is_current_stream_capturing() and torch.any(seq_len == 1).item():
2024-04-25T23:47:10.4953491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-04-25T23:47:10.4953904Z 
2024-04-25T23:47:10.4954053Z     def is_current_stream_capturing():
2024-04-25T23:47:10.4954490Z         r"""
2024-04-25T23:47:10.4955100Z         Returns True if CUDA graph capture is underway on the current CUDA stream, False otherwise.
2024-04-25T23:47:10.4955767Z     
2024-04-25T23:47:10.4956394Z         If a CUDA context does not exist on the current device, returns False without initializing the context.
2024-04-25T23:47:10.4957165Z         """
2024-04-25T23:47:10.4957517Z >       return _cuda_isCurrentStreamCapturing()
2024-04-25T23:47:10.4958284Z E       RuntimeError: CUDA error: no CUDA-capable device is detected
2024-04-25T23:47:10.4959247Z E       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-04-25T23:47:10.4960142Z E       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-04-25T23:47:10.4961287Z E       Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
2024-04-25T23:47:10.4961983Z 
2024-04-25T23:47:10.4962405Z /usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py:28: RuntimeError

@galv can you add a check to only call CUDA API if it is available? (sth like if torch.cuda.is_available(): don't remember the name exactly)

galv and others added 5 commits April 29, 2024 11:42
At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue NVIDIA#8948.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pzelasko pzelasko merged commit a9dac0e into NVIDIA:main Apr 30, 2024
126 checks passed
suiyoubi pushed a commit that referenced this pull request May 2, 2024
…hen doing per_feature normalization (#8964)

* Do feature normalization in parallel, rather than via a for loop.

At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue #8948.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Remove all instances of cudaStreamSynchronize() in the featurizer when
doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix crash in CPU mode.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…raph when doing per_feature normalization (NVIDIA#8964)

* Do feature normalization in parallel, rather than via a for loop.

At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue NVIDIA#8948.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Remove all instances of cudaStreamSynchronize() in the featurizer when
doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix crash in CPU mode.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants