Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964
LGTM
Profiling Canary training showed that the old implementation took about 15% of the total forward step time, while the new one is barely noticeable. This will likely speed up ASR training across all architectures.
The CPU-only tests are failing on:
@galv can you add a check to only call the CUDA API if it is available? (something like |
LGTM
…hen doing per_feature normalization (#8964)

* Do feature normalization in parallel, rather than via a for loop. At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue #8948.
  Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
* Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization. With this change, we can now do stream capture to a cuda graph on the preprocessor. This is bound to increase performance significantly. Even at batch size 16, the GPU is idle about 50% of the time because these kernels finish so fast.
  Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix crash in CPU mode.
  Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Commit 1:
Do feature normalization in parallel, rather than via a for loop.
At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue #8948.
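To illustrate the idea in Commit 1, here is a minimal sketch (not the code from this PR) of replacing a per-utterance loop with one masked, batched reduction. The function names, the `(batch, features, time)` layout, and the `EPS` constant are assumptions made for the example.

```python
import torch

EPS = 1e-5  # assumed small constant to avoid division by zero


def normalize_loop(x, seq_len):
    """Reference version: one mean/std computation per utterance."""
    mean = torch.zeros(x.shape[0], x.shape[1], dtype=x.dtype, device=x.device)
    std = torch.zeros_like(mean)
    for i in range(x.shape[0]):
        n = int(seq_len[i])  # on a GPU tensor this forces a device-to-host sync
        mean[i] = x[i, :, :n].mean(dim=1)
        std[i] = x[i, :, :n].std(dim=1)
    return (x - mean.unsqueeze(2)) / (std.unsqueeze(2) + EPS)


def normalize_masked(x, seq_len):
    """Batched version: padded frames are masked out instead of sliced away."""
    batch, _, max_time = x.shape
    frames = torch.arange(max_time, device=x.device).unsqueeze(0)  # (1, T)
    valid = (frames < seq_len.unsqueeze(1)).unsqueeze(1)           # (B, 1, T)
    count = seq_len.to(x.dtype).view(batch, 1)                     # (B, 1)
    zeros = torch.zeros_like(x)
    mean = torch.where(valid, x, zeros).sum(dim=2) / count         # (B, D)
    centered = torch.where(valid, x - mean.unsqueeze(2), zeros)
    # Divide by (count - 1) to match the unbiased default of torch.std.
    std = torch.sqrt((centered ** 2).sum(dim=2) / (count - 1.0))
    return (x - mean.unsqueeze(2)) / (std.unsqueeze(2) + EPS)
```

Both functions should agree up to floating-point rounding as long as every sequence has at least two valid frames; the masked version issues a fixed number of kernels regardless of batch size.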
Commit 2:
Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization.
With this change, we can now do stream capture to a CUDA graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.
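As a rough sketch of what Commit 2 enables (again, not the PR's code), a function that never reads tensor values on the host can be captured with PyTorch's `torch.cuda.CUDAGraph` and replayed. The helper names `masked_mean` and `capture_or_eager` are hypothetical, and the eager fallback when CUDA is unavailable is an assumption made so the example also runs on CPU-only machines.

```python
import torch


def masked_mean(x, seq_len):
    """Per-feature mean over valid frames, with no host-side read of seq_len."""
    max_time = x.shape[2]
    frames = torch.arange(max_time, device=x.device).unsqueeze(0)
    valid = (frames < seq_len.unsqueeze(1)).unsqueeze(1)  # (B, 1, T)
    total = torch.where(valid, x, torch.zeros_like(x)).sum(dim=2)
    return total / seq_len.to(x.dtype).unsqueeze(1)


def capture_or_eager(x, seq_len):
    """Run masked_mean through a CUDA graph if possible, else eagerly."""
    if not torch.cuda.is_available():
        return masked_mean(x, seq_len)  # CPU fallback: no CUDA API calls
    static_x, static_len = x.cuda(), seq_len.cuda()
    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            masked_mean(static_x, static_len)
    torch.cuda.current_stream().wait_stream(side)
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        # Any synchronizing call here (e.g. int(seq_len[i])) would fail capture.
        static_out = masked_mean(static_x, static_len)
    graph.replay()  # launches all captured kernels with one call
    torch.cuda.synchronize()
    return static_out.cpu()
```

To reuse the graph on new data, one would copy the new batch into `static_x` and `static_len` in place and call `graph.replay()` again; the input shapes must stay the same between replays.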
No unit test. I simply ran:
Before my changes:
After my changes:
You can see that the WER went down by 0.01%, implying a small change took place. I don't think it is a concern, but worth noting nonetheless.