Dgalvez/cuda graphs greedy rnnt inference squash #8191
Conversation
jenkins
Force-pushed from 3a79df5 to d5d0a15
tests/collections/asr/decoding/test_fast_rnnt_greedy_decoding.py (review thread resolved)
This PR is ready for review. @artbataev @titu1994 would you be willing?
I can review this tomorrow. @artbataev could you review it too?
Cool, I will review the PR today or tomorrow. @galv Please fix DCO (you will need to fix it for merging anyway).
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Overall it looks decent but needs some minor fixes: removal of the nvtx ranges, docstrings/comments for the complex functions, and a dynamic import guard instead of directly importing cuda-python without a check.
Apart from that, I just want to note that this is incredibly complex work. Serious kudos for developing this, @galv.
nemo/collections/asr/parts/submodules/fast_rnnt_greedy_decoding.py (two review threads, resolved)
```python
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("Convert to Hypotheses")
hypotheses = [
```
Was this the section you mentioned you wanted to speed up with numba on the CPU?
Ended up doing it with pytorch, which was definitely fast enough. One tricky bit about numba was the lack of bfloat16 support, and we can end up with scores in bfloat16 if the model is running in bfloat16.
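For illustration, a minimal sketch of why pytorch sidesteps the numba limitation (tensor names and the dict-based hypothesis shape here are hypothetical; the real code builds Hypothesis objects):

```python
import torch

# Minimal sketch (hypothetical tensor names): numba kernels cannot consume
# bfloat16 arrays, but plain pytorch ops handle bfloat16 natively.
scores = torch.randn(16, dtype=torch.bfloat16)  # per-utterance scores
labels = torch.randint(0, 128, (16, 10))        # per-utterance token ids

# Upcast once in pytorch, then assemble hypotheses on the host.
scores_f32 = scores.float()
hypotheses = [
    {"score": scores_f32[i].item(), "y_sequence": labels[i].tolist()}
    for i in range(scores_f32.shape[0])
]
```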
nemo/collections/asr/parts/submodules/fast_rnnt_greedy_decoding.py (two review threads, resolved)
```python
def ASSERT_DRV(err):
    if isinstance(err, cuda.CUresult):
```
Inside any function that uses cuda-python, check HAVE_CUDA_PYTHON, and if it is false, call check_cuda_python_cuda_graphs_conditional_nodes_supported() to give users a meaningful error.
It's not that simple, though. It's not just about whether we can import cuda. We also need to check that appropriate versions of both cuda-python itself and the cuda driver are installed. https://github.com/NVIDIA/NeMo/pull/8191/files#diff-acab1a9f3d702862ddbe5720bfa6c7fd0a57f7c3dc0b59eb9878ed5cd1e3513aR28-R45
Maybe what you want in this case is something like https://github.com/NVIDIA/NeMo/blob/main/nemo/core/utils/k2_guard.py. We would simply expect those developing with cuda-python to do from cuda_python_guard.cuda import ... instead of from cuda import ...
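For concreteness, a minimal sketch of what such a guard module might look like, modeled loosely on k2_guard.py (the module name, constants, and error messages are illustrative, not the PR's actual code):

```python
# cuda_python_guard.py — illustrative sketch modeled on nemo/core/utils/k2_guard.py.
# Importing cuda through this module raises a meaningful error when cuda-python
# or the CUDA driver is missing or too old, instead of failing deep inside decoding.
from packaging.version import Version

MIN_CUDA_PYTHON = Version("12.3.0")
MIN_DRIVER = 12030  # cuDriverGetVersion encodes 12.3 as 1000 * 12 + 10 * 3

try:
    import cuda
    from cuda import cuda as cuda_driver
except ImportError as e:
    raise ImportError(
        "cuda-python is required for cuda graphs with conditional nodes; "
        "install it with `pip install cuda-python`."
    ) from e

if Version(cuda.__version__) < MIN_CUDA_PYTHON:
    raise ImportError(
        f"cuda-python >= {MIN_CUDA_PYTHON} is required for conditional nodes, "
        f"but found {cuda.__version__}."
    )

# cuDriverGetVersion may be called before cuInit, so this is safe at import time.
err, driver_version = cuda_driver.cuDriverGetVersion()
if driver_version < MIN_DRIVER:
    raise ImportError(f"CUDA driver >= 12.3 is required, but found {driver_version}.")
```

Decoding code would then import through the guard (e.g. from cuda_python_guard import cuda) so the version checks run exactly once, at import time.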
Jenkins
tests/collections/asr/decoding/test_fast_rnnt_greedy_decoding.py (review thread resolved)
conf["decoding"]["greedy"]["max_symbols"] = 5 | ||
conf["decoding"]["greedy"]["loop_labels"] = False | ||
|
||
with tempfile.NamedTemporaryFile() as fp: |
change_decoding_strategy should be enough. Saving/restoring models takes a lot of time; it's better to avoid serialization in unit tests if possible.
Thanks for the tip. It sped up test execution almost 3 times.
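For reference, the suggested pattern looks roughly like this (config contents abbreviated; asr_model stands for the test's already-instantiated RNN-T model):

```python
from omegaconf import OmegaConf

# Sketch: mutate the decoding config and swap strategies in place, rather than
# saving the model to disk and restoring it just to change decoding settings.
decoding_cfg = OmegaConf.create(
    {"strategy": "greedy_batch", "greedy": {"max_symbols": 5, "loop_labels": False}}
)
asr_model.change_decoding_strategy(decoding_cfg)  # no serialization round-trip
```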
nemo/collections/asr/parts/submodules/fast_rnnt_greedy_decoding.py (two review threads, resolved)
Force-pushed from 9328d1f to e5827b1
nemo/collections/asr/parts/submodules/cuda_graph_rnnt_greedy_decoding.py (two review threads, resolved)
Force-pushed from 7b3aaf7 to e2d4174
This uses CUDA 12.3's conditional node support. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Force-pushed from e2d4174 to 5b31417
The PR looks great! Thanks! @artbataev please go through it once more and merge when ready
jenkins
nemo/collections/asr/parts/submodules/cuda_graph_rnnt_greedy_decoding.py (4 CodeQL alerts fixed)
@galv, the issue still exists.
- See details about decoding with cuda=1 below.
- Also, please change the PR name (e.g., "Cuda graphs rnnt inference") and description to match the current code (e.g., go_very_fast is no longer valid).
- If possible, please also move the with_conditional_node code to a separate file to make it reusable (but this is not a blocker).

Issue with cuda != 0
I still see an error, but the error is different now. Did you test it? Do you need a multi-GPU machine?
I do not want to block the PR, but I think this is a bug and should be fixed if possible.
Cuda 1 + cuda graphs: Fail

```
python examples/asr/speech_to_text_eval.py pretrained_name=stt_en_fastconformer_transducer_large dataset_manifest=test_other.json batch_size=16 output_filename=test_other_decoded.jsonl amp=false amp_dtype=bfloat16 rnnt_decoding.greedy.use_cuda_graph_decoder=true rnnt_decoding.greedy.loop_labels=false cuda=1
```

```
...
in _reinitialize
    logp = self.caller._joint_step(self.f, g, log_normalize=None)[:, 0, 0, :]
...
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:
...
RuntimeError: Capture must end on the same stream it began on.
```
Cuda 0 + cuda graphs: OK

```
python examples/asr/speech_to_text_eval.py pretrained_name=stt_en_fastconformer_transducer_large dataset_manifest=test_other.json batch_size=16 output_filename=test_other_decoded.jsonl amp=false amp_dtype=bfloat16 rnnt_decoding.greedy.use_cuda_graph_decoder=true rnnt_decoding.greedy.loop_labels=false cuda=0
```

Cuda 1 + no cuda graphs (both loop frames/loop labels): OK
nemo/collections/asr/parts/submodules/cuda_graph_rnnt_greedy_decoding.py (review thread resolved)
@galv, as far as I can see, the issue can be fixed by passing the appropriate device to the cuda stream initializers and getters:
@artbataev thank you for the initial suggestion. It works when the decoder has not been run yet. However, it doesn't work if the decoder has already been run. You can see my failing test here: 36b3273 Clearly something obscure is happening here. The commit message provides more details. I've spent a few hours trying to debug this, so I need to stop for the day.
@galv I tried some changes, and it seems I can get it to work.

```python
# Always create a new stream, because the per-thread default stream disallows
# stream capture to a graph.
stream_for_graph = torch.cuda.Stream(self.device)
with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph(
    self.graph, stream=stream_for_graph
):
    ...  # capture graph

# pass device explicitly
capture_status, _, graph, _, _ = cu_call(
    cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=self.device).cuda_stream)
)
...

@contextlib.contextmanager
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    # pass device explicitly here and in other calls
    capture_status, _, graph, _, _ = cu_call(
        cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
    )
    ...
```

You can see the full commit here: artbataev@77fc36e
Thank you, Vladimir. Signed-off-by: Daniel Galvez <dgalvez@computelab-frontend-3.nvidia.com>
Force-pushed from 0c509e7 to 7bbbe3d
for more information, see https://pre-commit.ci
jenkins
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Force-pushed from 7117260 to 118c01a
It will crash in cuda-python 12.4.0. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Force-pushed from dc3d1ff to fb2bd7a
for more information, see https://pre-commit.ci
jenkins
jenkins
The previous failure seems to be spurious, caused by git clone failing. @artbataev I incorporated your change after verifying it on a multi-GPU machine. Thank you again. I made one more commit, fb2bd7a, which makes this work with cuda-python version 12.4.0 and greater. It turns out that the bug fix in that version makes the phGraph_out variable not writable, so I must not use my workaround when cuda-python > 12.3.0. Things are well tested at this point.
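A hedged sketch of that kind of version gate (the helper name is hypothetical; the actual check lives in the PR's cuda-python utilities):

```python
from packaging.version import Version

try:
    import cuda
    CUDA_PYTHON_VERSION = Version(cuda.__version__)
except ImportError:
    CUDA_PYTHON_VERSION = None


def needs_phgraph_out_workaround() -> bool:
    # cuda-python <= 12.3.0 has the bug the workaround patches around; from
    # 12.4.0 on, phGraph_out is no longer writable, so patching it would crash.
    return CUDA_PYTHON_VERSION is not None and CUDA_PYTHON_VERSION <= Version("12.3.0")
```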
Thanks a lot for the really speed-of-light decoding with Cuda graphs!
@galv I manually restarted Jenkins, but it is still waiting for an executor
@galv please fix the test failing on Jenkins (the guard is needed)
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Force-pushed from 1738de1 to e099042
jenkins
Sorry for missing the guard in that test. Hopefully things go through now.
jenkins
* Speed up RNN-T greedy decoding with cuda graphs This uses CUDA 12.3's conditional node support. Initialize cuda tensors lazily on first call of __call__ instead of __init__. We don't know what device is going to be used at construction time, and we can't rely on torch.nn.Module.to() to work here. See here: #8436 This fixes an error "Expected all tensors to be on the same device, but found at least two devices" that happens when you call to() on your torch.nn.Module after constructing it. #8191 (comment) Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
What does this PR do?
Speeds up RNN-T greedy decoding greatly, via cuda graphs with conditional nodes, by eliminating the ~90% of the time the GPU sits idle waiting on the CPU.
Here are some results for transcribing LibriSpeech test-other, a 5.4 hour dataset, on an A100, with bfloat16, at batch size 16: we get a 3.125x speedup with a 600 million parameter model, and a 2.65x speedup with a 1.1 billion parameter model.
This benchmark comes from running the following command. Note that I exclude the time required to create the cuda graph from the timing measurement; this fits an inference use case where graph creation is a one-time task.
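(The exact invocation presumably mirrors the evaluation runs quoted earlier in this thread, along these lines; flags shown are illustrative:)

```
python examples/asr/speech_to_text_eval.py \
    pretrained_name=stt_en_fastconformer_transducer_large \
    dataset_manifest=test_other.json \
    batch_size=16 \
    amp=false amp_dtype=bfloat16 \
    rnnt_decoding.greedy.use_cuda_graph_decoder=true \
    cuda=0
```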
This is a squashing of #7976. I wanted to squash that, but unfortunately I referenced a few commits in various bugs I filed, and didn't want the links to break.
Collection: ASR. Adds some utils for cuda-python to common.
Changelog
- Adds RNNTGreedyDecodeCudaGraph, which uses cuda graphs with conditional nodes to remove the CPU overhead.
Usage
Jenkins CI
To run Jenkins, a NeMo User with write access must comment "jenkins" on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.
Additional Information