Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles #2203
@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The script looks okay, but code-wise what is the difference from the equivalent one from librispeech_emformer_rnnt?
It's actually the same; only the pipeline-related part is different.
ec63c19 to 6393298
6393298 to e53b7c5
the readme also references pipeline_demo.py; can you update it to be consistent with the changes here?
in the video, why do the word pieces for each streaming transcription show up all at once?
I guess it's because it was run on an AWS cluster, so there may be some delay when printing to the screen.
in that case, for the screen capture, can you run the script locally? that way, we can clearly show users what we mean by streaming asr and how responsive it is using the bundles (example: #2192)
Sure. The demo video has been updated @hwangjeff
looks good — thanks!
logger = logging.getLogger()


def get_dataset(model_type, dataset_path):
Instead of repeating the key validation here and there, put these options in a dictionary, then pass the keys to the choices argument of argparse, so that the closed set of available options is defined once and only once.
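A minimal sketch of this suggestion, not the code that was merged. The registry name `MODEL_TYPES`, the keys `"librispeech"` and `"tedlium3"`, the `--model-type` flag, and both helper functions are assumptions made for illustration; only the bundle names come from this pull request, and they are stored as plain strings so the sketch runs without torchaudio.

```python
from argparse import ArgumentParser

# Hypothetical registry: the keys are illustrative; the values are the bundle
# names mentioned in this pull request, kept as strings for the sketch.
MODEL_TYPES = {
    "librispeech": "EMFORMER_RNNT_BASE_LIBRISPEECH",
    "tedlium3": "EMFORMER_RNNT_BASE_TEDLIUM3",
}


def build_parser():
    parser = ArgumentParser()
    # The dictionary keys are the single source of truth for valid options,
    # so argparse rejects anything else before the script does any work.
    parser.add_argument(
        "--model-type",
        type=str,
        choices=sorted(MODEL_TYPES),
        required=True,
    )
    return parser


def lookup_bundle_name(model_type):
    # No extra validation needed here: argparse has already restricted the
    # value to a key of MODEL_TYPES.
    return MODEL_TYPES[model_type]
```

With this layout, adding a new variant means adding one dictionary entry; the parser and every lookup pick it up automatically.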
def parse_args():
    parser = ArgumentParser()
Please add a module-level docstring that describes the gist of this script, then pass it to the help description so that it's easy to see what this script does.
@@ -0,0 +1,94 @@
import logging
Please add a shebang line.
…ytorch#2203)

Summary:
We refactored the demo script that can apply RNNT decoding using both `torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH` and `torchaudio.prototype.pipelines.EMFORMER_RNNT_BASE_TEDLIUM3` in both streaming and non-streaming mode. (The first hypothesis prediction is streaming and the second one is non-streaming).

We convert each token id sequence to word pieces and then manually join the word pieces. This allows us to preserve leading whitespaces on output strings and therefore account for word breaks and continuations across token processor invocations, which is particularly useful when performing streaming ASR.

https://user-images.githubusercontent.com/8653221/153627956-f0806f18-3c1c-44df-ac07-ec2def58a0cf.mov

Pull Request resolved: pytorch#2203
Reviewed By: carolineechen
Differential Revision: D34006388
Pulled By: nateanl
fbshipit-source-id: 3d31173ee10cdab8a2f5802570e22b50fcce5632
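The word-piece joining described in the summary can be illustrated with a small sketch. It is not the code from this pull request: the toy vocabulary `TOY_VOCAB` and the function `ids_to_text` are hypothetical stand-ins for a real tokenizer, and the sketch assumes the SentencePiece convention of marking word starts with U+2581.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece-style marker for the start of a word

# Hypothetical toy vocabulary standing in for a real tokenizer's id-to-piece map.
TOY_VOCAB = {
    0: "\u2581he",
    1: "llo",
    2: "\u2581wor",
    3: "ld",
}


def ids_to_text(token_ids, vocab=TOY_VOCAB):
    # Step 1: convert token ids to word pieces.
    pieces = [vocab[token_id] for token_id in token_ids]
    # Step 2: join manually and turn boundary markers into spaces. No
    # stripping is done, so a leading space survives and a piece without
    # a marker continues the previous word.
    return "".join(pieces).replace(WORD_BOUNDARY, " ")


# Two consecutive streaming chunks: the second one continues a word.
chunk_1 = ids_to_text([0, 1, 2])
chunk_2 = ids_to_text([3])
transcript = chunk_1 + chunk_2
```

Here `chunk_1` ends in the middle of a word and `chunk_2` has no leading space, so plain concatenation reconstructs the word correctly. Had each chunk been decoded and stripped independently, the break between chunks would be ambiguous.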