Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles #2203

Closed
wants to merge 5 commits into from

Conversation

nateanl
Member

@nateanl nateanl commented Feb 4, 2022

We refactored the demo script so that it can apply RNNT decoding with either `torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH` or `torchaudio.prototype.pipelines.EMFORMER_RNNT_BASE_TEDLIUM3`, in both streaming and non-streaming modes. (The first hypothesis printed is the streaming prediction; the second is the non-streaming one.)

We convert each token id sequence to word pieces and then manually join the word pieces. This allows us to preserve leading whitespaces on output strings and therefore account for word breaks and continuations across token processor invocations, which is particularly useful when performing streaming ASR.
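
As a rough illustration of the joining step described above, here is a minimal sketch. It assumes SentencePiece-style word pieces in which U+2581 marks the start of a word; the toy vocabulary `ID_TO_PIECE` and the helper `tokens_to_text` are hypothetical and are not taken from the script in this PR.

```python
# Assumption: word pieces use the SentencePiece-style marker U+2581
# to signal "a new word starts here".
WORD_BOUNDARY = "\u2581"

# Hypothetical toy vocabulary, for illustration only.
ID_TO_PIECE = {0: "\u2581hel", 1: "lo", 2: "\u2581wor", 3: "ld"}


def tokens_to_text(token_ids, lstrip=False):
    """Map token ids to word pieces and join them manually."""
    pieces = [ID_TO_PIECE[i] for i in token_ids]
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Stripping the leading space discards the word-break information
    # that a streaming caller needs when concatenating chunk outputs.
    return text.lstrip() if lstrip else text


# Token ids as they might arrive across three streaming invocations.
chunks = [[0, 1], [2], [3]]

kept = "".join(tokens_to_text(c) for c in chunks)
stripped = "".join(tokens_to_text(c, lstrip=True) for c in chunks)
print(repr(kept))
print(repr(stripped))
```

In this toy case, keeping the leading whitespace lets the chunk outputs concatenate into ` hello world`, whereas stripping each chunk merges the words into `helloworld`.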

demo.mov

@facebook-github-bot
Contributor

@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@mthrok mthrok left a comment

The script looks okay, but code-wise what is the difference from the equivalent one from librispeech_emformer_rnnt?

@nateanl
Member Author

nateanl commented Feb 9, 2022

> code-wise what is the difference from the equivalent one from librispeech_emformer_rnnt?

It's actually the same; only the pipeline-related part is different.

@nateanl nateanl force-pushed the rnnt_tedlium_pipeline branch from ec63c19 to 6393298 Compare February 10, 2022 21:08
@nateanl nateanl force-pushed the rnnt_tedlium_pipeline branch from 6393298 to e53b7c5 Compare February 10, 2022 21:10
Contributor

@hwangjeff hwangjeff left a comment

the readme also references pipeline_demo.py — can you update it to be consistent with the changes here?

@nateanl nateanl requested review from hwangjeff and mthrok February 11, 2022 06:54
@nateanl nateanl changed the title from "Add demo script for EMFORMER_RNNT_BASE_TEDLIUM3 pipeline" to "Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles" Feb 11, 2022
Contributor

@hwangjeff hwangjeff left a comment

in the video, why do the word pieces for each streaming transcription show up all at once?

@nateanl
Member Author

nateanl commented Feb 11, 2022

> why do the word pieces for each streaming transcription show up all at once?

I guess it's because it was run on an AWS cluster, so there may be some delay when printing to the screen.

@nateanl nateanl requested a review from hwangjeff February 11, 2022 15:33
@hwangjeff
Contributor

> I guess it's because it was run on an AWS cluster, so there may be some delay when printing to the screen.

in that case, for the screen capture, can you run the script locally? that way, we can clearly show users what we mean by streaming asr and how responsive it is using the bundles (example: #2192)

@nateanl
Member Author

nateanl commented Feb 11, 2022

Sure. The demo video has been updated @hwangjeff

Contributor

@hwangjeff hwangjeff left a comment

looks good — thanks!

logger = logging.getLogger()


def get_dataset(model_type, dataset_path):
Collaborator

Instead of repeating the key validation here and there, put these options in a dictionary, then pass the keys to the `choices` argument of argparse, so that the closed set of available options is defined once and only once.
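
A minimal sketch of that suggestion, under stated assumptions: the dictionary name `MODEL_TYPE_TO_CONFIG`, its keys, its placeholder values, and the `--model-type` flag are all hypothetical and only illustrate the pattern, not the script's actual options.

```python
from argparse import ArgumentParser

# Hypothetical registry: the keys are the only valid model types.
# The values are placeholders; a real script would store whatever each
# model type needs (for example a bundle and a dataset constructor).
MODEL_TYPE_TO_CONFIG = {
    "librispeech": "config for LibriSpeech",
    "tedlium3": "config for TED-LIUM 3",
}


def build_parser():
    parser = ArgumentParser()
    # The registry keys double as the closed set of accepted values,
    # so argparse rejects anything else before the script proper runs.
    parser.add_argument(
        "--model-type",
        type=str,
        choices=sorted(MODEL_TYPE_TO_CONFIG),
        required=True,
    )
    return parser


def get_config(model_type):
    # No extra validation needed: argparse has already checked the key.
    return MODEL_TYPE_TO_CONFIG[model_type]
```

With this layout, adding a new option means adding one dictionary entry; the parser and the lookup pick it up without further changes.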



def parse_args():
    parser = ArgumentParser()
Collaborator

Please add a module-level docstring that describes the gist of this script, then pass it to the parser as its description so that the help message makes it easy to see what this script does.
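
One way this could look, as a sketch only: the docstring text and the `build_parser` helper are hypothetical, while `description` and `RawTextHelpFormatter` are standard argparse features.

```python
"""Hypothetical summary line describing what the demo script does.

Any further lines of the module docstring also appear in the help output.
"""
from argparse import ArgumentParser, RawTextHelpFormatter


def build_parser(description=__doc__):
    # Passing the module docstring as the description makes `--help`
    # print it; RawTextHelpFormatter keeps its line breaks intact.
    return ArgumentParser(
        description=description,
        formatter_class=RawTextHelpFormatter,
    )
```

The docstring then serves both readers of the source and users running the script with `--help`, so the summary is written only once.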

@@ -0,0 +1,94 @@
import logging
Collaborator

Please add a shebang line.

xiaohui-zhang pushed a commit to xiaohui-zhang/audio that referenced this pull request May 4, 2022
…ytorch#2203)

Summary:
We refactored the demo script that can apply RNNT decoding using both `torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH` and `torchaudio.prototype.pipelines.EMFORMER_RNNT_BASE_TEDLIUM3` in both streaming and non-streaming mode. (The first hypothesis prediction is streaming and the second one is non-streaming).

We convert each token id sequence to word pieces and then manually join the word pieces. This allows us to preserve leading whitespaces on output strings and therefore account for word breaks and continuations across token processor invocations, which is particularly useful when performing streaming ASR.

https://user-images.githubusercontent.com/8653221/153627956-f0806f18-3c1c-44df-ac07-ec2def58a0cf.mov

Pull Request resolved: pytorch#2203

Reviewed By: carolineechen

Differential Revision: D34006388

Pulled By: nateanl

fbshipit-source-id: 3d31173ee10cdab8a2f5802570e22b50fcce5632
4 participants