Add ASR CTC inference tutorial #2106
Conversation
# ~~~~~~~~~~~~~~~~
#
def download_file(url, file):
This might be coming from the other tutorials, but I think we can simply use `torch.hub.download_url_to_file`.
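A minimal sketch of the reviewer's suggestion, replacing a hand-rolled download helper with `torch.hub.download_url_to_file`; the helper name `fetch` and the cache-dir layout are illustrative assumptions, not part of the tutorial:

```python
import os
import torch

def fetch(url: str, filename: str) -> str:
    """Download ``url`` into the torch.hub cache dir (if not cached) and return the local path."""
    # Hypothetical helper: torch.hub.get_dir() honors TORCH_HOME / torch.hub.set_dir
    target = os.path.join(torch.hub.get_dir(), filename)
    if not os.path.exists(target):
        torch.hub.download_url_to_file(url, target)
    return target
```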
#
# The tokens are the possible symbols that the acoustic model can predict,
# including the blank and silent symbols.
#
Similar to the lexicon below, can you show part of the contents of the token file?
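To illustrate what the reviewer is asking for, a token file for a character-level CTC model typically lists one symbol per line: the blank, a word-boundary marker, then letters. The exact contents below are an assumption for the sketch, not the tutorial's actual file:

```python
from io import StringIO
from itertools import islice

# Hypothetical head of a tokens.txt: "-" as blank, "|" as word boundary,
# followed by letters (assumed contents for illustration).
tokens_txt = StringIO("-\n|\ne\nt\na\no\nn\n")
head = [line.strip() for line in islice(tokens_txt, 5)]
print(head)  # ['-', '|', 'e', 't', 'a']
```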
speech_url = "https://pytorch.s3.amazonaws.com/torchaudio/tutorial-assets/ctc-decoding/8461-258277-0000.flac" | ||
speech_file = "_assets/speech.flac" | ||
speech_url = "https://pytorch.s3.amazonaws.com/torchaudio/tutorial-assets/ctc-decoding/8461-258277-0000.wav" | ||
speech_file = "/tmp/speech.wav" |
Can you use `torch.hub.get_dir`? That way the configurable default cache directory will be used. Hardcoding `/tmp` would leave artifacts behind when running locally.
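A one-line sketch of that suggestion, building the download target from `torch.hub.get_dir()` instead of a hardcoded `/tmp` path; the filename is an illustrative assumption:

```python
import os
import torch

# torch.hub.get_dir() respects TORCH_HOME and torch.hub.set_dir(), so the
# downloaded asset lands in the user's configured cache, not /tmp.
speech_file = os.path.join(torch.hub.get_dir(), "speech.wav")
print(speech_file)
```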
256f9e7 to ee9c8c9
Looks good. As a follow-up idea, we can add more detail on how the decoding result changes based on the decoding parameters.
Also, we should show the n-best results, which is something the greedy decoder cannot provide.
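For context on why greedy decoding cannot provide an n-best list: it collapses the single most probable frame sequence (merging repeats, dropping blanks) into exactly one hypothesis, whereas beam search retains multiple candidates. A minimal pure-Python sketch of the greedy collapse step, with assumed token ids and blank id:

```python
def greedy_ctc_collapse(ids, blank=0):
    """Collapse a per-frame argmax sequence: merge repeats, drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Frames: blank, 1, 1, blank, 2, 2, 2, blank, blank, 3
print(greedy_ctc_collapse([0, 1, 1, 0, 2, 2, 2, 0, 0, 3]))  # [1, 2, 3]
```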
# We see that the transcript with the lexicon-constrained beam search
# decoder consists of real words, while the greedy decoder can predict
# incorrectly spelled words like “hundrad”.
#
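A toy illustration of why the lexicon constraint rules out misspellings like “hundrad”: the beam search only emits words present in the lexicon, so any out-of-lexicon spelling is never a valid hypothesis. The word list here is an assumption for the example:

```python
# Hypothetical lexicon fragment (assumed words for illustration).
lexicon = {"i", "had", "that", "curiosity", "hundred"}

def all_in_lexicon(transcript: str, lexicon: set) -> bool:
    """True iff every word of the transcript appears in the lexicon."""
    return all(word in lexicon for word in transcript.split())

print(all_in_lexicon("i had that hundred", lexicon))  # True
print(all_in_lexicon("i had that hundrad", lexicon))  # False
```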
Please run `pre-commit run` once again.
@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: demonstrate usage of the CTC beam search decoder w/ lexicon constraint and KenLM support, on a LibriSpeech sample and using a pretrained wav2vec2 model

rendered: https://485200-90321822-gh.circle-artifacts.com/0/docs/tutorials/asr_inference_with_ctc_decoder_tutorial.html

follow-ups:
- incorporate `nbest`
- demonstrate customizability of different beam search parameters

Pull Request resolved: pytorch#2106
Reviewed By: mthrok
Differential Revision: D33340946
Pulled By: carolineechen
fbshipit-source-id: 0ab838375d96a035d54ed5b5bd9ab4dc8d19adb7