wav2vec2: support datasets other than LibriSpeech #10581
Conversation
Thanks to a fix for […], I'm running one for […].
Thanks for adding this functionality! Your script worked successfully when I fine-tuned […].
Thanks, @Getmany1 - can you do me a favor and run it with […]?
Yes, thanks! I missed that step (which is similar to […]). Now I'm fine-tuning it on […].
Thanks @elgeish, the method seems to work correctly. Now the loss is not […].
Validation loss: […]
Predictions after epoch 0.55: […]
Unfortunately, after that I always run out of memory in Colab. The recordings in the Arabic corpus are probably very long; as a result, even with […]
If you have more than 16 GB of GPU memory available, I can share the […].
I added a `max_duration_in_seconds` filter.
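For illustration only (this is not the PR's actual code), a minimal sketch of such a duration filter; the cutoff value and file names below are hypothetical:

```python
# Illustrative sketch of a max-duration filter to keep very long recordings
# out of memory; the cutoff and file names are assumptions, not the PR's values.
import librosa

target_sr = 16_000
max_duration_in_seconds = 20.0  # assumed cutoff

def within_duration_limit(path: str) -> bool:
    # librosa resamples to target_sr on load; duration = samples / sampling rate
    speech, sampling_rate = librosa.load(path, sr=target_sr)
    return len(speech) / sampling_rate <= max_duration_in_seconds

# Keep only the short-enough recordings (file list is hypothetical)
files = ["clip_0.wav", "clip_1.wav"]
short_files = [f for f in files if within_duration_limit(f)]
```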
I've also started fine-tuning wav2vec2-large-xlsr-53 after adding missing files locally on my machine. For future PRs, I'm thinking of: […]
Before adding unit tests, I want to check with @patrickvonplaten that the code here is on the right track.
Hey @elgeish,
Thanks a lot for your PR! Sorry that I only reviewed it now. In general, I like the changes you have made to `run_wav2vec2.py` :-)
A couple of things I'd like to change:
- Remove the change from `modeling_wav2vec2.py`. I don't really want a resize lm head method in the model there -> it's too much of an edge case IMO. You can simply instantiate the processor first, then use the tokenizer length to load the model with `Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer))`. This way we don't need a `resize_lm_head()` method (see the sketch after this list).
- Can you give me more context on the `buckwalter.json` file?
- Can you also add more text to the README.md that explains a bit how to use the `run_wav2vec2.py` script? E.g. what does the orthography class do, what is the buckwalter vocab?
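For illustration, a minimal sketch of the pattern suggested above, assuming a custom vocab file and base checkpoint (the file and checkpoint names here are placeholders, not necessarily what the PR uses):

```python
# Sketch only: build the processor first, then let the tokenizer length
# determine the size of the CTC head instead of resizing the LM head in the model.
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_path = "facebook/wav2vec2-large-xlsr-53"        # assumed base checkpoint
tokenizer = Wav2Vec2CTCTokenizer("buckwalter.json")   # custom vocab file
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# vocab_size is forwarded to the config, so the CTC head matches the tokenizer.
model = Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer))
```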
Thank you! I responded inline. I'll update […].
Feel free to ask a question on the [Forum](https://discuss.huggingface.co/) or post an issue on [GitHub](https://github.com/huggingface/transformers/issues/new/choose), adding `@patrickvonplaten` as a tag.

### Fine-Tuning with TIMIT
Very nice explanations - thanks for adding this!
My pleasure! Thanks for the review and nudges :)
Great, the PR looks good to me now! Thanks a lot for doing this :-) The failures seem unrelated, so I'll rerun the tests again to make the CI happy.
* wav2vec2: support datasets other than LibriSpeech
* Formatting run_asr.py to pass code quality test
* bundled orthography options and added verbose logs
* fixing a typo in timit fine-tuning script
* update comment for clarity
* resize_lm_head and load custom vocab from file
* adding a max_duration_in_seconds filter
* do not assign `duration_filter` lambda, use a def
* log untransliterated text as well
* fix base model for arabic
* fix duration filter when target_sr is not set
* drop duration_in_seconds when unneeded
* script for wav2vec2-large-lv60-timit-asr
* fix for "tha" in arabic corpus (huggingface#10581)
* adding more options to work with common_voice
* PR feedback (huggingface#10581)
* small README change
What does this PR do?
Building on #10145, I'm adding support for the two other speech datasets (besides LibriSpeech) for ASR at the time of writing (`timit_asr` and `arabic_speech_corpus`), which require the following:
- resampling audio (using `librosa` -- see `requirements.txt`)
- transliterating Arabic text to Buckwalter (using `lang-trans` -- see `requirements.txt`)
- text normalization options (e.g., `do_lower_case`)
- a configurable word-delimiter token (the default, `"|"`, is used in orthography)
- using `jiwer` to normalize transcripts for training (see the sketch below)

Arabic model: https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic
TIMIT models: https://huggingface.co/elgeish/wav2vec2-base-timit-asr and https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr
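For illustration only, a rough sketch of a couple of the preprocessing steps listed above (not the PR's actual code; the file name is hypothetical, and the `lang_trans` Buckwalter helpers are assumed to expose `transliterate`/`untransliterate` as shown):

```python
# Sketch of the preprocessing ideas: resample audio with librosa, normalize
# transcripts with jiwer-style transforms, and transliterate Arabic text to
# Buckwalter. Names marked "assumed" or "hypothetical" are not from the PR.
import librosa
import jiwer
from lang_trans.arabic import buckwalter  # assumed module path in lang-trans

target_sr = 16_000

# Resample an audio file to the feature extractor's sampling rate
speech, sampling_rate = librosa.load("recording.wav", sr=target_sr)  # hypothetical file

# Normalize a transcript before training (lowercasing, whitespace cleanup)
normalize = jiwer.Compose([jiwer.ToLowerCase(), jiwer.RemoveMultipleSpaces(), jiwer.Strip()])
transcript = normalize("  An Example TRANSCRIPT  ")

# Transliterate Arabic to Buckwalter so it matches a vocab like buckwalter.json;
# untransliterating is handy for logging human-readable predictions.
buckwalter_text = buckwalter.transliterate("مرحبا")
readable_text = buckwalter.untransliterate(buckwalter_text)
```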
Who can review?
@patrickvonplaten @sgugger @LysandreJik @patil-suraj @SeanNaren