
wav2vec2: support datasets other than LibriSpeech #10581

Merged

Conversation

@elgeish (Contributor) commented Mar 8, 2021

What does this PR do?

Building on #10145, I'm adding support for the two other speech datasets available for ASR at the time of writing (besides LibriSpeech): timit_asr and arabic_speech_corpus. These require the following:

  • Custom validation split name
  • On-the-fly resampling to the target feature extractor's sampling rate (via librosa -- see requirements.txt); see the sketch after this list
  • A max-duration (in seconds) filter to remove outliers (which can crash GPU training by running out of memory)
  • Verbose logging to help debug custom datasets, including reverse transliteration (via lang-trans -- see requirements.txt)
  • Pre-processing: tokenization and text normalization using orthography rules
    • Casing (do_lower_case)
    • Custom vocab for tokens used in orthography (e.g., Buckwalter transliteration for Arabic)
    • Custom word delimiter token (when the default, "|", is used in orthography)
    • Transformations similar to those in jiwer to normalize transcripts for training
    • Removing special words (e.g., "sil" which can be used to indicate silence)
    • Translation table (e.g., "-" -> " " to break compounds like "quarter-century-old")
    • Cleaning up characters not in vocab (after applying the rules above)
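
To make the resampling and normalization steps above concrete, here is a minimal sketch. The function name, example fields, and default arguments are illustrative, not the exact code in run_asr.py:

```python
import librosa

def prepare_example(example, target_sr=16_000, vocab=None,
                    translation_table=None, words_to_remove=None):
    # Resample the audio to the feature extractor's sampling rate.
    speech, orig_sr = librosa.load(example["file"], sr=None)
    if orig_sr != target_sr:
        speech = librosa.resample(speech, orig_sr=orig_sr, target_sr=target_sr)
    # Normalize the transcript using the orthography rules described above.
    text = example["text"].lower()                                    # do_lower_case
    if translation_table is not None:
        text = text.translate(translation_table)                      # e.g. "-" -> " "
    if words_to_remove:
        text = " ".join(w for w in text.split() if w not in words_to_remove)  # e.g. "sil"
    if vocab is not None:
        text = "".join(c for c in text if c in vocab or c == " ")     # drop out-of-vocab chars
    example["speech"] = speech
    example["sampling_rate"] = target_sr
    example["text"] = " ".join(text.split())
    return example
```

With the datasets library, such a function could be applied via dataset.map(prepare_example).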

Arabic model: https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic
TIMIT models: https://huggingface.co/elgeish/wav2vec2-base-timit-asr and https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr

Who can review?

@patrickvonplaten @sgugger @LysandreJik @patil-suraj @SeanNaren

@elgeish elgeish changed the title wav2vec2: support datasets other than LibriSpeech [WIP] wav2vec2: support datasets other than LibriSpeech Mar 8, 2021
@elgeish (Contributor, Author) commented Mar 10, 2021

Thanks to a fix for timit_asr that @patrickvonplaten made, I now have some good results using wav2vec2-base:

[Screenshot: timit_asr fine-tuning results for PR #10581]

I'm running one for arabic_speech_corpus using wav2vec2-base as well. @patrickvonplaten let me know when you have the rest of the configs for https://huggingface.co/facebook/wav2vec2-large-xlsr uploaded so I can try it as well (or a workaround). Thanks!

@Getmany1 commented Mar 11, 2021

Thanks for adding this functionality! Your script worked successfully when I fine-tuned wav2vec2-base, xlsr (https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt), and wav2vec2-large-100k (the multilingual Large model from https://github.com/facebookresearch/voxpopuli#pre-trained-models, pre-trained on the VoxPopuli dataset) on the TIMIT dataset. When fine-tuning on another custom dataset, is it enough to set --orthography to timit in run_asr.py if the transcriptions are lowercased and librispeech if they are uppercased?

@elgeish (Contributor, Author) commented Mar 11, 2021

Thanks for adding this functionality! Your script worked successfully when I fine-tuned wav2vec2-base, xlsr (https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt), and wav2vec2-large-100k (the multilingual Large model from https://github.com/facebookresearch/voxpopuli#pre-trained-models, pre-trained on the VoxPopuli dataset) on the TIMIT dataset.

Thanks, @Getmany1 - could you do me a favor and run it with the arabic_speech_corpus dataset and --target_feature_extractor_sampling_rate --orthography buckwalter on xlsr to verify it works with the extended vocab? Unfortunately, I can't fit xlsr on my machine.

When fine-tuning on another custom dataset, is it enough to set --orthography to timit in run_asr.py if the transcriptions are lowercased and librispeech if they are uppercased?

For the most part, yes! Run it with --verbose_logging to see how the orthography rules pre-processed the text. Keep us posted!

@elgeish elgeish changed the title [WIP] wav2vec2: support datasets other than LibriSpeech wav2vec2: support datasets other than LibriSpeech Mar 11, 2021
@Getmany1 commented Mar 12, 2021

The XLSR model I used didn't work with this setup: the training loss was nan when I tried to fine-tune on the Arabic corpus. If I got it correctly, with --orthography buckwalter you modify the tokenizer only. However, if you load e.g.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
and check its structure
print(model)
you'll see that the last layer of the network is the LM head with the default vocabulary size:
(lm_head): Linear(in_features=1024, out_features=32, bias=True)
If I understand correctly, you need to convert the model manually if you want a letter vocabulary different from English.
I converted the fairseq xlsr checkpoint using this script:
transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py
together with a custom Swedish letter dictionary, and then succeeded in fine-tuning it on a Swedish corpus.
I guess you need to do the same for Arabic in order to have a proper LM head on top of the model.
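
For reference, a hedged sketch of wiring a custom letter vocabulary into a CTC tokenizer (the file name and letters are illustrative; this is not the conversion script itself):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Illustrative vocabulary; a real one covers the target language's full alphabet.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4}
for letter in ["a", "b", "c", "å", "ä", "ö"]:
    vocab[letter] = len(vocab)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)
print(len(tokenizer))  # this must match the LM head's out_features
```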

@elgeish elgeish changed the title wav2vec2: support datasets other than LibriSpeech [WIP] wav2vec2: support datasets other than LibriSpeech Mar 12, 2021
@elgeish (Contributor, Author) commented Mar 12, 2021

Yes, thanks! I missed that step (which is similar to PreTrainedModel.resize_token_embeddings()). I'm adding a method for that: Wav2Vec2ForCTC.resize_lm_head(), which simply calls PreTrainedModel._get_resized_lm_head() and updates the model config. The LM head looks good when inspecting the return value. I'll add some unit tests as well.

Now I'm fine-tuning wav2vec2-base on it, but I'm not expecting great results given the phonetic differences with Arabic. If you can try it again with xlsr, I'd appreciate it!
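
As a rough illustration of what resizing the CTC head amounts to (this is not the exact helper added in the PR, which reuses the library's internal _get_resized_lm_head):

```python
import torch.nn as nn

def resize_lm_head(model, new_vocab_size):
    # Illustrative only: a proper resize would also copy the overlapping rows of the
    # old head so tokens shared between the two vocabularies keep their weights.
    old_head = model.lm_head  # Linear(hidden_size, old_vocab_size)
    model.lm_head = nn.Linear(old_head.in_features, new_vocab_size)
    model.config.vocab_size = new_vocab_size  # keep the config in sync
    return model
```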

@Getmany1

Thanks @elgeish, the method seems to work correctly. Now the loss is not nan any more during the fine-tuning of the xlsr model. Training loss:

{'loss': 584.0067, 'learning_rate': 8.333333333333333e-05, 'epoch': 0.28}
{'loss': 291.7098, 'learning_rate': 0.00016666666666666666, 'epoch': 0.55}

Validation loss:

{'eval_loss': 374.4364929199219, 'eval_wer': 1.0, 'eval_runtime': 22.3424, 'eval_samples_per_second': 4.476, 'epoch': 0.28}
{'eval_loss': 374.20855712890625, 'eval_wer': 1.0, 'eval_runtime': 22.8504, 'eval_samples_per_second': 4.376, 'epoch': 0.55}

Predictions after epoch 0.55:

03/12/2021 12:07:54 - DEBUG - __main__ -   reference: "wayaquwlu lEulamA'u <in~ahu min gayri lmuraj~aHi >an tuTaw~ira lbaktiyryA lmuEdiyapu muqAwamapan Did~a lEilAji ljadiyd >al~a*iy >aSbaHa mutAHan biAlfiEl fiy $akli marhamin lil>amrADi ljildiy~api"
03/12/2021 12:07:54 - DEBUG - __main__ -   prediction: ""
03/12/2021 12:07:54 - DEBUG - __main__ -   reference: "wayumkinuka lHuSuwlu EalY taTbiyqAtin lilt~adriybAti l>asAsiy~api maj~Anan"
03/12/2021 12:07:54 - DEBUG - __main__ -   prediction: ""

Unfortunately, after that I always run out of memory in Colab. The recordings in the Arabic corpus are probably very long; as a result, even with batch_size=1, 16 GB of GPU memory is not enough:

RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 15.90 GiB total capacity; 14.80 GiB already allocated; 87.75 MiB free; 14.94 GiB reserved in total by PyTorch)
  2% 1050/54390 [07:16<6:09:57,  2.40it/s]

If you have more than 16 GB of GPU memory available, I can share the xlsr model so you can try it out on your machine.

@elgeish (Contributor, Author) commented Mar 12, 2021

I added a --max_duration_in_seconds filter. I'm now seeing OK results fine-tuning wav2vec2-base after 500 steps, for example:

reference: "wamin tilka ls~ilaE >al$~Ayu lS~iyniy~u wAlwaraqu wAlbAruwdu wAlbuwSilapu"
prediction: "wamin tiloka Als~ila>a$~aAyu AS~iyniy~u walowaraqu waAlobaAruwdu waAlobuwSilapu"

I've also started fine-tuning wav2vec2-large-xlsr-53 after adding the missing files locally on my machine.

For future PRs, I'm thinking of:

  • Supporting other languages and examples from Patrick's awesome blog post, which I need to read
  • CER and WER transformations for other languages (e.g., ignoring tashkil in Arabic)
  • Supporting Lhotse datasets, which provide a ton of speech-specific functionality

Before adding unit tests, I want to check with @patrickvonplaten that the code here is on the right track.
@Getmany1 you've been super helpful, thank you!

@elgeish elgeish changed the title [WIP] wav2vec2: support datasets other than LibriSpeech wav2vec2: support datasets other than LibriSpeech Mar 14, 2021
@patrickvonplaten (Contributor) left a comment

Hey @elgeish,

Thanks a lot for your PR! Sorry that I only reviewed it now. In general I like the changes you have made to run_wav2vec2.py :-)

A couple of things I'd like to change:

  • Remove the change from modeling_wav2vec2.py. I don't really want a resize lm head method in the model there -> it's too much of an edge case IMO. You can simply first instantiate the processor then use the tokenizer length to load the model with Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer)). This way we don't need a resize_lm_head() method.
  • Can you give me more context on the buckwalter.json file?
  • Can you also add more text to the README.md that explains a bit how to use the run_wav2vec2.py script? E.g. what does the orthography class do, what is the buckwalter vocab?

@elgeish (Contributor, Author) commented Mar 17, 2021

Hey @elgeish,

Thanks a lot for your PR! Sorry that I only reviewed it now. In general I like the changes you have made to run_wav2vec2.py :-)

A couple of things, I'd like to change:

  • Remove the change from modeling_wav2vec2.py. I don't really want a resize lm head method in the model there -> it's too much of an edge case IMO. You can simply first instantiate the processor then use the tokenizer length to load the model with Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer)). This way we don't need a resize_lm_head() method.
  • Can you give me more context on the buckwalter.json file?
  • Can you also add more text to the README.md that explains a bit how to use the run_wav2vec2.py script? E.g. what does the orthography class do, what is the buckwalter vocab?

Thank you! I responded inline. I'll update README.md as well. I think you mean run_asr.py, no?
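
For reference, a brief sketch of the loading approach suggested in the review above (the vocab path and feature extractor arguments are illustrative):

```python
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Build the processor first so the tokenizer length is known up front.
tokenizer = Wav2Vec2CTCTokenizer("vocab/buckwalter.json")  # hypothetical path to the custom vocab
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16_000, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Size the freshly initialized CTC head from the tokenizer, as suggested,
# instead of resizing the head after loading.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(processor.tokenizer),
)
```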

Feel free to ask a question on the [Forum](https://discuss.huggingface.co/) or post an issue on [GitHub](https://github.com/huggingface/transformers/issues/new/choose) and add `@patrickvonplaten` as a tag.

### Fine-Tuning with TIMIT
Contributor (inline review comment):
Very nice explanations - thanks for adding this!

Contributor Author (reply):
My pleasure! Thanks for the review and nudges :)

@patrickvonplaten (Contributor) commented

Great, the PR looks good to me now! Thanks a lot for doing this :-) The failures seem unrelated, so I'll rerun the tests to make the CI happy.

@patrickvonplaten patrickvonplaten merged commit af8afdc into huggingface:master Mar 18, 2021
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* wav2vec2: support datasets other than LibriSpeech

* Formatting run_asr.py to pass code quality test

* bundled orthography options and added verbose logs

* fixing a typo in timit fine-tuning script

* update comment for clarity

* resize_lm_head and load custom vocab from file

* adding a max_duration_in_seconds filter

* do not assign `duration_filter` lambda, use a def

* log untransliterated text as well

* fix base model for arabic

* fix duration filter when target_sr is not set

* drop duration_in_seconds when unneeded

* script for wav2vec2-large-lv60-timit-asr

* fix for "tha" in arabic corpus (huggingface#10581)

* adding more options to work with common_voice

* PR feedback (huggingface#10581)

* small README change