
wav2vec2: support datasets other than LibriSpeech #10581

Merged

Conversation

@elgeish (Contributor) commented Mar 8, 2021

What does this PR do?

Building on #10145, I'm adding support for the two other speech datasets available for ASR at the time of writing (besides LibriSpeech): timit_asr and arabic_speech_corpus. These require the following:

  • Custom validation split name
  • On-the-fly resampling to the target feature extractor's sampling rate (via librosa -- see requirements.txt); see the sketch after this list
  • A max-duration (in seconds) filter to remove outliers (which can crash GPU training by running out of memory)
  • Verbose logging to help debug custom datasets, including reverse transliteration (via lang-trans -- see requirements.txt)
  • Pre-processing: tokenization and text normalization using orthography rules
    • Casing (do_lower_case)
    • Custom vocab for tokens used in orthography (e.g., Buckwalter transliteration for Arabic)
    • Custom word delimiter token (when the default, "|", is used in orthography)
    • Transformations similar to those in jiwer to normalize transcripts for training
    • Removing special words (e.g., "sil" which can be used to indicate silence)
    • Translation table (e.g., "-" -> " " to break compounds like "quarter-century-old")
    • Cleaning up characters not in vocab (after applying the rules above)
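
To make the resampling and normalization steps above concrete, here is a minimal sketch. The function name, example fields, and default arguments are illustrative, not the exact code in run_asr.py:

```python
import librosa

def prepare_example(example, target_sr=16_000, vocab=None,
                    translation_table=None, words_to_remove=None):
    # Resample the audio to the feature extractor's sampling rate.
    speech, orig_sr = librosa.load(example["file"], sr=None)
    if orig_sr != target_sr:
        speech = librosa.resample(speech, orig_sr=orig_sr, target_sr=target_sr)
    # Normalize the transcript using the orthography rules described above.
    text = example["text"].lower()                                    # do_lower_case
    if translation_table is not None:
        text = text.translate(translation_table)                      # e.g. "-" -> " "
    if words_to_remove:
        text = " ".join(w for w in text.split() if w not in words_to_remove)  # e.g. "sil"
    if vocab is not None:
        text = "".join(c for c in text if c in vocab or c == " ")     # drop out-of-vocab chars
    example["speech"] = speech
    example["sampling_rate"] = target_sr
    example["text"] = " ".join(text.split())
    return example
```

With the datasets library, such a function could be applied via dataset.map(prepare_example).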

Arabic model: https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic
TIMIT models: https://huggingface.co/elgeish/wav2vec2-base-timit-asr and https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr

Who can review?

@patrickvonplaten @sgugger @LysandreJik @patil-suraj @SeanNaren

@elgeish elgeish changed the title wav2vec2: support datasets other than LibriSpeech [WIP] wav2vec2: support datasets other than LibriSpeech Mar 8, 2021
@elgeish (Contributor, Author) commented Mar 10, 2021

Thanks to a fix for timit_asr that @patrickvonplaten made, I now have some good results using wav2vec2-base:

[Screenshot: timit_asr fine-tuning results for PR #10581]

I'm running one for arabic_speech_corpus using wav2vec2-base as well. @patrickvonplaten let me know when you have the rest of the configs for https://huggingface.co/facebook/wav2vec2-large-xlsr uploaded so I can try it as well (or a workaround). Thanks!

@Getmany1 commented Mar 11, 2021

Thanks for adding this functionality! Your script worked successfully when I fine-tuned wav2vec2-base, xlsr (https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt), and wav2vec2-large-100k (the multilingual Large model from https://github.com/facebookresearch/voxpopuli#pre-trained-models, pre-trained on the VoxPopuli dataset) on the TIMIT dataset. When fine-tuning on another custom dataset, is it enough to set --orthography to timit in run_asr.py if the transcriptions are lowercased and librispeech if they are uppercased?

@elgeish (Contributor, Author) commented Mar 11, 2021

Thanks for adding this functionality! Your script worked successfully when I fine-tuned wav2vec2-base, xlsr (https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt), and wav2vec2-large-100k (the multilingual Large model from https://github.com/facebookresearch/voxpopuli#pre-trained-models, pre-trained on the VoxPopuli dataset) on the TIMIT dataset.

Thanks, @Getmany1 - could you do me a favor and run it with the arabic_speech_corpus dataset and --target_feature_extractor_sampling_rate --orthography buckwalter on xlsr to verify it works with the extended vocab? Unfortunately, I can't fit xlsr on my machine.

When fine-tuning on another custom dataset, is it enough to set --orthography to timit in run_asr.py if the transcriptions are lowercased and librispeech if they are uppercased?

For the most part, yes! Run it with --verbose_logging to see how the orthography rules pre-processed the text. Keep us posted!

@elgeish elgeish changed the title [WIP] wav2vec2: support datasets other than LibriSpeech wav2vec2: support datasets other than LibriSpeech Mar 11, 2021
@Getmany1 commented Mar 12, 2021

The XLSR model I used didn't work with this setup: the training loss was nan when I tried to fine-tune on the Arabic corpus. If I got it correctly, with --orthography buckwalter you modify the tokenizer only. However, if you load e.g.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
and check its structure
print(model)
you'll see that the last layer of the network is the LM head with the default vocabulary size:
(lm_head): Linear(in_features=1024, out_features=32, bias=True)
If I understand correctly, you need to convert the model manually if you want a letter vocabulary different from English.
I converted the fairseq xlsr checkpoint using this script:
transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py
together with a custom Swedish letter dictionary, and then succeeded in fine-tuning it on a Swedish corpus.
I guess you need to do the same for Arabic in order to have a proper LM head on top of the model.
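
For reference, a hedged sketch of wiring a custom letter vocabulary into a CTC tokenizer (the file name and letters are illustrative; this is not the conversion script itself):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Illustrative vocabulary; a real one covers the target language's full alphabet.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4}
for letter in ["a", "b", "c", "å", "ä", "ö"]:
    vocab[letter] = len(vocab)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)
print(len(tokenizer))  # this must match the LM head's out_features
```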

@elgeish elgeish changed the title wav2vec2: support datasets other than LibriSpeech [WIP] wav2vec2: support datasets other than LibriSpeech Mar 12, 2021
@elgeish (Contributor, Author) commented Mar 12, 2021

Yes, thanks! I missed that step (which is similar to PreTrainedModel.resize_token_embeddings()). I'm adding a method for that: Wav2Vec2ForCTC.resize_lm_head(), which simply calls PreTrainedModel._get_resized_lm_head() and updates the model config. The LM head looks good when inspecting the return value. I'll add some unit tests as well.

Now I'm fine-tuning wav2vec2-base on it, but I'm not expecting great results given the phonetic differences with Arabic. If you can try it again with xlsr, I'd appreciate it!
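
As a rough illustration of what resizing the CTC head amounts to (this is not the exact helper added in the PR, which reuses the library's internal _get_resized_lm_head):

```python
import torch.nn as nn

def resize_lm_head(model, new_vocab_size):
    # Illustrative only: a proper resize would also copy the overlapping rows of the
    # old head so tokens shared between the two vocabularies keep their weights.
    old_head = model.lm_head  # Linear(hidden_size, old_vocab_size)
    model.lm_head = nn.Linear(old_head.in_features, new_vocab_size)
    model.config.vocab_size = new_vocab_size  # keep the config in sync
    return model
```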

@Getmany1

Thanks @elgeish, the method seems to work correctly. Now the loss is not nan any more during the fine-tuning of the xlsr model. Training loss:

{'loss': 584.0067, 'learning_rate': 8.333333333333333e-05, 'epoch': 0.28}
{'loss': 291.7098, 'learning_rate': 0.00016666666666666666, 'epoch': 0.55}

Validation loss:

{'eval_loss': 374.4364929199219, 'eval_wer': 1.0, 'eval_runtime': 22.3424, 'eval_samples_per_second': 4.476, 'epoch': 0.28}
{'eval_loss': 374.20855712890625, 'eval_wer': 1.0, 'eval_runtime': 22.8504, 'eval_samples_per_second': 4.376, 'epoch': 0.55}

Predictions after epoch 0.55:

03/12/2021 12:07:54 - DEBUG - __main__ -   reference: "wayaquwlu lEulamA'u <in~ahu min gayri lmuraj~aHi >an tuTaw~ira lbaktiyryA lmuEdiyapu muqAwamapan Did~a lEilAji ljadiyd >al~a*iy >aSbaHa mutAHan biAlfiEl fiy $akli marhamin lil>amrADi ljildiy~api"
03/12/2021 12:07:54 - DEBUG - __main__ -   prediction: ""
03/12/2021 12:07:54 - DEBUG - __main__ -   reference: "wayumkinuka lHuSuwlu EalY taTbiyqAtin lilt~adriybAti l>asAsiy~api maj~Anan"
03/12/2021 12:07:54 - DEBUG - __main__ -   prediction: ""

Unfortunately, after that I always run out of memory in Colab. The recordings in the Arabic corpus are probably very long; as a result, even with batch_size=1, 16 GB of GPU memory is not enough:

RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 15.90 GiB total capacity; 14.80 GiB already allocated; 87.75 MiB free; 14.94 GiB reserved in total by PyTorch)
  2% 1050/54390 [07:16<6:09:57,  2.40it/s]

If you have more than 16 GB of GPU memory available, I can share the xlsr model so you can try it out on your machine.

@elgeish (Contributor, Author) commented Mar 12, 2021

I added a --max_duration_in_seconds filter. I'm now seeing OK results fine-tuning wav2vec2-base after 500 steps, for example:

reference: "wamin tilka ls~ilaE >al$~Ayu lS~iyniy~u wAlwaraqu wAlbAruwdu wAlbuwSilapu"
prediction: "wamin tiloka Als~ila>a$~aAyu AS~iyniy~u walowaraqu waAlobaAruwdu waAlobuwSilapu"

I've also started fine-tuning wav2vec2-large-xlsr-53 after adding the missing files locally on my machine.

For future PRs, I'm thinking of:

  • Supporting other languages and examples from Patrick's awesome blog post, which I need to read
  • CER and WER transformations for other languages (e.g., ignoring tashkil in Arabic)
  • Supporting Lhotse datasets, which provide a ton of speech-specific functionality

Before adding unit tests, I want to check with @patrickvonplaten that the code here is on the right track.
@Getmany1 you've been super helpful, thank you!

@elgeish elgeish changed the title [WIP] wav2vec2: support datasets other than LibriSpeech wav2vec2: support datasets other than LibriSpeech Mar 14, 2021
@patrickvonplaten (Contributor) left a comment

Hey @elgeish,

Thanks a lot for your PR! Sorry that I only reviewed it now. In general I like the changes you have made to run_wav2vec2.py :-)

A couple of things I'd like to change:

  • Remove the change from modeling_wav2vec2.py. I don't really want a resize lm head method in the model there -> it's too much of an edge case IMO. You can simply first instantiate the processor then use the tokenizer length to load the model with Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer)). This way we don't need a resize_lm_head() method.
  • Can you give me more context on the buckwalter.json file?
  • Can you also add more text to the README.md that explains a bit how to use the run_wav2vec2.py script? E.g. what does the orthography class do, what is the buckwalter vocab?

@elgeish (Contributor, Author) commented Mar 17, 2021

Hey @elgeish,

Thanks a lot for your PR! Sorry that I only reviewed it now. In general I like the changes you have made to run_wav2vec2.py :-)

A couple of things, I'd like to change:

  • Remove the change from modeling_wav2vec2.py. I don't really want a resize lm head method in the model there -> it's too much of an edge case IMO. You can simply first instantiate the processor then use the tokenizer length to load the model with Wav2Vec2ForCTC.from_pretrained(model_path, vocab_size=len(tokenizer)). This way we don't need a resize_lm_head() method.
  • Can you give me more context on the buckwalter.json file?
  • Can you also add more text to the README.md that explains a bit how to use the run_wav2vec2.py script? E.g. what does the orthography class do, what is the buckwalter vocab?

Thank you! I responded inline. I'll update README.md as well. I think you mean run_asr.py, no?
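
For reference, a brief sketch of the loading approach suggested in the review above (the vocab path and feature extractor arguments are illustrative):

```python
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Build the processor first so the tokenizer length is known up front.
tokenizer = Wav2Vec2CTCTokenizer("vocab/buckwalter.json")  # hypothetical path to the custom vocab
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16_000, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Size the freshly initialized CTC head from the tokenizer, as suggested,
# instead of resizing the head after loading.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(processor.tokenizer),
)
```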

Feel free to ask a question on the [Forum](https://discuss.huggingface.co/) or post an issue on [GitHub](https://github.com/huggingface/transformers/issues/new/choose) and add `@patrickvonplaten` as a tag.

### Fine-Tuning with TIMIT
Contributor (inline review comment):
Very nice explanations - thanks for adding this!

Contributor Author (reply):
My pleasure! Thanks for the review and nudges :)

@patrickvonplaten (Contributor) commented

Great, the PR looks good to me now! Thanks a lot for doing this :-) The failures seem unrelated, so I'll rerun the tests to make the CI happy.

@patrickvonplaten patrickvonplaten merged commit af8afdc into huggingface:master Mar 18, 2021
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* wav2vec2: support datasets other than LibriSpeech

* Formatting run_asr.py to pass code quality test

* bundled orthography options and added verbose logs

* fixing a typo in timit fine-tuning script

* update comment for clarity

* resize_lm_head and load custom vocab from file

* adding a max_duration_in_seconds filter

* do not assign `duration_filter` lambda, use a def

* log untransliterated text as well

* fix base model for arabic

* fix duration filter when target_sr is not set

* drop duration_in_seconds when unneeded

* script for wav2vec2-large-lv60-timit-asr

* fix for "tha" in arabic corpus (huggingface#10581)

* adding more options to work with common_voice

* PR feedback (huggingface#10581)

* small README change