
Whisper large fine-tuning on wenetspeech, multi-hans-zh #1483

Merged
50 commits merged into k2-fsa:master on Mar 7, 2024

Conversation

@yuekaizhang (Collaborator) commented Jan 31, 2024

This PR adds:

  • Whisper fine-tuning recipe on wenetspeech, multi-hans-zh

Results:

| Model | SpeechIO 001-026 Avg WER | Comment |
| --- | --- | --- |
| zrjin/icefall-asr-multi-zh-hans-zipformer-ctc-2023-10-24 | 7.48% | Zipformer trained on 14k hours of ZH data |
| whisper-large-v2-wenetspeech | 8.01% | large-v2 fine-tuned on 10k hours of WenetSpeech; suffers from deletion errors (see wenet-e2e/WenetSpeech#54) |
| whisper-large-v2-wenetspeech + zipformer | 6.93% | Uses Zipformer to reduce deletion errors (see here) |

TODOs for follow-up PRs:

  • wenetspeech/whisper RESULTS.md update
  • multi-hans-zh/whisper RESULTS.md update

@yuekaizhang (Collaborator, Author)

> Not sure about the deletion errors, but a key aspect of Whisper is that it is multi-objective: it does both transcription and translation. The translation task is not just for fun; it helps the model factor out grammar and semantics, and it improves ASR accuracy.
>
> If you want to get good numbers on a high-resource language, you also need a multi-objective setup, at least a translation task, and ideally a speaker-ID objective too.

Good point. It would be great if you know of any experimental results or papers that use a multi-objective setup to fine-tune Whisper.
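To make the suggestion above concrete, here is a toy sketch of what a joint transcription + translation fine-tuning objective could look like. This is purely illustrative and is not what this PR implements: the `model(audio, task=...)` call, the batch fields, and the `0.3` weight are all hypothetical, not real Whisper or icefall APIs.

```python
import torch.nn.functional as F

def multi_objective_loss(model, batch, lambda_translate=0.3):
    """Hypothetical joint objective: transcription loss plus a weighted
    translation loss, in the spirit of Whisper's multitask training."""
    # Cross-entropy on the transcription task.
    # `model` and its `task` argument are assumed interfaces, not a real API.
    asr_logits = model(batch["audio"], task="transcribe")  # (N, T, vocab)
    asr_loss = F.cross_entropy(
        asr_logits.transpose(1, 2), batch["transcript_tokens"]
    )
    # Cross-entropy on the translation task; assumes the batch also carries
    # paired translation labels, which WenetSpeech does not provide.
    mt_logits = model(batch["audio"], task="translate")
    mt_loss = F.cross_entropy(
        mt_logits.transpose(1, 2), batch["translation_tokens"]
    )
    # Fixed interpolation weight; in practice this would need tuning.
    return asr_loss + lambda_translate * mt_loss
```

The practical catch is the second branch: this only works if the fine-tuning corpus has paired translations, which is exactly why pointers to prior results or papers would help.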

@marcoyang1998 (Collaborator)

Have other people reported similar observations when fine-tuning Whisper on WenetSpeech?

Similarly, we experienced severe deletion errors when training Zipformer on WenetSpeech (see #1130); could this be a problem with the dataset?

@yuekaizhang changed the title from "[WIP] whisper large fine-tuning on wenetspeech, multi-hans-zh" to "Whisper large fine-tuning on wenetspeech, multi-hans-zh" on Mar 7, 2024
@yuekaizhang requested a review from JinZr on Mar 7, 2024 at 07:10
@JinZr (Collaborator) commented Mar 7, 2024

thanks! i'll look into it.

@yuekaizhang (Collaborator, Author)

> Have other people reported similar observations when fine-tuning Whisper on WenetSpeech?
>
> Similarly, we experienced severe deletion errors when training Zipformer on WenetSpeech (see #1130); could this be a problem with the dataset?

@marcoyang1998 You're correct, see wenet-e2e/WenetSpeech#54.

One solution is to retrain with the new labels provided in wenet-e2e/WenetSpeech#54, but for such colloquial speech there may be a better way to evaluate, such as down-weighting errors on modal particles.

When people use ASR to add subtitles to their videos, it would clearly be more helpful if the model automatically omitted these colloquial words.
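A toy sketch of that evaluation idea: a character-level error rate where edits involving Mandarin modal particles count at a reduced weight. The particle list, the `0.2` weight, and the character-level granularity are all illustrative assumptions; nothing like this is implemented in this PR.

```python
# Hypothetical set of modal particles to down-weight; illustrative only.
MODAL_PARTICLES = set("啊吧呢嘛啦呀哦嘞")

def weighted_cer(ref: str, hyp: str, particle_weight: float = 0.2) -> float:
    """Character-level error rate where edits touching a modal
    particle cost `particle_weight` instead of 1."""
    def cost(ch: str) -> float:
        return particle_weight if ch in MODAL_PARTICLES else 1.0

    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(ref[i - 1])      # deletions
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(hyp[j - 1])      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Substitutions involving a real character keep full cost.
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else max(
                cost(ref[i - 1]), cost(hyp[j - 1])
            )
            d[i][j] = min(
                d[i - 1][j] + cost(ref[i - 1]),       # delete ref char
                d[i][j - 1] + cost(hyp[j - 1]),       # insert hyp char
                d[i - 1][j - 1] + sub,                # substitute / match
            )
    # Normalize by the weighted reference length.
    return d[m][n] / max(sum(cost(c) for c in ref), 1e-9)
```

For example, `weighted_cer("他说啊这个好", "他说这个好")` charges the deleted 啊 only 0.2 instead of 1, so a model that drops filler particles is penalized far less than one that drops content characters.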

@JinZr (Collaborator) commented Mar 7, 2024

hi, thank you for your work!

i went through the pr and left a comment and a few modifications; if those look proper to you, i think this pr is ready to merge.

@JinZr (Collaborator) left a review comment:

LGTM, waiting for CI tests to be done

thanks!

@JinZr merged commit 5df24c1 into k2-fsa:master on Mar 7, 2024
108 checks passed