
Zipformer recipe for ReazonSpeech #1611

Merged: 51 commits, Jun 13, 2024

Conversation

@Triplecq (Contributor) commented May 2, 2024

ReazonSpeech is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.

The dataset is available on Hugging Face. For more details, please visit:
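For a quick look at the data, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id `reazon-research/reazonspeech` and the config names (`tiny`, `small`, ...) are assumptions based on the Hugging Face page, not something this PR defines; streaming mode avoids downloading the full 2.3 TB corpus.

```python
def stream_reazonspeech(config: str = "tiny"):
    """Return a streaming iterator over the requested ReazonSpeech config.

    Assumptions: dataset id "reazon-research/reazonspeech" and config
    names match the Hugging Face page; requires `pip install datasets`
    and network access.
    """
    from datasets import load_dataset  # imported lazily; optional dependency
    return load_dataset(
        "reazon-research/reazonspeech",
        config,
        split="train",
        streaming=True,          # iterate without downloading the whole split
        trust_remote_code=True,  # the dataset ships a loading script
    )
```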

Triplecq and others added 30 commits October 3, 2023 09:43
I needed this in order to pull unreleased fixes. The last tagged version
was too old (dating back to Jul 2023) and not compatible with recent
lhotse releases.
This recipe is mostly based on egs/csj, but tweaked so that it can be
run with the ReazonSpeech corpus.

That being said, there are some big caveats:

 * Currently the model quality is not very good. Actually, it is very
   bad. I trained a model with 1000h corpus, and it resulted in >80%
   CER on JSUT.

 * The core issue seems to be that Zipformer is prone to ignoring
   utterances as silent segments. It often produces an empty hypothesis
   even though the audio actually contains human voice.

 * This issue has already been reported upstream and is not fully
   resolved as of Dec 2023.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
…/icefall

Experimental version for ReazonSpeech
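The silent-utterance caveat in the commit message above is easy to quantify during decoding. A small helper (hypothetical, not part of the recipe) that reports what fraction of decoded hypotheses came back blank:

```python
def empty_hypothesis_rate(hyps):
    """Fraction of decoded hypotheses that are empty or whitespace-only.

    A high rate suggests the model is treating voiced utterances as
    silence, as described in the caveat above.
    """
    if not hyps:
        raise ValueError("no hypotheses given")
    empty = sum(1 for h in hyps if not h.strip())
    return empty / len(hyps)

# Example: 2 of 4 hypotheses are blank.
print(empty_hypothesis_rate(["こんにちは", "", "   ", "天気"]))  # 0.5
```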
@danpovey (Collaborator) commented May 2, 2024

There are quite a few changes not in the directory you are adding. You might want to remove those, as they are potential barriers to merging it. If there's anything outside that directory you believe we should change, it can be a separate PR.

@Triplecq (Contributor, Author) commented May 2, 2024

> There are quite a few changes not in the directory you are adding. You might want to remove those, as they are potential barriers to merging it. If there's anything outside that directory you believe we should change, it can be a separate PR.

Thanks for your quick feedback during the holiday! I will remove unrelated changes and get back to you soon.

@Triplecq (Contributor, Author) commented May 2, 2024

I've already removed those unrelated changes. It's ready for review now. Please let me know if you have any questions or comments. Thank you!

@pzelasko (Collaborator) commented May 2, 2024

I noticed that you have a `lhotse prepare reazonspeech` command in data prep; do you intend to submit a PR to Lhotse as well?

@Triplecq (Contributor, Author) commented May 2, 2024

> I noticed that you have a `lhotse prepare reazonspeech` command in data prep; do you intend to submit a PR to Lhotse as well?

Thanks for the note. Sure, we're cleaning up the scripts and will submit a PR to Lhotse soon. :)

@Triplecq (Contributor, Author) commented May 2, 2024

I just submitted a PR to Lhotse as well: lhotse-speech/lhotse#1330
Both PRs are ready for review. Thank you!

@yfyeung (Collaborator) commented Jun 7, 2024

Hi, may I ask a couple of questions?

- What are the main differences in quality and coverage between the small, medium, large, and all sets?
- Which configuration (large, all, or small+medium+large) yields the best performance?

Thanks for your assistance.

@Triplecq (Contributor, Author) commented Jun 7, 2024

Hi @yfyeung

Thank you for your interest and questions.

As far as I know, the various partitions only differ in their size and hours, as listed in the table on the Hugging Face page. (@fujimotos san could you please confirm this or correct me if I am wrong? Thank you!)

Here is a comparison of different partitions:

| Model Name | Model Size | In-Distribution CER | JSUT CER | CommonVoice CER | TEDx CER |
|---|---|---|---|---|---|
| zipformer-L (medium) | 155.92 M | 10.31 | 16.52 | 12.8 | 28.8 |
| zipformer-L (large) | 157.24 M | 6.19 | 10.35 | 9.36 | 24.23 |
| zipformer-L (all) | 159.34 M | 4.2 (epoch 39, avg 7) | 6.62 (epoch 39, avg 2) | 7.76 (epoch 39, avg 2) | 17.81 (epoch 39, avg 10) |

P.S. With this larger Zipformer configuration, we suggest using more than 300 hours of data. We have not tried the combination of small + medium + large together, but I assume the performance is basically determined by the number of hours of data.

I hope this helps. Feel free to let me know if you have any other questions. Good luck and have fun with this recipe. :)

@fujimotos (Contributor) commented:

> What are the main differences in quality and coverage between the small, medium, large, and all sets?

The only difference is their dataset sizes. Check out this table:

| Name | Size | Hours |
|---|---|---|
| tiny | 600 MB | 8.5 |
| small | 6 GB | 100 |
| medium | 65 GB | 1,000 |
| large | 330 GB | 5,000 |
| all | 2.3 TB | 35,000 |

> Which configuration (large, all, or small+medium+large) yields the best performance?

Use all for the best performance. The other splits (tiny/small/medium/large) are subsets of the all set.

Note: In case there is some confusion, the relationship of those sets is:

tiny ⊆ small ⊆ medium ⊆ large ⊆ all
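To make the chain concrete, here is a small sketch (split hours transcribed from the table above) that checks the splits really do form a strictly increasing sequence, consistent with the subset relationship:

```python
# ReazonSpeech configs with approximate hours, from the table above.
SPLITS = [
    ("tiny", 8.5),
    ("small", 100),
    ("medium", 1_000),
    ("large", 5_000),
    ("all", 35_000),
]

def is_increasing_chain(splits):
    """True if each split is strictly larger (in hours) than the previous,
    consistent with tiny ⊆ small ⊆ medium ⊆ large ⊆ all."""
    hours = [h for _, h in splits]
    return all(a < b for a, b in zip(hours, hours[1:]))

print(is_increasing_chain(SPLITS))  # True
```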

@JinZr (Collaborator) left a comment:
I think this one is ready to be merged, thank you for your PR!

@JinZr JinZr merged commit 3b40d9b into k2-fsa:master Jun 13, 2024
203 checks passed
@csukuangfj (Collaborator) commented:

Please upload links to pre-trained models in a separate PR.

@yujinqiu commented:

@Triplecq @fujimotos
Is the model ready to share?

> Please upload links to pre-trained models in a separate PR.

@Triplecq (Contributor, Author) commented:

@yujinqiu Thanks for your patience! We just completed another validation test on JSUT-book before the release. I will submit a separate PR and update you once we release the model.

@yuyun2000 commented Jun 18, 2024

This may be the world's number one Japanese speech recognition model. If you could create a streaming version of the medium model, it would be number one in the universe!

yfyeung pushed a commit to yfyeung/icefall that referenced this pull request Aug 9, 2024
* Add first cut at ReazonSpeech recipe

This recipe is mostly based on egs/csj, but tweaked so that it can be
run with the ReazonSpeech corpus.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

---------

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Chen <qc@KDM00.cm.cluster>
Co-authored-by: root <root@KDA01.cm.cluster>
@sangeet2020 commented:
Hi @Triplecq, I was wondering if the model is available for sharing on HF? Thanks!

@yuyun2000 commented:

He has shared the weights; you can find the Japanese model in the documentation.
