Zipformer recipe for ReazonSpeech #1611

Triplecq · 2024-05-02T00:32:03Z

ReazonSpeech is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.

The dataset is available on Hugging Face. For more details, please visit:

This recipe is mostly based on egs/csj, but tweaked to the point that it can be run with the ReazonSpeech corpus. That being said, there are some big caveats:

* Currently the model quality is not very good. Actually, it is very bad. I trained a model with the 1000h corpus, and it resulted in >80% CER on JSUT.
* The core issue seems to be that Zipformer is prone to ignoring utterances as silent segments. It often produces an empty hypothesis even though the audio actually contains human voice.
* This issue has already been reported upstream and was not fully resolved as of Dec 2023.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
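The empty-hypothesis problem described above is the kind of failure a blank penalty is meant to counter, and the commit history of this PR includes "add blank penalty in decoding script". The sketch below only illustrates the idea and is not the recipe's decoding code: it assumes the blank symbol has token ID 0, it replaces the real transducer search with a simplified frame-by-frame argmax, and the function name and toy scores are hypothetical.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: the blank symbol is token 0


def greedy_decode(
    frame_logits: Sequence[Sequence[float]], blank_penalty: float = 0.0
) -> List[int]:
    """Pick the best token per frame after penalizing the blank score.

    Simplified illustration: one decision per frame, blanks are dropped.
    """
    hyp: List[int] = []
    for logits in frame_logits:
        scores = list(logits)
        # Lower the blank score so non-blank tokens win close calls.
        scores[BLANK_ID] -= blank_penalty
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK_ID:
            hyp.append(best)
    return hyp


# Toy scores (hypothetical): blank narrowly wins the first two frames.
frames = [
    [2.0, 1.5, 0.1],
    [3.0, 0.2, 2.6],
    [4.0, 0.1, 0.2],
]

print(greedy_decode(frames))                     # empty hypothesis
print(greedy_decode(frames, blank_penalty=1.0))  # tokens 1 and 2 are emitted
```

With no penalty every frame decodes to blank and the hypothesis is empty. With a penalty of 1.0 the first two frames flip to non-blank tokens, while the third, where blank wins by a wide margin, stays blank.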


danpovey · 2024-05-02T05:01:47Z

There are quite a few changes outside the directory you are adding. You might want to remove them, as they are potential barriers to merging. If there's anything outside that directory you believe we should change, it can go in a separate PR.

Triplecq · 2024-05-02T08:26:27Z

> There are quite a few changes not in the directory you are adding. You might want to remove those as they are potential barriers to merging it. If there's anything outside that directory you believe we should change, it can be a separate PR.

Thanks for your quick feedback during the holiday! I will remove unrelated changes and get back to you soon.

Triplecq · 2024-05-02T10:25:02Z

I've already removed those unrelated changes. It's ready for review now. Please let me know if you have any questions or comments. Thank you!

pzelasko · 2024-05-02T13:56:39Z

I noticed that you have a `lhotse prepare reazonspeech` command in data prep. Do you intend to submit a PR to Lhotse as well?

Triplecq · 2024-05-02T16:15:41Z

> I noticed that you have a `lhotse prepare reazonspeech` command in data prep. Do you intend to submit a PR to Lhotse as well?

Thanks for the note. Sure, we're cleaning up the scripts and will submit a PR to Lhotse soon. :)

Triplecq · 2024-05-02T16:57:03Z

I just submitted a PR to Lhotse as well: lhotse-speech/lhotse#1330
Both PRs are ready for review. Thank you!

yfyeung · 2024-06-07T07:59:10Z

Hi, may I ask a few questions?

What are the main differences in quality and coverage between the small, medium, large, and all sets?
Which configuration (large, all, or small+medium+large) yields the best performance?

Thanks for your assistance.

Triplecq · 2024-06-07T23:32:22Z

Hi @yfyeung

Thank you for your interest and questions.

As far as I know, the various partitions only differ in their size and hours, as listed in the table on the Hugging Face page. (@fujimotos san could you please confirm this or correct me if I am wrong? Thank you!)

Here is a comparison of different partitions:

| Model Name | Model Size | In-Distribution CER | JSUT CER | CommonVoice CER | TEDx CER |
|---|---|---|---|---|---|
| zipformer-L (medium) | 155.92 M | 10.31 | 16.52 | 12.8 | 28.8 |
| zipformer-L (large) | 157.24 M | 6.19 | 10.35 | 9.36 | 24.23 |
| zipformer-L (all) | 159.34 M | 4.2 (epoch 39 avg 7) | 6.62 (epoch 39 avg 2) | 7.76 (epoch 39 avg 2) | 17.81 (epoch 39 avg 10) |
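For readers less familiar with the metric, character error rate (CER) is conventionally the character-level edit distance between hypothesis and reference, divided by the total number of reference characters. The following is a minimal sketch of that standard definition, not the scoring code used for the numbers above. The function names are hypothetical, and any text normalization applied before scoring (punctuation, spacing, script conversion) is not described in this thread and is left out.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_chars


# One substitution and one deletion against a 7-character reference.
print(round(cer(["今日は晴れです"], ["今日は雨です"]), 2))
# An empty hypothesis counts every reference character as an error.
print(cer(["今日は晴れです"], [""]))
```

Because the denominator is the reference length, a system that frequently emits empty hypotheses is charged one deletion per reference character on those utterances.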

P.S. With this larger Zipformer configuration, we suggest using more than 300 hours of data. We have not tried combining small + medium + large, but I assume the performance is essentially determined by the number of hours of training data.

I hope this helps. Feel free to let me know if you have any other questions. Good luck and have fun with this recipe. :)

fujimotos · 2024-06-08T04:58:16Z

> What are the main differences in quality and coverage between the small, medium, large, and all sets?

The only difference is their dataset sizes. Check out this table:

| Name | Size | Hours |
|---|---|---|
| `tiny` | 600MB | 8.5 hours |
| `small` | 6GB | 100 hours |
| `medium` | 65GB | 1000 hours |
| `large` | 330GB | 5000 hours |
| `all` | 2.3TB | 35000 hours |

> Which configuration (large, all, or small+medium+large) yields the best performance?

Use `all` for the best performance. The other splits (`tiny`/`small`/`medium`/`large`) are subsets of the `all` set.

Note: In case there is some confusion, the relationship of those sets is:

tiny ⊆ small ⊆ medium ⊆ large ⊆ all
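Since the splits are nested, choosing one comes down to hours and disk space. As a rough sketch, the helper below encodes the hours from the table and returns the smallest split that exceeds a given threshold, for example the "more than 300 hours" suggested earlier in this thread for the larger Zipformer setting. The dictionary and function name are hypothetical and are not part of the recipe.

```python
# Hours per split, copied from the table above.
SPLIT_HOURS = {
    "tiny": 8.5,
    "small": 100,
    "medium": 1000,
    "large": 5000,
    "all": 35000,
}


def smallest_split_over(min_hours: float) -> str:
    """Return the smallest split with strictly more than `min_hours` of audio."""
    for name, hours in sorted(SPLIT_HOURS.items(), key=lambda kv: kv[1]):
        if hours > min_hours:
            return name
    raise ValueError(f"no split has more than {min_hours} hours")


print(smallest_split_over(300))   # medium
print(smallest_split_over(5000))  # all
```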

JinZr

I think this one is ready to be merged, thank you for your PR!

csukuangfj · 2024-06-13T06:42:58Z

Please upload links to pre-trained models in a separate PR.

yujinqiu · 2024-06-13T09:51:21Z

@Triplecq @fujimotos
Is the model ready to share?

> Please upload links to pre-trained models in a separate PR.

Triplecq · 2024-06-14T14:43:58Z

@yujinqiu Thanks for your patience! We just completed another validation test on JSUT-book before the release. I will submit a separate PR and update you once we release the model.

yuyun2000 · 2024-06-18T09:12:17Z

This may be the world's number one Japanese language recognition model. If you could create a medium stream version of the model, it would be the number one in the universe!

* Add first cut at ReazonSpeech recipe

This recipe is mostly based on egs/csj, but tweaked to the point that can be run with ReazonSpeech corpus.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

---------

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Chen <qc@KDM00.cm.cluster>
Co-authored-by: root <root@KDA01.cm.cluster>

sangeet2020 · 2024-10-07T10:32:28Z

Hi @Triplecq
I was wondering whether the model is available on Hugging Face?
Thanks

yuyun2000 · 2024-10-08T01:21:34Z

He has shared the weights. You can find the Japanese model in the documentation.

Triplecq and others added 30 commits October 3, 2023 09:43

test icefall with yesno

a93aece

Merge branch 'master' of github.com:Triplecq/icefall

26ee4c3

Merge latest commit 'b0f70c9' on k2-fsa/icefall

16c02cf

I needed this in order to pull unreleased fixes. The last tagged version was too old (dated back in Jul 2023), and not compatible with recent lhotse releases.

Merge branch 'k2-fsa:master' into master

a82e001

Merge tag 'rs-experiment' of kdm00:/mnt/syno128/volume1/fujimotos/git…

abbee87

…/icefall Experimental version for ReazonSpeech

Zipformer recipe

2436597

init zipformer recipe

af87726

Add pruned_transducer_stateless2 from reazonspeech branch

8eae6ec

customize tranning script for rs

5e9a171

restore

1e6fe2e

customized recipes for reazonspeech

b1de6f2

customized recipes for rs

dc2d531

Merge branch 'master' of github.com:Triplecq/icefall

819db8f

Merge branch 'master' into rs

ced8a53

decrease learning-rate to solve the error: RuntimeError: grad_scale i…

42c152f

…s too small, exiting: 5.820766091346741e-11

traning script completed

04fa9e3

customize decoding script

7b6a897

comment out params related to the chunk size

77178c6

all combinations of epochs and avgs

a8e9dc2

add blank penalty in decoding script

f35fa8a

validation scripts

d864da4

prepare for 1000h dataset

5d94a19

complete exp on zipformer-L

860a6b2

validation test

03e8cfa

update graph

456241b

complete validation

5e7db1a

delete graph

baf6ebb

update graph

1e25c96

update graph

3b36a67

root and others added 5 commits May 2, 2024 07:03

remove unnecessary files

1050455

remove outdated recipes

45a1225

Update README.md

d61b739

format files with isort to meet style guidelines

0925a0c

format files with isort to meet style guidelines

97c9311

root added 3 commits May 2, 2024 19:13

remove unrelated changes

193470c

add back necessary docs

8edd9bd

remove unrelated changes

f8707d7

Triplecq added 3 commits May 19, 2024 19:01

Add download method to prepare.sh

2507918

Fix cuts file path

e39f56e

Change valid to dev for consistency

777f7a4

JinZr approved these changes Jun 13, 2024

JinZr merged commit 3b40d9b into k2-fsa:master Jun 13, 2024
203 checks passed

yujinqiu mentioned this pull request Jun 26, 2024

Add Streaming Zipformer-Transducer recipe for KsponSpeech #1651

Merged
