Zipformer recipe for ReazonSpeech #1611
Conversation
I needed this in order to pull in unreleased fixes. The last tagged version was too old (dating back to Jul 2023) and not compatible with recent lhotse releases.
This recipe is mostly based on egs/csj, but tweaked to the point that it can be run with the ReazonSpeech corpus. That being said, there are some big caveats:

* Currently the model quality is not very good. Actually, it is very bad. I trained a model with a 1000h corpus, and it resulted in >80% CER on JSUT.
* The core issue seems to be that Zipformer is prone to ignoring utterances as silent segments. It often produces an empty hypothesis even though the audio actually contains human voice.
* This issue has already been reported upstream and is not fully resolved yet as of Dec 2023.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
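As a side note on why empty hypotheses are so damaging to the score: with corpus-level CER (total edits divided by total reference characters), an empty hypothesis counts every reference character as a deletion, so each utterance dropped as "silence" contributes 100% error for its span. A minimal, self-contained sketch of that arithmetic (not the actual scoring code used by this recipe):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level character error rate: total edits / total ref chars."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / chars

# An empty hypothesis turns every reference character into a deletion,
# so misclassifying speech as silence alone pushes CER toward 100%.
print(cer(["こんにちは"], [""]))  # 1.0
```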
There are quite a few changes not in the directory you are adding. You might want to remove those, as they are potential barriers to merging. If there's anything outside that directory you believe we should change, it can be a separate PR.
Thanks for your quick feedback during the holiday! I will remove the unrelated changes and get back to you soon.
I've already removed those unrelated changes. It's ready for review now. Please let me know if you have any questions or comments. Thank you!
I noticed that you have a `lhotse prepare reazonspeech` command in the data prep; do you intend to submit a PR to Lhotse as well?
Thanks for the note. Sure, we're cleaning up the scripts and will submit a PR to Lhotse soon. :)
I just submitted a PR to Lhotse as well: lhotse-speech/lhotse#1330
Hi, may I ask a few questions? What are the main differences in quality and coverage between the different partitions? Thanks for your assistance.
Hi @yfyeung, thank you for your interest and questions. As far as I know, the various partitions only differ in their size and hours, as listed in the table on the Hugging Face page. (@fujimotos san, could you please confirm this or correct me if I am wrong? Thank you!) Here is a comparison of the different partitions:
P.S. With this larger setting of the Zipformer model, we suggest using data with more than 300 hours. We have not tried the combination of … . I hope this helps. Feel free to let me know if you have any other questions. Good luck and have fun with this recipe. :)
The only difference is their dataset sizes. Check out this table:
Use … Note: In case there is some confusion, the relationship of those sets is:
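The diagram that followed this comment appears to have been lost in extraction. Assuming the partitions form a strictly nested chain (an assumption consistent with the reply above that they "only differ in their size and hours", not something this thread confirms), the relationship can be sketched as:

```python
# Illustrative sketch only: the partition names come from the Hugging Face
# page, but the utterance IDs below are fake and the strict nesting is an
# assumption, not verified data.
def build_partitions():
    chain = ["tiny", "small", "medium", "large", "all"]
    partitions, pool = {}, set()
    for i, name in enumerate(chain):
        # Each partition extends the previous one with additional utterances.
        pool = pool | {f"utt-{name}-{k}" for k in range(i + 1)}
        partitions[name] = set(pool)
    return partitions

p = build_partitions()
# Every smaller partition is contained in the next larger one.
assert p["tiny"] < p["small"] < p["medium"] < p["large"] < p["all"]
```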
I think this one is ready to be merged, thank you for your PR!
Please upload links to pre-trained models in a separate PR.
@Triplecq @fujimotos
@yujinqiu Thanks for your patience! We just completed another validation test on …
This may be the world's number one Japanese speech recognition model. If you could create a medium streaming version of the model, it would be number one in the universe!
* Add first cut at ReazonSpeech recipe

This recipe is mostly based on egs/csj, but tweaked to the point that it can be run with the ReazonSpeech corpus.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
---------
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Fujimoto Seiji <fujimoto@ceptord.net>
Co-authored-by: Chen <qc@KDM00.cm.cluster>
Co-authored-by: root <root@KDA01.cm.cluster>
Hi @Triplecq
He has shared the weights; you can find the Japanese model in the documentation.
ReazonSpeech is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.
The dataset is available on Hugging Face. For more details, please visit: