Errors while preparing Common Voice dataset #1531
Comments
Looks like there’s an issue with the text you use to train the BPE model, could you show us what the text looks like?
@JinZr Do you mean the text in the file $lang_dir/transcript_words.txt?
Yes
@JinZr Please see the attached file: transcript_words.txt (https://github.com/k2-fsa/icefall/files/14510019/transcript_words.txt).
Should the text file contain something, or can it be empty? Now it contains only one word: "pl".
I have to say that I've changed the Common Voice dataset in the prepare.sh script from French to Polish. I modified the variables at the beginning of the script to:
release=cv-corpus-16.1-2023-12-06 lang=pl
Then there must be something wrong with the data-processing part; this file is supposed to contain all the transcripts of the dataset.
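As a quick sanity check, the file can be inspected along these lines (a minimal sketch; the path is an assumption and should be replaced with the $lang_dir that prepare.sh actually uses):

```python
# Minimal sanity check (sketch): transcript_words.txt should hold one transcript
# per line. The path below is an assumption -- use the $lang_dir from prepare.sh.
from pathlib import Path

transcript = Path("data/lang_bpe_500/transcript_words.txt")
lines = transcript.read_text(encoding="utf-8").splitlines()
words = {w for line in lines for w in line.split()}

print(f"{len(lines)} transcript lines, {len(words)} distinct words")
# A full Common Voice split should yield thousands of lines; a single word, as
# reported above, means the normalization step discarded nearly everything.
```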
Yes, you are right. I've found a solution. It could be helpful for those who want to adapt the Common Voice recipe to languages other than English and French.
In prepare.sh, at Stage 3: Preprocess CommonVoice manifest, the script local/preprocess_commonvoice.py is executed. In its normalize_text() method the following code should be added, according to the language used:
if language == "pl": return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
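In context, the extended function might look roughly like this (a sketch only; the signature and the non-Polish fallback are assumptions, not the actual icefall code):

```python
# Sketch of the idea (not the actual icefall code): give normalize_text() a
# per-language branch so that language-specific letters survive normalization.
import re


def normalize_text(utt: str, language: str) -> str:
    if language == "pl":
        # Keep Polish letters (including diacritics) and apostrophes, then uppercase.
        return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
    # Assumed fallback for other languages: keep basic Latin letters and apostrophes.
    return re.sub(r"[^a-zA-Z' ]", "", utt).upper()
```

PR #1532, discussed below, implements the per-language handling in the recipe itself.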
Thank you so much! This would be very useful for the recipe; let me look into it tomorrow!
There are also another 2 errors, in stages 11 and 12. To fix these errors you need to modify the scripts for those stages; the code to replace in the scripts was added to this comment in a later edit.
Hi, do you mind checking this PR #1532 and seeing if it fixes the text norm issue? Thanks!
I've seen the commit in #1532, and the code from it works OK in my environment. So now the script prepare.sh goes cleanly through stage 9. There are also another two errors later, in stages 11 and 12 of the script. Please see the comment above, which describes how to fix them.
Thanks! I'll merge #1532 first and fix the other two with a separate commit.
Best Regards,
Jin
Thanks! Happy to contribute :-) Best Regards
I've updated the comment about the two new errors with the source code to replace.
Thank you! Please check the updated scripts in PR #1534 and see if this fixes the problem.
I checked both scripts and they work correctly. So now prepare.sh goes through all stages without any errors.
Thanks! My colleagues will do the code review tomorrow and merge the PR. 🎉🎉
Best Regards,
Jin
* init commit
* fix for issue #1531
* minor fixes
I'm trying to build a zipformer model with the Common Voice dataset.
I'm using the script prepare.sh, located in https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR. I downloaded the data manually, because the URL the script points to is now forbidden and requires submitting an e-mail address or being logged in.
When I execute the script, at Stage 9: Prepare BPE based lang I get the following error:

RuntimeError: Internal: src/trainer_interface.cc(662) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (500). Please set it to a value <= 6.

Traceback (most recent call last):
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 115, in <module>
    main()
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 93, in main
    spm.SentencePieceTrainer.train(
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1047, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1040, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 985, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)

From the traceback, I think this occurs because the vocab_sizes variable at the beginning of the prepare.sh script is set to 500 and is then passed to the train_bpe_model.py script. From there, the code in the SentencePieceTrainer.train() method of the sentencepiece library throws this error.
How should I deal with this? Can I simply change the vocab_sizes variable value to 6 in the prepare.sh script?
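To make the failure mode concrete, here is a minimal reproduction (a sketch only, not part of the recipe): SentencePiece cannot build a vocabulary larger than the number of pieces it can extract from its training text, so an almost-empty transcript_words.txt makes vocab_size=500 impossible, and the fix that resolved this issue was repairing the transcript text rather than lowering vocab_sizes.

```python
# Minimal reproduction sketch (not part of the recipe): training SentencePiece on
# a near-empty text file raises the same "Vocabulary size too high" error.
import sentencepiece as spm

with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
    f.write("pl\n")  # mimics a transcript_words.txt that contains a single word

spm.SentencePieceTrainer.train(
    input="tiny_corpus.txt",
    model_prefix="debug_bpe",
    vocab_size=500,  # same value as vocab_sizes in prepare.sh
)
# RuntimeError: Internal: ... Vocabulary size too high (500).
# Please set it to a value <= ...
```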