Errors while preparing Common Voice dataset #1531
Comments
Looks like there’s an issue with the text you use to train the BPE model, could you show us what the text looks like?
@JinZr Do you mean the text in the file $lang_dir/transcript_words.txt?
Yes
@JinZr Please see the attached file: transcript_words.txt (https://github.com/k2-fsa/icefall/files/14510019/transcript_words.txt).
Should the text file contain something, or can it be empty? Now it contains only one word: "pl".
I have to say that I've changed the Common Voice dataset in the prepare.sh script from French to Polish. I modified the variables at the beginning of the script to:
release=cv-corpus-16.1-2023-12-06 lang=pl
Then there must be something wrong with the data-processing part; this file is supposed to contain all the transcripts of the dataset.
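As a quick sanity check, the file can be inspected along these lines (a minimal sketch; the path is an assumption and should be replaced with the $lang_dir that prepare.sh actually uses):

```python
# Minimal sanity check (sketch): transcript_words.txt should hold one transcript
# per line. The path below is an assumption -- use the $lang_dir from prepare.sh.
from pathlib import Path

transcript = Path("data/lang_bpe_500/transcript_words.txt")
lines = transcript.read_text(encoding="utf-8").splitlines()
words = {w for line in lines for w in line.split()}

print(f"{len(lines)} transcript lines, {len(words)} distinct words")
# A full Common Voice split should yield thousands of lines; a single word, as
# reported above, means the normalization step discarded nearly everything.
```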
Yes, you are right. I've found a solution. It could be helpful for those who want to adapt the Common Voice recipe to languages other than English and French.
In prepare.sh, at Stage 3: Preprocess CommonVoice manifest, the script local/preprocess_commonvoice.py is executed. In its normalize_text() method the following code should be added, according to the language used:
if language == "pl": return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
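In context, the extended function might look roughly like this (a sketch only; the signature and the non-Polish fallback are assumptions, not the actual icefall code):

```python
# Sketch of the idea (not the actual icefall code): give normalize_text() a
# per-language branch so that language-specific letters survive normalization.
import re


def normalize_text(utt: str, language: str) -> str:
    if language == "pl":
        # Keep Polish letters (including diacritics) and apostrophes, then uppercase.
        return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
    # Assumed fallback for other languages: keep basic Latin letters and apostrophes.
    return re.sub(r"[^a-zA-Z' ]", "", utt).upper()
```

PR #1532, discussed below, implements the per-language handling in the recipe itself.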
Thank you so much! This would be very useful for the recipe; let me look into it tomorrow!
There are also another 2 errors, in stages 11 and 12. To fix these errors you need to modify the scripts for those stages; the code to replace in the scripts was added to this comment in a later edit.
Hi, do you mind checking this PR #1532 and seeing if it fixes the text norm issue? Thanks!
I've seen the commit in #1532, and the code from it works OK in my environment. So now the script prepare.sh goes cleanly through stage 9. There are also another two errors later, in stages 11 and 12 of the script. Please see the comment above, which describes how to fix them.
Thanks! I'll merge #1532 first and fix the other two with a separate commit.
Best Regards,
Jin
Thanks! Happy to contribute :-) Best Regards
I've updated the comment about the two new errors with the source code to replace.
Thank you! Please check the updated scripts in PR #1534 and see if this fixes the problem.
I checked both scripts and they work correctly. So now prepare.sh goes through all stages without any errors.
Thanks! My colleagues will do the code review tomorrow and merge the PR. 🎉🎉
Best Regards,
Jin
* init commit
* fix for issue #1531
* minor fixes
I'm trying to build a zipformer model with the Common Voice dataset.
I'm using the script prepare.sh, located in https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR. I downloaded the data manually, because the URL the script points to is now forbidden and requires submitting an e-mail address or being logged in.
When I execute the script, at Stage 9: Prepare BPE based lang I get the following error:

RuntimeError: Internal: src/trainer_interface.cc(662) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (500). Please set it to a value <= 6.

Traceback (most recent call last):
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 115, in <module>
    main()
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 93, in main
    spm.SentencePieceTrainer.train(
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1047, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1040, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 985, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)

From the traceback, I think this occurs because the vocab_sizes variable at the beginning of the prepare.sh script is set to 500 and is then passed to the train_bpe_model.py script. From there, the code in the SentencePieceTrainer.train() method of the sentencepiece library throws this error.
How should I deal with this? Can I simply change the vocab_sizes variable value to 6 in the prepare.sh script?
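To make the failure mode concrete, here is a minimal reproduction (a sketch only, not part of the recipe): SentencePiece cannot build a vocabulary larger than the number of pieces it can extract from its training text, so an almost-empty transcript_words.txt makes vocab_size=500 impossible, and the fix that resolved this issue was repairing the transcript text rather than lowering vocab_sizes.

```python
# Minimal reproduction sketch (not part of the recipe): training SentencePiece on
# a near-empty text file raises the same "Vocabulary size too high" error.
import sentencepiece as spm

with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
    f.write("pl\n")  # mimics a transcript_words.txt that contains a single word

spm.SentencePieceTrainer.train(
    input="tiny_corpus.txt",
    model_prefix="debug_bpe",
    vocab_size=500,  # same value as vocab_sizes in prepare.sh
)
# RuntimeError: Internal: ... Vocabulary size too high (500).
# Please set it to a value <= ...
```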