Errors while preparing Common Voice dataset #1531

Closed
ChrystianKacki opened this issue Mar 6, 2024 · 16 comments

@ChrystianKacki

ChrystianKacki commented Mar 6, 2024

I'm trying to build a zipformer model with the Common Voice dataset.
I'm using the prepare.sh script located in https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR.
I downloaded the data manually, because the download URL is now forbidden and requires submitting an e-mail address or being logged in.
When I execute the script, I get the following error at Stage 9: Prepare BPE based lang:

RuntimeError: Internal: src/trainer_interface.cc(662) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (500). Please set it to a value <= 6.

Traceback (most recent call last):
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 115, in <module>
    main()
  File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/train_bpe_model.py", line 93, in main
    spm.SentencePieceTrainer.train(
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1047, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1040, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
  File "/home/chrychoo/.local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 985, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)

From the traceback, I think this occurs because the vocab_sizes variable at the beginning of prepare.sh is set to 500 and is then passed to the train_bpe_model.py script. The SentencePieceTrainer.train() method of the sentencepiece library then throws this error.
How should I deal with this? Can I simply change the value of vocab_sizes to 6 in prepare.sh?
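A vocabulary limit this small usually suggests that the training text itself is nearly empty, so it may be worth inspecting the transcript before changing vocab_sizes. The following is a minimal sketch of such a check, not part of the icefall recipe; the function name summarize_transcript is hypothetical, and the sample input is the single word "pl" that the transcript is reported to contain later in this thread.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_transcript(lines: Iterable[str]) -> Dict[str, int]:
    """Count non-empty lines, distinct words, and distinct non-space characters."""
    words = Counter()
    chars = set()
    num_lines = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        num_lines += 1
        words.update(line.split())
        chars.update(line.replace(" ", ""))
    return {
        "lines": num_lines,
        "unique_words": len(words),
        "unique_chars": len(chars),
    }


# Hypothetical input: a transcript holding only the word "pl".
stats = summarize_transcript(["pl"])
print(stats)  # {'lines': 1, 'unique_words': 1, 'unique_chars': 2}
```

To run it on a real file, pass an open file handle, for example the transcript_words.txt file under the recipe's language directory. Counts this low would point to a problem in the earlier text preprocessing rather than in the vocabulary size setting.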

@JinZr
Collaborator

JinZr commented Mar 6, 2024 via email

@ChrystianKacki
Author

@JinZr Do you mean the text in the file $lang_dir/transcript_words.txt?

Looks like there’s an issue with the text you use to train the BPE model. Could you show us what the text looks like?

@JinZr
Collaborator

JinZr commented Mar 6, 2024 via email

@ChrystianKacki
Author

@JinZr Please see the attached file: transcript_words.txt.
Should the text file contain something, or can it be empty?
Right now it contains only one word: "pl".
I should mention that I changed the Common Voice dataset in prepare.sh from French to Polish.
I modified the variables at the beginning of the script to:

release=cv-corpus-16.1-2023-12-06
lang=pl

@JinZr
Collaborator

JinZr commented Mar 6, 2024 via email

@ChrystianKacki
Author

Yes, you are right. I've found a solution.

It could be helpful for those who want to adapt the Common Voice recipe to languages other than English and French.

In prepare.sh, Stage 3: Preprocess CommonVoice manifest runs the script local/preprocess_commonvoice.py.
In the normalize_text() function of this script, the following code should be added for the language in use:

if language == "pl":
    return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
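To see what this branch does in isolation, here is a small self-contained sketch. The wrapper name normalize_polish is hypothetical and exists only for illustration; the regular expression is the one from the snippet above. It removes every character that is not a Latin letter, a Polish diacritic letter, an apostrophe, or a space, and then upper-cases the result.

```python
import re

# Characters to keep: ASCII letters, Polish diacritic letters, apostrophe, space.
_POLISH_KEEP = re.compile(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]")


def normalize_polish(utt: str) -> str:
    """Strip characters outside the kept set, then upper-case the result."""
    return _POLISH_KEEP.sub("", utt).upper()


print(normalize_polish("Cześć, świecie!"))  # CZEŚĆ ŚWIECIE
```

Note that digits and punctuation are deleted rather than replaced by a space, so removing a token that stands between two spaces leaves a double space behind.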

@ChrystianKacki ChrystianKacki changed the title Error while preparing Common Voice dataset Errors while preparing Common Voice dataset Mar 6, 2024
@JinZr
Collaborator

JinZr commented Mar 6, 2024 via email

@ChrystianKacki
Author

ChrystianKacki commented Mar 6, 2024

There are two more errors, in Stage 11: Compile HLG and Stage 12: Compile LG:

  1. FileNotFoundError: [Errno 2] No such file or directory: 'data/lm/G_3_gram.fst.txt'
    Traceback (most recent call last):
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_hlg.py", line 171, in <module>
        main()
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_hlg.py", line 161, in main
        HLG = compile_HLG(lang_dir, args.lm)
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_hlg.py", line 85, in compile_HLG
        with open(f"data/lm/{lm}.fst.txt") as f:

  2. FileNotFoundError: [Errno 2] No such file or directory: 'data/lm/G_3_gram.fst.txt'
    Traceback (most recent call last):
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_lg.py", line 151, in <module>
        main()
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_lg.py", line 141, in main
        LG = compile_LG(lang_dir, args.lm)
      File "/mnt/c/Users/chrys/IdeaProjects/sherpa-onnx-model/model-common_voice/./local/compile_lg.py", line 79, in compile_LG
        with open(f"data/lm/{lm}.fst.txt") as f:

To fix these errors, modify the scripts local/compile_hlg.py and local/compile_lg.py.
In the functions compile_HLG() and compile_LG(), replace every path string beginning with "data/lm/..." with "{lang_dir}/lm/...".

The replacement code is the same in both scripts:

if Path(f"{lang_dir}/lm/{lm}.pt").is_file():
    logging.info(f"Loading pre-compiled {lm}")
    d = torch.load(f"{lang_dir}/lm/{lm}.pt")
    G = k2.Fsa.from_dict(d)
else:
    logging.info(f"Loading {lm}.fst.txt")
    with open(f"{lang_dir}/lm/{lm}.fst.txt") as f:
        G = k2.Fsa.from_openfst(f.read(), acceptor=False)
        torch.save(G.as_dict(), f"{lang_dir}/lm/{lm}.pt")
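The path change can also be checked on its own, without k2 or torch installed. The sketch below only builds the two file paths relative to lang_dir; the helper name lm_paths and the example directory data/pl/lang_bpe_500 are assumptions made for illustration and do not come from the recipe.

```python
from pathlib import Path
from typing import Tuple


def lm_paths(lang_dir: str, lm: str) -> Tuple[Path, Path]:
    """Return the (pre-compiled .pt, OpenFst text) paths under <lang_dir>/lm."""
    lm_dir = Path(lang_dir) / "lm"
    return lm_dir / f"{lm}.pt", lm_dir / f"{lm}.fst.txt"


# Hypothetical language directory; substitute the one your prepare.sh uses.
pt_path, fst_path = lm_paths("data/pl/lang_bpe_500", "G_3_gram")
print(pt_path.as_posix())   # data/pl/lang_bpe_500/lm/G_3_gram.pt
print(fst_path.as_posix())  # data/pl/lang_bpe_500/lm/G_3_gram.fst.txt
```

Calling fst_path.is_file() before running stages 11 and 12 would show whether the language model was actually written to the per-language directory.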

@JinZr
Collaborator

JinZr commented Mar 6, 2024

Hi, do you mind checking PR #1532 to see if it fixes the text normalization issue?

thanks!

@ChrystianKacki
Author

I've looked at #1532, and the code from it works in my environment.
prepare.sh now runs cleanly through stage 9.

There are two more errors later, in stages 11 and 12 of the script.
Please see the comment above, which describes how to fix them.

@JinZr
Collaborator

JinZr commented Mar 7, 2024 via email

@ChrystianKacki
Author

ChrystianKacki commented Mar 7, 2024

Thanks! Happy to contribute :-)

Best Regards
Chrystian

@ChrystianKacki
Author

I've updated the comment about the two new errors with the source code to replace.

JinZr added a commit to JinZr/icefall that referenced this issue Mar 7, 2024
@JinZr
Collaborator

JinZr commented Mar 7, 2024 via email

@ChrystianKacki
Author

ChrystianKacki commented Mar 7, 2024

I checked both scripts and they work correctly.
prepare.sh now runs through all stages without any errors.

@JinZr
Collaborator

JinZr commented Mar 7, 2024 via email

JinZr added a commit that referenced this issue Mar 8, 2024
* init commit

* fix for issue #1531

* minor fixes
@JinZr JinZr closed this as completed Mar 8, 2024