zipformer wenetspeech #1130

Merged: 27 commits merged into k2-fsa:master on Jun 26, 2023

Conversation

pkufool
Collaborator

@pkufool pkufool commented Jun 15, 2023

This is the wenetspeech recipe on the latest zipformer model (modeling with characters).

Non-streaming model

The training command (using the default medium-size model):

./zipformer/train.py \
  --world-size 6 \
  --num-epochs 12 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --lr-epochs 1.5 \
  --context-size 2 \
  --exp-dir zipformer/exp_L_context_2 \
  --causal 0 \
  --num-workers 8

Best results for each epoch

| Epoch | Greedy search (dev & net & meeting) | Modified beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| 4 | 7.83 & 8.86 & 13.73 | 7.75 & 8.81 & 13.67 | avg=1; blank-penalty=2 |
| 5 | 7.75 & 8.46 & 13.38 | 7.68 & 8.41 & 13.27 | avg=1; blank-penalty=2 |
| 6 | 7.72 & 8.19 & 13.16 | 7.62 & 8.14 & 13.06 | avg=1; blank-penalty=2 |
| 7 | 7.59 & 8.08 & 12.97 | 7.53 & 8.01 & 12.87 | avg=2; blank-penalty=2 |
| 8 | 7.68 & 7.87 & 12.96 | 7.61 & 7.81 & 12.88 | avg=1; blank-penalty=2 |
| 9 | 7.57 & 7.77 & 12.87 | 7.5 & 7.71 & 12.77 | avg=1; blank-penalty=2 |
| 10 | 7.45 & 7.7 & 12.69 | 7.39 & 7.63 & 12.59 | avg=2; blank-penalty=2 |
| 11 | 7.35 & 7.67 & 12.46 | 7.31 & 7.63 & 12.43 | avg=3; blank-penalty=2 |
| 12 | 7.36 & 7.65 & 12.43 | 7.32 & 7.61 & 12.35 | avg=4; blank-penalty=2 |
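
The avg value in the last column is read here as the number of most recent epoch checkpoints whose parameters are averaged before decoding. That reading is an assumption; the PR does not spell it out, and the real decoding script may average differently. A minimal sketch of plain parameter averaging, with dicts of floats standing in for model tensors and `average_params` and `select_checkpoints` as hypothetical helpers:

```python
from typing import Dict, List


def average_params(checkpoints: List[Dict[str, float]]) -> Dict[str, float]:
    """Element-wise mean of parameter dicts that share the same keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}


# Toy "checkpoints" for epochs 9..12 (made-up numbers, not real weights).
epoch_params = {
    9: {"w": 1.0, "b": 0.0},
    10: {"w": 2.0, "b": 1.0},
    11: {"w": 3.0, "b": 2.0},
    12: {"w": 4.0, "b": 3.0},
}


def select_checkpoints(epoch: int, avg: int) -> List[Dict[str, float]]:
    """Assumed meaning of avg: the last `avg` epochs ending at `epoch`."""
    return [epoch_params[e] for e in range(epoch - avg + 1, epoch + 1)]


# epoch=12 with avg=4 would combine epochs 9, 10, 11 and 12.
averaged = average_params(select_checkpoints(12, 4))
print(averaged)
```

Under this reading, avg=1 simply means decoding with a single epoch's checkpoint.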

The influence of blank-penalty (greedy search result at epoch 12)

| blank-penalty | Dev | Test-net | Test-meeting |
|---|---|---|---|
| 0 | 8.58 | 7.84 | 14.64 |
| 1 | 7.82 | 7.64 | 13.08 |
| 1.5 | 7.55 | 7.62 | 12.63 |
| 2 | 7.36 | 7.65 | 12.43 |
| 2.5 | 7.24 | 7.77 | 12.37 |
| 3 | 7.24 | 7.94 | 12.48 |
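
One way a blank penalty can act during greedy search is to subtract a constant from the blank score before taking the argmax, so that frames where blank wins only narrowly emit a token instead. The sketch below illustrates that idea only. It assumes blank is token 0, uses made-up per-frame scores, and leaves out the transducer decoder and joiner entirely; `greedy_frame_decisions` is a hypothetical name, not the recipe's code.

```python
from typing import List

BLANK_ID = 0  # assumption: blank is token 0


def greedy_frame_decisions(
    logits: List[List[float]], blank_penalty: float = 0.0
) -> List[int]:
    """Pick the best token per frame after penalising the blank score."""
    hyp = []
    for frame in logits:
        scores = list(frame)
        scores[BLANK_ID] -= blank_penalty  # make blank less attractive
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK_ID:
            hyp.append(best)
    return hyp


# Toy scores for 3 frames over the vocabulary {blank, token 1, token 2}.
toy_logits = [
    [2.0, 1.5, 0.1],  # blank narrowly wins without a penalty
    [0.2, 3.0, 0.1],  # token 1 clearly wins
    [4.0, 0.5, 1.0],  # blank clearly wins
]

no_penalty = greedy_frame_decisions(toy_logits, 0.0)
with_penalty = greedy_frame_decisions(toy_logits, 1.0)
print(no_penalty, with_penalty)
```

In this toy case the penalty recovers one extra token from the first frame. A penalty that is too large would start turning true blanks into insertions, which is consistent with the WER rising again at the high end of the table.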

Streaming model

The training command (using the default medium-size model):

./zipformer/train.py \
  --world-size 8 \
  --num-epochs 12 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --lr-epochs 1.5 \
  --context-size 2 \
  --exp-dir zipformer/exp_L_causal_context_2 \
  --causal 1 \
  --num-workers 8

Best results for each epoch (--chunk-size=16; --left-context-frames=128)

| Epoch | Greedy search (dev & net & meeting) | Modified beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| 6 | 9.14 & 10.75 & 18.15 | 8.79 & 10.54 & 17.64 | avg=1; blank-penalty=1.5 |
| 7 | 9.11 & 10.61 & 17.86 | 8.8 & 10.42 & 17.29 | avg=1; blank-penalty=1.5 |
| 8 | 8.89 & 10.32 & 17.44 | 8.59 & 10.09 & 16.9 | avg=1; blank-penalty=1.5 |
| 9 | 8.86 & 10.11 & 17.35 | 8.55 & 9.87 & 16.76 | avg=1; blank-penalty=1.5 |
| 10 | 8.66 & 10.0 & 16.94 | 8.39 & 9.83 & 16.47 | avg=2; blank-penalty=1.5 |
| 11 | 8.58 & 9.92 & 16.67 | 8.32 & 9.77 & 16.27 | avg=3; blank-penalty=1.5 |
| 12 | 8.45 & 9.89 & 16.46 | 8.21 & 9.77 & 16.07 | avg=4; blank-penalty=1.5 |

The influence of blank-penalty (greedy search result at epoch 12; --chunk-size=32; --left-context-frames=128)

| blank-penalty | Dev | Test-net | Test-meeting |
|---|---|---|---|
| 0 | 9.03 | 9.22 | 16.63 |
| 1 | 8.26 | 9.02 | 15.48 |
| 1.5 | 8.01 | 9.05 | 15.32 |
| 2 | 7.88 | 9.19 | 15.39 |
| 2.5 | 7.9 | 9.44 | 15.7 |
| 3 | 8.03 | 9.77 | 16.3 |

The decoding result for different latency (greedy search results)

--chunk-size=16; --left-context-frames=64

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 9.17 | 10.91 | 18.78 | avg=1; blank-penalty=1.5 |
| 7 | 9.12 | 10.77 | 18.48 | avg=1; blank-penalty=1.5 |
| 8 | 8.95 | 10.48 | 18.12 | avg=1; blank-penalty=1.5 |
| 9 | 8.92 | 10.28 | 18.02 | avg=1; blank-penalty=1.5 |
| 10 | 8.73 | 10.15 | 17.58 | avg=2; blank-penalty=1.5 |
| 11 | 8.68 | 10.08 | 17.37 | avg=3; blank-penalty=1.5 |
| 12 | 8.54 | 10.04 | 17.16 | avg=4; blank-penalty=1.5 |

--chunk-size=32; --left-context-frames=128

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 8.7 | 9.86 | 16.83 | avg=1; blank-penalty=1.5 |
| 7 | 8.71 | 9.7 | 16.6 | avg=1; blank-penalty=1.5 |
| 8 | 8.52 | 9.46 | 16.23 | avg=1; blank-penalty=1.5 |
| 9 | 8.46 | 9.29 | 16.17 | avg=1; blank-penalty=1.5 |
| 10 | 8.25 | 9.14 | 15.74 | avg=2; blank-penalty=1.5 |
| 11 | 8.15 | 9.08 | 15.52 | avg=3; blank-penalty=1.5 |
| 12 | 8.01 | 9.05 | 15.32 | avg=4; blank-penalty=1.5 |

--chunk-size=64; --left-context-frames=256

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 8.36 | 9.18 | 15.5 | avg=1; blank-penalty=1.5 |
| 7 | 8.36 | 9.05 | 15.32 | avg=1; blank-penalty=1.5 |
| 8 | 8.16 | 8.85 | 14.96 | avg=1; blank-penalty=1.5 |
| 9 | 8.14 | 8.64 | 14.89 | avg=1; blank-penalty=1.5 |
| 10 | 7.91 | 8.54 | 14.54 | avg=2; blank-penalty=1.5 |
| 11 | 7.82 | 8.49 | 14.31 | avg=3; blank-penalty=1.5 |
| 12 | 7.67 | 8.47 | 14.07 | avg=4; blank-penalty=1.5 |
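
To put the latency trade-off in one place, the epoch-12 greedy-search rows of the three tables above can be compared directly. The snippet only does arithmetic on the reported numbers; `relative_reduction` is a hypothetical helper name.

```python
# Greedy-search WER (%) at epoch 12, copied from the three tables above.
# Keys are (chunk_size, left_context_frames).
epoch12_wer = {
    (16, 64): {"Dev": 8.54, "Test-net": 10.04, "Test-meeting": 17.16},
    (32, 128): {"Dev": 8.01, "Test-net": 9.05, "Test-meeting": 15.32},
    (64, 256): {"Dev": 7.67, "Test-net": 8.47, "Test-meeting": 14.07},
}


def relative_reduction(baseline: float, new: float) -> float:
    """Relative WER reduction in percent when going from baseline to new."""
    return 100.0 * (baseline - new) / baseline


smallest = epoch12_wer[(16, 64)]
largest = epoch12_wer[(64, 256)]
reductions = {
    name: relative_reduction(smallest[name], largest[name]) for name in smallest
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
```

Going from the smallest to the largest chunk and left-context setting lowers WER by roughly 10% (Dev), 16% (Test-net) and 18% (Test-meeting) relative, at the cost of higher latency.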

@kobenaxie
Contributor

Hi @pkufool, could you talk about why 'blank penalty' can improve the accuracy?

@pkufool
Collaborator Author

pkufool commented Jun 19, 2023

> Hi @pkufool, could you talk about why 'blank penalty' can improve the accuracy?

We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

@pkufool
Collaborator Author

pkufool commented Jun 23, 2023

The best results:

| Type | Greedy (dev & net & meeting) | Beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| Non-streaming | 7.36 & 7.65 & 12.43 | 7.32 & 7.61 & 12.35 | --epoch=12 |
| Streaming | 8.45 & 9.89 & 16.46 | 8.21 & 9.77 & 16.07 | --epoch=12; --chunk-size=16; --left-context-frames=256 |
| Streaming | 8.0 & 9.0 & 15.11 | 7.84 & 8.94 & 14.92 | --epoch=12; --chunk-size=32; --left-context-frames=256 |

The model (Non-streaming): https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
The model (Streaming): https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615

Comparing with other open-sourced results:

| Toolkit | Dev | Test-Net | Test-Meeting | AISHELL |
|---|---|---|---|---|
| Kaldi | 9.07 | 12.83 | 24.72 | 5.41 |
| Espnet | 9.70 | 8.90 | 15.90 | 3.90 |
| Wenet | 8.88 | 9.70 | 15.59 | 4.61 |
| Next-gen Kaldi | 7.32 | 7.61 | 12.35 | 3.7 |

Comparing with our previous results (pruned_transducer_stateless5):

| Model | Type | Greedy | Beam Search | Setup |
|---|---|---|---|---|
| Reworked Conformer | Non-streaming | 8.22 & 9.03 & 14.54 | 8.17 & 9.04 & 14.44 | --epoch 4 |
| Zipformer | Non-streaming | 7.83 & 8.86 & 13.73 | 7.75 & 8.81 & 13.67 | --epoch 4 |
| Reworked Conformer | Streaming | 8.78 & 10.12 & 16.16 | 8.53 & 9.95 & 15.81 | --epoch 7; latency=320ms |
| Zipformer | Streaming | 8.35 & 9.59 & 16.26 | 8.35 & 9.46 & 15.85 | --epoch 7; latency=320ms |

@csukuangfj
Collaborator

Did you use CTC loss during training?

@pkufool
Collaborator Author

pkufool commented Jun 23, 2023

> Did you use CTC loss during training?

No

@pkufool
Collaborator Author

pkufool commented Jun 23, 2023

@csukuangfj I made some changes to export.py and export-onnx.py (accepting tokens.txt rather than bpe.model) so that they can be shared among different recipes. I think it is better to maintain only one copy of the exporting code.

@csukuangfj
Collaborator

> I think it is better to maintain only one copy of the exporting code.

Agreed. You can use symlinks to avoid additional copies.
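
A sketch of the symlink approach, assuming one recipe directory holds the real export.py and the other links to it. The directory layout is illustrative only and is created inside a temporary directory, so nothing in a real checkout is touched; `link_shared_script` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def link_shared_script(shared: Path, recipe_dir: Path) -> Path:
    """Create recipe_dir/<script name> as a relative symlink to `shared`."""
    recipe_dir.mkdir(parents=True, exist_ok=True)
    link = recipe_dir / shared.name
    # A relative target keeps the link valid when the repository is moved.
    target = os.path.relpath(shared, start=recipe_dir)
    link.symlink_to(target)
    return link


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: the single real copy lives in the librispeech recipe.
    shared = root / "librispeech" / "ASR" / "zipformer" / "export.py"
    shared.parent.mkdir(parents=True)
    shared.write_text("# shared export code\n")

    link = link_shared_script(shared, root / "wenetspeech" / "ASR" / "zipformer")
    is_link = link.is_symlink()
    same_content = link.read_text() == shared.read_text()
    print(is_link, same_content)
```

With this layout there is only one file to maintain, and every recipe that links to it picks up changes automatically.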

@pkufool
Collaborator Author

pkufool commented Jun 23, 2023

> > I think it is better to maintain only one copy of the exporting code.
>
> Agreed. You can use symlinks to avoid additional copies.

Then please have a look at the changes under librispeech/ASR/zipformer to see whether they are OK, and whether I missed any other code that needs to be changed.

@pkufool pkufool added ready and removed ready labels Jun 24, 2023
@danpovey
Collaborator

> > Hi @pkufool, could you talk about why 'blank penalty' can improve the accuracy?
>
> We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

We don't really know why we had to compensate for deletion errors in this particular setup, because we haven't seen this effect in other zipformer examples or in other types of system on this data. If it recurs we may develop a better theory.

@pkufool pkufool added ready and removed ready labels Jun 25, 2023
@pkufool pkufool added ready and removed ready labels Jun 25, 2023
@pkufool pkufool merged commit 219bba1 into k2-fsa:master Jun 26, 2023

params.vocab_size = sp.get_piece_size()
token_table = k2.SymbolTable.from_file(params.tokens)
params.vocab_size = num_tokens(token_table)
Collaborator

It should be

    params.vocab_size = num_tokens(token_table) + 1

+1 is missing.
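
A sketch of why the +1 matters, assuming tokens.txt holds one `symbol id` pair per line with blank at id 0, and assuming the counting helper skips id 0 and disambiguation symbols such as `#0`. `parse_tokens` and `count_tokens` are stand-ins written for this illustration, not the library's own functions, and the token list is a toy example.

```python
import re
from typing import Dict


def parse_tokens(text: str) -> Dict[str, int]:
    """Parse 'symbol id' lines into a symbol -> id mapping."""
    table = {}
    for line in text.strip().splitlines():
        symbol, idx = line.rsplit(maxsplit=1)
        table[symbol] = int(idx)
    return table


def count_tokens(table: Dict[str, int]) -> int:
    """Count real tokens, skipping id 0 and disambiguation symbols (#0, #1, ...).

    Assumption: this mirrors how the helper in the review comment counts.
    """
    disambig = re.compile(r"^#\d+$")
    return len({i for s, i in table.items() if i != 0 and not disambig.match(s)})


tokens_txt = """
<blk> 0
<sos/eos> 1
<unk> 2
你 3
好 4
#0 5
"""

table = parse_tokens(tokens_txt)
vocab_size = count_tokens(table) + 1  # +1 restores the slot for id 0 (blank)
print(count_tokens(table), vocab_size)
```

Without the +1, the output layer would have four units for ids 0 to 4, so the highest token id would fall outside it.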

@OswaldoBornemann

@pkufool So I noticed that you used 6 GPUs to train the zipformer2, and each epoch costs 22 hours. So what kind of GPU did you use?

@yaozengwei
Collaborator

> @pkufool So I noticed that you used 6 GPUs to train the zipformer2, and each epoch costs 22 hours. So what kind of GPU did you use?

32 GB NVIDIA Tesla V100

@OswaldoBornemann

I see. Have you noticed that the latest lhotse makes zipformer training much faster than before? Have you had a similar experience?

@pkufool
Collaborator Author

pkufool commented Dec 25, 2023

> I see. Have you noticed that the latest lhotse makes zipformer training much faster than before? Have you had a similar experience?

We didn't train it recently, will try it. Thanks!

@xingchensong

> > Hi @pkufool, could you talk about why 'blank penalty' can improve the accuracy?
>
> We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

Does the deletion error usually occur at the beginning, middle, or end of the decoded result?
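
One way to answer this from decoding output is to align each reference with its hypothesis and record where the deleted characters sit. The sketch below uses a plain character-level Levenshtein alignment with a backtrace; the function names and the example utterance are made up for illustration, and a real analysis would aggregate over the whole test set.

```python
from typing import Dict, List


def deletion_positions(ref: str, hyp: str) -> List[int]:
    """Indices in ref that a minimum-edit-distance alignment marks as deleted."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match_or_sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(match_or_sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, collecting deletions.
    deleted = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            i, j = i - 1, j - 1  # match or substitution
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deleted.append(i - 1)  # ref[i - 1] has no counterpart in hyp
            i -= 1
        else:
            j -= 1  # insertion
    return sorted(deleted)


def bucket_by_third(positions: List[int], ref_len: int) -> Dict[str, int]:
    """Count positions falling in the first, middle and last third of ref."""
    names = ("beginning", "middle", "end")
    counts = {name: 0 for name in names}
    for p in positions:
        counts[names[min(3 * p // ref_len, 2)]] += 1
    return counts


ref = "嗯我觉得可以啊"  # made-up reference with filler characters at both ends
hyp = "我觉得可以"      # made-up hypothesis that drops them
positions = deletion_positions(ref, hyp)
buckets = bucket_by_third(positions, len(ref))
print(positions, buckets)
```

Summing the bucket counts over all utterances would show whether deletions cluster at utterance boundaries or are spread evenly.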

@xingchensong

In the conformer model, I have similarly encountered an exceptionally high proportion of deletion errors in test_meeting and the majority of these errors consist of omissions of modal particles and redundant characters.

@xingchensong

Hi guys, I added a similar penalty to a CTC-based conformer and found that it is really helpful.

I guess that this is caused by the training dataset (WenetSpeech), which contains many low-quality audio-text pairs.

For more information, please see wenet-e2e/wenet#2278

@xingchensong

cc @pkufool @danpovey

@xingchensong

This is a very interesting phenomenon, and I believe it's worth our time to delve deeper into the underlying principles together.

@pkufool
Collaborator Author

pkufool commented Jan 5, 2024

@xingchensong FYI, blank-penalty does not help on the zipformer large model (around 148M params). Yes, a very interesting phenomenon.

@pkufool
Collaborator Author

pkufool commented Jan 5, 2024

@xingchensong What's your model size? We found that the small and medium zipformer models need blank-penalty, but the large model (more powerful?) does not. Maybe you can try increasing the number of parameters to see whether the same holds for your model.

@xingchensong

> @xingchensong What's your model size? We found that the small and medium zipformer models need blank-penalty, but the large model (more powerful?) does not. Maybe you can try increasing the number of parameters to see whether the same holds for your model.

116.9M parameters, trained in unified streaming and non-streaming mode.

https://e2qq6pi6j9.feishu.cn/docx/EFpod2n30omXITx08OAcMSjlnxd

@Alex2025Job

Hi, I used 4 GPUs to train the streaming zipformer model (RNN-T loss) on WenetSpeech. The parameter settings are the same as those provided on Hugging Face (https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/tree/main/logs/training), and the training/validation loss looks good, but the test results (WER) are not as good as the pre-trained model's: about 1-4% higher WER than the models provided on Hugging Face at each epoch. For example, at epoch 9 (avg=1; chunk-size=32; left-context-frames=256) I got 19.6% WER on the MEETING test set.
The only difference I see is that I used 4 GPUs while the pre-trained model used 8. Is there any parameter that should be tuned manually based on the number of GPUs, or are there other possible reasons?
[Image: zipformer_loss]

@yaozengwei
Collaborator

> Hi, I used 4 GPUs to train the streaming zipformer model (RNN-T loss) on WenetSpeech. The parameter settings are the same as those provided on Hugging Face (https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/tree/main/logs/training), and the training/validation loss looks good, but the test results (WER) are not as good as the pre-trained model's: about 1-4% higher WER than the models provided on Hugging Face at each epoch. For example, at epoch 9 (avg=1; chunk-size=32; left-context-frames=256) I got 19.6% WER on the MEETING test set. The only difference I see is that I used 4 GPUs while the pre-trained model used 8. Is there any parameter that should be tuned manually based on the number of GPUs, or are there other possible reasons?

What value of max-duration are you using? For the pretrained model, what's the result at epoch 9 with the decoding setup: 1 average, chunk-size=32; left-context-frames=256?
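
The max-duration question matters because, assuming each GPU draws its own batch of up to max-duration seconds of audio, the amount of audio seen per optimizer step scales with the number of GPUs. The arithmetic below uses the world-size and max-duration values from the streaming training command in this PR; `audio_seconds_per_step` is a hypothetical helper, and whether this difference explains the reported gap is not established in the thread.

```python
def audio_seconds_per_step(world_size: int, max_duration: float) -> float:
    """Upper bound on audio per optimizer step if every GPU fills its batch."""
    return world_size * max_duration


released_setup = audio_seconds_per_step(world_size=8, max_duration=450)
four_gpu_setup = audio_seconds_per_step(world_size=4, max_duration=450)

# max-duration that would give 4 GPUs the same audio per step as 8 GPUs,
# ignoring whether such a batch fits in GPU memory.
matching_max_duration = released_setup / 4

print(released_setup, four_gpu_setup, matching_max_duration)
```

With unchanged max-duration, the 4-GPU run sees half as much audio per step, so its learning-rate schedule advances through the data at a different pace than in the released run.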
