
[Streaming] Reproducing LibriSpeech results - RNN-T Emformer #383

funboarder13920 opened this issue May 24, 2022 · 8 comments

@funboarder13920 commented May 24, 2022

Hello,

I tried to reproduce your results on LibriSpeech streaming: #278 (comment)
I have not finished my hyperparameter search, but I was not able to get even close to the reported results.

Do you recall the configuration you used for this training?

Best,

@csukuangfj (Collaborator)

The model was trained using #278
with the following training command:

./transducer_emformer/train.py \
  --world-size 8 \
  --num-epochs 65 \
  --start-epoch 0 \
  --exp-dir transducer_emformer/exp-full \
  --full-libri 1 \
  --max-duration 200 \
  --prune-range 5 \
  --lr-factor 5 \
  --lm-scale 0.25 \
  --master-port 12358 \
  --num-encoder-layers 18 \
  --left-context-length 128 \
  --segment-length 8 \
  --right-context-length 4

I am using --epoch 52 --avg 12 for decoding and testing.
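
For reference, --epoch 52 --avg 12 means the parameters of the last 12 epoch checkpoints (epoch-41.pt through epoch-52.pt) are averaged before decoding. Below is a minimal sketch of that averaging, in the spirit of icefall's average_checkpoints; the exact file layout and the handling of integer tensors are assumptions.

import torch

def average_checkpoints(filenames):
    """Average the 'model' state dicts stored in the given checkpoints."""
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] = avg[k] + state[k]
    for k in avg:
        # True division promotes integer buffers to float; a real
        # implementation may need to treat integer tensors separately.
        avg[k] = avg[k] / len(filenames)
    return avg

ckpts = [f"transducer_emformer/exp-full/epoch-{i}.pt" for i in range(41, 53)]
avg_state = average_checkpoints(ckpts)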

@funboarder13920 (Author) commented May 24, 2022

Thank you.
Did you disable time warp in specaug?

@csukuangfj (Collaborator)

> Thank you. Did you disable time warp in specaug?

No, time warp is used, as in the other recipes in icefall. You can see that asr_datamodule.py is a symlink.
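
For reference, a minimal sketch of how time warp is toggled in lhotse's SpecAugment, which the asr_datamodule.py symlink ultimately configures; the mask values below are illustrative defaults, not necessarily this recipe's settings.

from lhotse.dataset import SpecAugment

# Time warp enabled, roughly as in the icefall recipes:
spec_aug = SpecAugment(
    time_warp_factor=80,   # warp the time axis by up to ~80 frames
    num_frame_masks=2,     # mask a couple of time spans
    num_feature_masks=2,   # mask a couple of frequency bands
)

# Disabling time warp: setting the factor to None (or below 1) is
# expected to skip the warping step while keeping the masks.
spec_aug_no_warp = SpecAugment(time_warp_factor=None)

# Usage: features is a (batch, num_frames, num_features) tensor.
# augmented = spec_aug(features)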

@csukuangfj (Collaborator)

> I have not finished my hyperparameter search, but I was not able to get even close to the reported results.

Please use the changes from that PR directly, not the latest master, and not the latest streaming branch.

I find that #358 makes the WER slightly worse.

@danpovey (Collaborator)

Back when @glynpu was doing streaming stuff based on WeNet ideas, he found that it was necessary to append some silence to force out the final symbols. That is probably why the padding fix is hurting.

@glynpu (Collaborator) commented May 24, 2022

> it was necessary to append some silence to force out the final symbols

Yes, dummy extra trailing silence indeed helps, at least for that model. See #242.

@csukuangfj (Collaborator) commented May 24, 2022

@danpovey @glynpu

Thanks!

I looked at the decoding results. At the ends of sentences, some tokens at the end of a word are missing.

After applying tail padding (with length equal to left_context_length), the problem is mitigated.

To be concrete, the WER for --epoch 26 --avg 6 decreases from 5.12 to 4.66 after using tail padding.

Attached are the decoding results before and after tail padding.

after-tail-padding-errs-test-clean-greedy_search-epoch-26-avg-6-context-2-max-sym-per-frame-1.txt
before-tail-padding-errs-test-clean-greedy_search-epoch-26-avg-6-context-2-max-sym-per-frame-1.txt


See #384
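
A minimal sketch of the tail-padding idea (the actual change is in #384; the padding value and helper name below are assumptions): frames that look like silence in log-mel space are appended, forcing the model to emit the final symbols.

import math
import torch

LOG_EPS = math.log(1e-10)  # assumed "silence" value for log-mel features

def pad_tail(features: torch.Tensor, left_context_length: int) -> torch.Tensor:
    """features: (batch, num_frames, num_features) log-mel filterbanks."""
    batch, _, num_features = features.shape
    tail = torch.full(
        (batch, left_context_length, num_features),
        fill_value=LOG_EPS,
        dtype=features.dtype,
        device=features.device,
    )
    return torch.cat([features, tail], dim=1)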

@danpovey (Collaborator)

Cool!
BTW, I suspect the reason the padding frames were degrading WER is that we apply the padding mask at test time. Without the padding mask, I suspect it would be more robust to trailing silence.
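
For context, a minimal sketch of the kind of padding mask being discussed (this helper is illustrative, not the recipe's exact code): frames beyond each utterance's recorded length are masked out, so appended silence is ignored unless the lengths are updated too.

import torch

def make_pad_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a (batch, max_len) bool mask; True marks padded frames."""
    steps = torch.arange(max_len, device=lengths.device)
    return steps.unsqueeze(0) >= lengths.unsqueeze(1)

lengths = torch.tensor([95, 100])           # true frame counts per utterance
mask = make_pad_mask(lengths, max_len=100)  # True where frames are padding
# If trailing silence is appended but lengths is not increased, those
# frames are masked away at test time, which may explain the degradation.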
