
[Streaming] Reproducing LibriSpeech results - RNN-T Emformer #383

funboarder13920 opened this issue May 24, 2022 · 8 comments

@funboarder13920 commented May 24, 2022

Hello,

I tried to reproduce your results on LibriSpeech streaming: #278 (comment)
I have not finished my hyperparameter search, but I was not able to get even close to the reported results.

Do you recall the configuration you used for this training?

Best,

@csukuangfj (Collaborator)

The model was trained using #278
with the following training command:

./transducer_emformer/train.py \
  --world-size 8 \
  --num-epochs 65 \
  --start-epoch 0 \
  --exp-dir transducer_emformer/exp-full \
  --full-libri 1 \
  --max-duration 200 \
  --prune-range 5 \
  --lr-factor 5 \
  --lm-scale 0.25 \
  --master-port 12358 \
  --num-encoder-layers 18 \
  --left-context-length 128 \
  --segment-length 8 \
  --right-context-length 4

I am using --epoch 52 --avg 12 for decoding and testing.
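
For reference, --epoch 52 --avg 12 means the parameters of the last 12 epoch checkpoints (epoch-41.pt through epoch-52.pt) are averaged before decoding. Below is a minimal sketch of that averaging, in the spirit of icefall's average_checkpoints; the exact file layout and the handling of integer tensors are assumptions.

import torch

def average_checkpoints(filenames):
    """Average the 'model' state dicts stored in the given checkpoints."""
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] = avg[k] + state[k]
    for k in avg:
        # True division promotes integer buffers to float; a real
        # implementation may need to treat integer tensors separately.
        avg[k] = avg[k] / len(filenames)
    return avg

ckpts = [f"transducer_emformer/exp-full/epoch-{i}.pt" for i in range(41, 53)]
avg_state = average_checkpoints(ckpts)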

@funboarder13920 (Author) commented May 24, 2022

Thank you.
Did you disable time warp in specaug?

@csukuangfj (Collaborator)

> Thank you. Did you disable time warp in specaug?

No, time warp is used, as in the other recipes in icefall. You can see that asr_datamodule.py is a symlink.
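
For reference, a minimal sketch of how time warp is toggled in lhotse's SpecAugment, which the asr_datamodule.py symlink ultimately configures; the mask values below are illustrative defaults, not necessarily this recipe's settings.

from lhotse.dataset import SpecAugment

# Time warp enabled, roughly as in the icefall recipes:
spec_aug = SpecAugment(
    time_warp_factor=80,   # warp the time axis by up to ~80 frames
    num_frame_masks=2,     # mask a couple of time spans
    num_feature_masks=2,   # mask a couple of frequency bands
)

# Disabling time warp: setting the factor to None (or below 1) is
# expected to skip the warping step while keeping the masks.
spec_aug_no_warp = SpecAugment(time_warp_factor=None)

# Usage: features is a (batch, num_frames, num_features) tensor.
# augmented = spec_aug(features)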

@csukuangfj (Collaborator)

> I have not finished my hyperparameter search, but I was not able to get even close to the reported results.

Please use the changes from that PR directly, not the latest master, and not the latest streaming branch.

I find that #358 makes the WER slightly worse.

@danpovey (Collaborator)

Back when @glynpu was doing streaming stuff based on WeNet ideas, he found that it was necessary to append some silence to force out the final symbols. That is probably why the padding fix is hurting.

@glynpu (Collaborator) commented May 24, 2022

> it was necessary to append some silence to force out the final symbols

Yes, dummy extra trailing silence indeed helps, at least for that model. See #242.

@csukuangfj (Collaborator) commented May 24, 2022

@danpovey @glynpu

Thanks!

I looked at the decoding results. At the ends of sentences, some tokens at the end of a word are missing.

After applying tail padding (with length equal to left_context_length), the problem is mitigated.

To be concrete, the WER for --epoch 26 --avg 6 decreases from 5.12 to 4.66 after using tail padding.

Attached are the decoding results before and after tail padding.

after-tail-padding-errs-test-clean-greedy_search-epoch-26-avg-6-context-2-max-sym-per-frame-1.txt
before-tail-padding-errs-test-clean-greedy_search-epoch-26-avg-6-context-2-max-sym-per-frame-1.txt


See #384
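
A minimal sketch of the tail-padding idea (the actual change is in #384; the padding value and helper name below are assumptions): frames that look like silence in log-mel space are appended, forcing the model to emit the final symbols.

import math
import torch

LOG_EPS = math.log(1e-10)  # assumed "silence" value for log-mel features

def pad_tail(features: torch.Tensor, left_context_length: int) -> torch.Tensor:
    """features: (batch, num_frames, num_features) log-mel filterbanks."""
    batch, _, num_features = features.shape
    tail = torch.full(
        (batch, left_context_length, num_features),
        fill_value=LOG_EPS,
        dtype=features.dtype,
        device=features.device,
    )
    return torch.cat([features, tail], dim=1)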

@danpovey (Collaborator)

Cool!
BTW, I suspect the reason the padding frames were degrading WER is that we apply the padding mask at test time. Without the padding mask, I suspect it would be more robust to trailing silence.
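
For context, a minimal sketch of the kind of padding mask being discussed (this helper is illustrative, not the recipe's exact code): frames beyond each utterance's recorded length are masked out, so appended silence is ignored unless the lengths are updated too.

import torch

def make_pad_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a (batch, max_len) bool mask; True marks padded frames."""
    steps = torch.arange(max_len, device=lengths.device)
    return steps.unsqueeze(0) >= lengths.unsqueeze(1)

lengths = torch.tensor([95, 100])           # true frame counts per utterance
mask = make_pad_mask(lengths, max_len=100)  # True where frames are padding
# If trailing silence is appended but lengths is not increased, those
# frames are masked away at test time, which may explain the degradation.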
