About using random combiner to train a narrower and deeper conformer #431

Open
yaozengwei opened this issue Jun 17, 2022 · 5 comments
@yaozengwei
Collaborator

@csukuangfj Following @danpovey's advice, I did some experiments on pruned_transducer_stateless5, with the Medium model as in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#medium.

Here are some results with modifications of the RandomCombine class (final_weight=0.5, pure_prob=0.333).

  • train on full LibriSpeech, decode with epoch-32-avg-10, use averaged model
  1. random-combine from layer-0, 2.97 & 7.3
  2. random-combine from layer-0, no linear layers in RandomCombine class, 3.02 & 7.29
  3. random-combine from layer-4, 2.98 & 7.23
  4. random-combine from layer-4, no linear layers in RandomCombine class, 2.88 & 6.89
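
For context, here is a minimal sketch (PyTorch, not the actual icefall implementation) of the "no linear layers" RandomCombine variant compared above: during training the outputs of the selected encoder layers are combined either by picking a single "pure" layer (with probability pure_prob, choosing the final layer with probability final_weight) or by a random convex mixture that gives the final layer weight final_weight; at test time only the final layer's output is used. Apart from final_weight=0.5 and pure_prob=0.333, the class name, shapes, and sampling details are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RandomCombineSketch(nn.Module):
    """Simplified stand-in for a RandomCombine-style module, without the
    per-layer linear projections (the "no linear layers" variant above)."""

    def __init__(self, num_inputs: int, final_weight: float = 0.5,
                 pure_prob: float = 0.333):
        super().__init__()
        assert num_inputs >= 2
        self.num_inputs = num_inputs
        self.final_weight = final_weight  # weight/probability given to the final layer
        self.pure_prob = pure_prob        # probability of using a single "pure" layer

    def forward(self, inputs):
        # inputs: list of (T, N, C) tensors from the chosen encoder layers,
        # ordered so that the final encoder layer comes last.
        if not self.training:
            # Inference: just return the final layer's output.
            return inputs[-1]
        stacked = torch.stack(inputs, dim=0)  # (num_inputs, T, N, C)
        if torch.rand(()) < self.pure_prob:
            # Use a single layer: the final layer with prob. final_weight,
            # otherwise a uniformly chosen earlier layer.
            if torch.rand(()) < self.final_weight:
                idx = self.num_inputs - 1
            else:
                idx = int(torch.randint(0, self.num_inputs - 1, ()))
            return stacked[idx]
        # Otherwise mix: the final layer gets final_weight, and the remaining
        # mass is spread randomly over the earlier layers, so weights sum to 1.
        rest = torch.softmax(torch.randn(self.num_inputs - 1), dim=0)
        weights = torch.cat([rest * (1.0 - self.final_weight),
                             torch.tensor([self.final_weight])])
        return (stacked * weights.view(-1, 1, 1, 1)).sum(dim=0)
```

Row 4 above (layer-4, no linear layers) would correspond to feeding such a module only the outputs from layer 4 onward and dropping the per-layer projections.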
@csukuangfj
Collaborator

I think you can try a larger model with the 4th setting.

@yaozengwei
Collaborator Author

@danpovey @csukuangfj
For the Baseline model as in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#baseline-2, trained on full LibriSpeech, using the averaged model:

  • decode with epoch-30-avg-10
  1. random-combine from layer-0, 2.49 & 5.75
  2. random-combine from layer-4, no linear layers in RandomCombine class, 2.54 & 5.72
  • decode with epoch-30-avg-17
  1. random-combine from layer-0, 2.51 & 5.73
  2. random-combine from layer-4, no linear layers in RandomCombine class, 2.52 & 5.7

@danpovey
Collaborator

OK, well even though it's not better, it should at least be a little faster. It is also less likely to cause problems for fixed-point operation: the linear layer can potentially lead to large activations.
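
A quick way to see the fixed-point point: with convex weights (non-negative, summing to 1) and no linear projection, the combined activation is bounded by the largest input activation, whereas an extra linear layer has no such bound. A tiny self-contained illustration (not icefall code, values made up):

```python
import torch

# Three fake "layer outputs" with increasing scale.
layer_outputs = torch.stack([torch.randn(4, 8) * s for s in (1.0, 2.0, 3.0)])

# Convex weights (non-negative, summing to 1), as in the no-linear-layer combine.
w = torch.softmax(torch.randn(3), dim=0)
combined = (layer_outputs * w.view(-1, 1, 1)).sum(dim=0)

# The convex combination never exceeds the largest input activation.
print(combined.abs().max() <= layer_outputs.abs().max())  # True
```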

@yaozengwei
Collaborator Author

> OK, well even though it's not better, it should at least be a little faster. It is also less likely to cause problems for fixed-point operation: the linear layer can potentially lead to large activations.

I see. Do I need to create a PR for the above modification?

@danpovey
Collaborator

Yes, please.
