An overview of the current recommended k2 setup #1517
Hi Omar, I hope the following answers can resolve your questions.
We recommend using Zipformer (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer), which is our latest audio encoder. It achieves state-of-the-art results on many ASR benchmarks while being surprisingly efficient. Please refer to our paper: https://arxiv.org/abs/2310.11230
BPE is now the most common tokenization scheme.
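To make the BPE point concrete, here is a minimal SentencePiece sketch; the icefall recipes prepare a BPE model in a similar spirit, but the file names and vocabulary size below are illustrative placeholders, not the recipes' exact settings:

```python
# Minimal SentencePiece BPE sketch; "transcripts.txt" and vocab_size=500
# are illustrative placeholders, not the recipes' actual settings.
import sentencepiece as spm

# Train a BPE model on a plain-text file with one transcript per line.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="bpe",
    vocab_size=500,
    model_type="bpe",
)

# Load the trained model and tokenize a transcript into BPE pieces.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("HELLO WORLD", out_type=str))  # e.g. ['▁HE', 'LLO', '▁WOR', 'LD']
```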
For the transducer, we support three decoding methods: greedy_search, modified_beam_search, and fast_beam_search. You can generate N-best hypotheses with the beam-search-based methods.
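For intuition only, here is a toy frame-synchronous beam search in Python. It is not icefall's modified_beam_search (it ignores transducer-specific details such as the blank symbol and the joiner network), but it shows how keeping a beam of partial hypotheses yields an N-best list rather than the single output of greedy search:

```python
# Toy frame-synchronous beam search, for intuition only.
import math
from typing import List, Tuple

def toy_beam_search(
    log_probs: List[List[float]], beam: int = 4
) -> List[Tuple[List[int], float]]:
    """log_probs[t][v] is the log-probability of token v at frame t."""
    hyps: List[Tuple[List[int], float]] = [([], 0.0)]  # (token sequence, total log-prob)
    for frame in log_probs:
        candidates = []
        for tokens, score in hyps:
            for v, lp in enumerate(frame):
                candidates.append((tokens + [v], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        hyps = candidates[:beam]  # keep only the `beam` best partial hypotheses
    return hyps                   # an N-best list (N == beam)

if __name__ == "__main__":
    # Two frames over a vocabulary of 3 tokens.
    log_probs = [
        [math.log(0.6), math.log(0.3), math.log(0.1)],
        [math.log(0.5), math.log(0.4), math.log(0.1)],
    ]
    for tokens, score in toy_beam_search(log_probs, beam=3):
        print(tokens, round(score, 3))
```

With `beam=1` this degenerates to greedy search, which is why only the beam-search methods can return multiple hypotheses.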
LODR yields the best performance in icefall in terms of language model integration. You may have a look at this page: https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for more details about the different LM integration methods in icefall.
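As a rough illustration of the LODR idea (a density-ratio style correction using a low-order n-gram), the per-token score during beam search combines the transducer score with an external neural LM while subtracting a low-order source-domain n-gram. The function and scale values below are illustrative only, not icefall's actual code or defaults:

```python
# Conceptual sketch of the LODR scoring rule (not icefall's implementation).
# When extending a hypothesis with token y during beam search, the transducer
# score is combined with an external neural LM, while a low-order (e.g. bigram)
# n-gram LM trained on the source domain is subtracted to approximate the
# source-domain prior (density-ratio idea).
def lodr_token_score(
    log_p_transducer: float,    # transducer log-prob of token y
    log_p_neural_lm: float,     # external neural LM log-prob of token y
    log_p_bigram: float,        # low-order source-domain n-gram log-prob of y
    lm_scale: float = 0.4,      # illustrative scales; in practice these are
    bigram_scale: float = 0.16, # tunable hyperparameters
) -> float:
    return log_p_transducer + lm_scale * log_p_neural_lm - bigram_scale * log_p_bigram
```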
Yes, 1700 hrs of general-domain data + 100 hrs of domain-specific data seems enough to build a competitive system. You can train a good LM if you have a lot of text data. You may first train the model on the 1700 hrs of general-domain data and then fine-tune it on the 100 hrs of domain-specific data (see our recent PR: #1484).
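The following is a generic PyTorch sketch of that workflow; it is not the fine-tuning script from PR #1484, and the toy linear model stands in for the real network. The point is only the order of operations: load the checkpoint trained on the general-domain data, then continue training on the domain-specific data with a small learning rate:

```python
# Generic fine-tuning sketch (not the script from PR #1484).
# "general_domain.pt" is a hypothetical checkpoint saved as a plain state_dict,
# and nn.Linear is a stand-in for the real encoder/decoder.
import torch
import torch.nn as nn

def fine_tune(checkpoint_path: str, domain_batches, lr: float = 1e-5):
    model = nn.Linear(80, 500)                              # toy stand-in model
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state)                            # start from general-domain weights
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # small LR for fine-tuning
    loss_fn = nn.MSELoss()
    model.train()
    for features, targets in domain_batches:                # the domain-specific data
        loss = loss_fn(model(features), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```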
We didn't test it thoroughly, but it should be possible. We trained on Libriheavy (an English dataset with punctuation, https://arxiv.org/abs/2309.08105) and the model produces punctuation normally, but we didn't evaluate the quality of the punctuation very carefully. If you have further questions, feel free to ask here.
On the punctuation issue, I'm strongly in favor of training on punctuated text. I've never seen any problems; the model seems to handle it fine, and it makes more sense IMO than trying to add the punctuation later on.
Thank you for the detailed answers @marcoyang1998 @danpovey! This is super helpful and I think a good reference for others as well. One doubt I still have is which recipe this relates to (considering the LibriSpeech recipes). I think the relevant ones are the Zipformer-based recipes, but I'm not completely sure what the differences between them are. From your answer I gathered that the transducer model is inherently suitable for streaming, which is why I'm also a bit confused about the differences between these recipes. Thanks again!
Hi Omar, we have two versions of Zipformer, and both exist under the LibriSpeech recipes. Please find below a description of the aforementioned recipes (you may also refer to this README).
As said, zipformer is the latest Zipformer and will be actively maintained, so we strongly recommend you start with it.
Thanks Marco, that helps a lot! I will start with the zipformer recipe. I have just a couple of questions about the deployment side (maybe these are better suited for the sherpa repos).
Thanks again 🙏
We support RNNLM rescoring in sherpa-onnx (see k2-fsa/sherpa-onnx#125). Fangjun @csukuangfj is more familiar with sherpa-related questions; could you have a look?
For the two-pass example, it consists of two models: a small streaming model for the first pass and a (usually larger, possibly non-streaming) model for the second pass.
Usually, the first-pass model is very small and runs very fast. There are two goals of the first-pass model: to detect endpoints, and to return recognition results with very low latency.
As soon as an endpoint is detected by the first-pass model, the segment between the two endpoints is sent to the second-pass model for a more accurate final result. To give you a concrete example, consider the Whisper model, which does not support streaming speech recognition: it can still serve as the second pass, while a small streaming model handles the first pass. A minimal sketch of this flow is below.
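The sketch below only illustrates the control flow described above. `first_pass_is_endpoint` and `second_pass_transcribe` are hypothetical stand-ins (a real first pass would be a small streaming ASR model with proper endpointing, and the second pass could be a non-streaming model such as Whisper); this is not sherpa code:

```python
# Minimal two-pass control-flow sketch with hypothetical placeholder functions.
from typing import Iterable, List

def first_pass_is_endpoint(chunk: List[float], threshold: float = 0.01) -> bool:
    """Toy endpoint rule: treat a near-silent chunk as an endpoint."""
    return sum(abs(x) for x in chunk) / max(len(chunk), 1) < threshold

def second_pass_transcribe(segment: List[float]) -> str:
    """Hypothetical non-streaming second pass; a real system would run an
    accurate offline model (e.g. Whisper) on the buffered segment here."""
    return f"<transcript of {len(segment)} samples>"

def two_pass(chunks: Iterable[List[float]]) -> Iterable[str]:
    buffered: List[float] = []
    for chunk in chunks:                            # audio arrives in small chunks
        buffered.extend(chunk)
        if first_pass_is_endpoint(chunk):           # endpoint detected by first pass
            yield second_pass_transcribe(buffered)  # re-decode the whole segment
            buffered = []                           # start the next segment
    if buffered:                                    # flush whatever remains
        yield second_pass_transcribe(buffered)
```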
Thanks guys, that's really helpful. I'll let you know if anything else comes up :)
Hi, thanks for the great work on this library! We're currently using Kaldi in production and are looking to migrate to k2 / icefall. There are quite a lot of different models available already (great!), but this also makes it a bit difficult to understand what the current recommended approach and best practices are for implementing k2 for training and deployment: choices such as the encoder architecture, CTC vs. transducer, decoding strategy, how well punctuation works if it is added during training, etc. I know this is heavily use-case-dependent, so I'll try to sketch ours.
For context, our use case:
It would be immensely appreciated if someone could give some recommendations and point us in the right direction, considering this use case, especially concerning the following choices:
Thank you very much in advance! I'd also be happy to create a PR to augment the README or docs of this repo with any provided information if this is deemed useful for others.