
An overview of the current recommended k2 setup #1517

Closed
donn-omar opened this issue Feb 23, 2024 · 8 comments

@donn-omar commented Feb 23, 2024

Hi, thanks for the great work on this library! We're currently using Kaldi in production and are looking to migrate to k2 / icefall. There are quite a lot of different models available already (great!), but this also makes it a bit difficult to understand what the current recommended approach and best practices are for using k2 for training and deployment. This involves choices such as the encoder architecture, CTC vs. transducer, decoding strategy, how well punctuation works when it is included during training, etc. I know this is heavily use-case-dependent, so I'll try to sketch ours.

For context, our use case:

  • We are transcribing Dutch, mostly focused on the healthcare domain.
  • We need to support streaming ASR (and might need async as well). Low latency is important.
  • We have ~1700 hours of general Dutch spoken data (mostly not punctuated; some punctuated), ~100 hours in-domain annotated (punctuated), ~1000+ hours in-domain unannotated.
  • We have a lot of in-domain text data. Using this text for Kaldi's n-gram language model has greatly improved its performance. If possible, we want to use a language model for rescoring the output; the hybrid model works great because we have a lot of very specific words, e.g. medicine names, that can be boosted through textual data.

It would be immensely appreciated if someone could give some recommendations and point us in the right direction, considering this use case. Especially concerning the following choices:

  • CTC vs Transducer architecture and why?
  • What encoder to use? (I suspect Zipformer is the best / newest?)
  • Tokenization scheme (bpe vs word vs phoneme?)
  • What decoding scheme currently works best (and how to combine it with LM rescoring)? One scheme I've seen that seems to fit well is using CTC and an n-gram model to generate a list of top hypotheses, then rescoring the complete hypotheses with an attention model / LM.
  • What is the current recommended approach to incorporate a separate LM model (maybe a RNN or BERT model) into the ASR model's predictions?
  • Does the amount of training data we have seem sufficient to train a model that would be competitive with our Kaldi setup?
  • We want to show punctuated text to users. Currently we use a punctuation restoration model that works solely on text. Is icefall able to handle training directly on punctuated text? How well does this work in practice?

Thank you very much in advance! I'd also be happy to create a PR to augment the README or docs of this repo with any provided information, if this is deemed useful for others.

@marcoyang1998 (Collaborator)

Hi Omar,

I hope the following answers can resolve your questions.

CTC vs Transducer architecture and why?
We recommend the transducer because it has shown better performance on a wide range of benchmarks. Both models support streaming, and the transducer removes CTC's conditional-independence assumption, which generally makes it the stronger ASR architecture.

What encoder to use?

We recommend using Zipformer (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer), which is our latest audio encoder. It achieves state-of-the-art results on many ASR benchmarks while being surprisingly efficient. Please refer to our paper: https://arxiv.org/abs/2310.11230

Tokenization scheme (bpe vs word vs phoneme?)

BPE is now the most common tokenization scheme.
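
As a concrete illustration, icefall's BPE-based recipes train a sentencepiece BPE model on the normalized training transcripts as part of data preparation; a minimal sketch of that step is below (the file paths and vocab size are placeholders):

```python
# Minimal sketch of BPE tokenization with sentencepiece (the library the
# BPE-based icefall recipes rely on). Paths and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/lang_bpe/transcript_words.txt",  # one normalized transcript per line
    model_prefix="data/lang_bpe/bpe",            # writes bpe.model and bpe.vocab
    vocab_size=500,                              # common choices are 500 / 2000 / 5000
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="data/lang_bpe/bpe.model")
print(sp.encode("dit is een voorbeeldzin", out_type=str))  # -> list of BPE pieces
```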

What decoding scheme currently works best?

For the transducer, we support three decoding methods: greedy_search, modified_beam_search, and fast_beam_search. You can generate N-best hypotheses with the beam-search methods and rescore them using either an n-gram language model or an RNN LM.
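
To make the rescoring step concrete, here is a small illustrative sketch (not icefall's actual implementation) of reranking an N-best list by interpolating the ASR score with an external LM score; lm_log_prob is a hypothetical placeholder for an n-gram or RNN LM scorer, and lm_scale would be tuned on a dev set:

```python
# Illustration of N-best rescoring: add a scaled external-LM score to each
# hypothesis' ASR score and rerank. lm_log_prob() is a hypothetical placeholder.
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],        # (hypothesis, ASR log-score) pairs
    lm_log_prob: Callable[[str], float],   # external LM scorer (n-gram or RNN LM)
    lm_scale: float = 0.3,                 # tuned on a dev set in practice
) -> List[Tuple[str, float]]:
    rescored = [(hyp, s + lm_scale * lm_log_prob(hyp)) for hyp, s in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Toy usage with a dummy LM that prefers hypotheses with fewer tokens:
nbest = [("paracetamol 500 mg", -12.3), ("para seta mol 500 mg", -12.1)]
print(rescore_nbest(nbest, lm_log_prob=lambda s: -len(s.split()))[0][0])
```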

What is the current recommended approach to incorporate a separate LM model (maybe a RNN or BERT model) into the ASR model's predictions?

LODR (low-order density ratio) yields the best LM-integration performance in icefall. Please have a look at https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for details about the different LM integration methods in icefall.
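
Conceptually, LODR does shallow fusion with an external neural LM while subtracting a low-order (e.g. bigram) LM estimated on the ASR training transcripts, to compensate for the transducer's internal LM. Below is only a rough sketch of that score combination; the exact formulation and recommended scales are in the docs linked above, and the scale values here are placeholders:

```python
# Rough sketch of a LODR-style score combination. The three log-probabilities
# come from the transducer itself, an external neural LM, and a low-order
# (e.g. bigram) LM trained on the ASR training transcripts; scales are placeholders.
def lodr_score(
    asr_logp: float,
    nnlm_logp: float,
    low_order_lm_logp: float,
    nnlm_scale: float = 0.4,
    low_order_scale: float = 0.15,
) -> float:
    # Add the external neural LM, subtract the low-order LM to cancel the
    # transducer's implicit internal language model.
    return asr_logp + nnlm_scale * nnlm_logp - low_order_scale * low_order_lm_logp
```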

Does the amount of training data we have seem sufficient to train a model that would be competitive with our Kaldi setup?

Yes, 1700 hrs of general-domain data plus 100 hrs of domain-specific data seems enough to build a competitive system. Since you have a lot of text data, you can also train a good LM. You may first train the model on the 1700 hrs of general-domain data and then fine-tune it on the 100 hrs of domain-specific data (see our recent PR: #1484).
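
In essence, fine-tuning just continues training from the general-domain checkpoint on the in-domain data with a smaller learning rate (the linked PR adds a proper recipe for this). A generic PyTorch sketch of the idea, with the model, dataloader, checkpoint path, and loss call as hypothetical placeholders:

```python
# Generic sketch of fine-tuning: load a general-domain checkpoint and continue
# training on in-domain data with a smaller learning rate. This is not icefall's
# actual recipe; model, dataloader, and checkpoint layout are placeholders.
import torch

def finetune(model: torch.nn.Module, ckpt_path: str, dataloader, lr: float = 1e-5) -> None:
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])  # assuming weights are stored under "model"
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # smaller LR than from-scratch
    model.train()
    for batch in dataloader:              # in-domain data
        loss = model(**batch)             # e.g. the transducer loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```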

Is icefall able to handle training directly on punctuated text? How well does this work in practice?

We didn't test it thoroughly, but it should be possible. We trained on Libriheavy (an English dataset with punctuation, https://arxiv.org/abs/2309.08105) and the model produces punctuation normally, but we didn't evaluate the quality of the punctuation very carefully.

If you have further questions, feel free to ask here.

@danpovey (Collaborator)

On the punctuation issue, I'm strongly in favor of training on punctuated text. I've never seen any problems; it seems to handle it fine, and it makes more sense IMO than trying to add the punctuation later on.
The only problem in theory is that, since we use "stateless" transducers, the model does not see the distant history, so it is at a disadvantage in matching quotation-mark styles. But hopefully it will just pick the most common one. If the decoding is done with a neural LM using some kind of shallow fusion or LODR, this shouldn't be a problem though.

@donn-omar (Author) commented Feb 26, 2024

Thank you for the detailed answers @marcoyang1998 @danpovey! This is super helpful and I think a good reference for others as well.

One doubt I still have is which recipe this relates to (considering the LibriSpeech recipes). I think the relevant ones are the zipformer and pruned_transducer_statelessX recipes, but I'm not completely sure what the difference between them is.

From your answer I gathered that the transducer model is inherently suitable for streaming, which is why I'm also a bit confused about the difference between pruned_transducer_stateless8 and pruned_transducer_stateless7_streaming. Is there a specific recipe you would recommend starting out with?

Thanks again!

EDIT:
I'm sorry, I should've taken a better look at the README of the LibriSpeech recipes, which says that zipformer is indeed the latest recipe (though the difference between zipformer, streaming zipformer, and upgraded zipformer is still not completely clear to me).

@marcoyang1998 (Collaborator)

Hi Omar,

We have two versions of Zipformer, and both live under the LibriSpeech recipes. Please find below a description of the aforementioned recipes (you may refer to the LibriSpeech README as well):

  • pruned_transducer_stateless7: This is the first version of Zipformer (Zipformer 1.0); you can only train a non-streaming Zipformer model with this recipe.
  • pruned_transducer_stateless7_streaming: This is the streaming version of Zipformer 1.0, hence the suffix streaming.
  • zipformer: This is the latest version of Zipformer; this recipe supports training both streaming and non-streaming models (controlled by --causal in train.py).

As mentioned, zipformer is the latest version of Zipformer and will be actively maintained, so we strongly recommend starting with it.

@donn-omar (Author)

Thanks Marco, that helps a lot! I will start with the Zipformer then.

I have just a couple of questions about the deployment side (maybe these belong in the sherpa repo, but I think they fit well here):

  • In the recipes in the icefall repo, there are options for decoding using external language models, e.g. modified_beam_search_LODR, modified_beam_search_lm_rescore, modified_beam_search_lm_rescore_LODR. However, in the sherpa/sherpa-onnx docs I don't see these being mentioned, just modified_beam_search. What is the difference between the implementation in sherpa and the implementation in icefall? Looking through the code, I see that an lm can be supplied to OnlineRecognizer.from_transducer. What kind of LM should this be and how should it be obtained?
  • I see an interesting example in the sherpa examples regarding two-pass transcription in two-pass-speech-recognition-from-microphone.py. Can you say anything about how well this works and what improvement you typically see from the second pass compared to just the streaming model? In this case you'll need both a streaming and a non-streaming model. Is there a way to be smart about training and share some parameters, or do you just have to do two separate training runs?

Thanks again 🙏

@marcoyang1998 (Collaborator)

We support RNNLM rescoring in sherpa-onnx (see k2-fsa/sherpa-onnx#125).
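
For reference, a minimal sketch of what this might look like with the sherpa-onnx Python API. The exact parameter names for the LM (lm / lm_scale) and the expected model format should be double-checked against the sherpa-onnx code and the PR above, and all file paths here are placeholders:

```python
# Sketch of supplying an (ONNX-exported) RNN LM to sherpa-onnx's streaming
# recognizer for rescoring during modified_beam_search. The LM parameter names
# and all file paths are assumptions to verify against the sherpa-onnx docs.
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt",
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    decoding_method="modified_beam_search",  # LM rescoring applies to beam search, not greedy search
    lm="rnn_lm.onnx",                        # RNN LM trained in icefall and exported to ONNX
    lm_scale=0.5,                            # tuned on a dev set
)

stream = recognizer.create_stream()
# stream.accept_waveform(sample_rate, samples)   # feed float32 samples as they arrive
# while recognizer.is_ready(stream):
#     recognizer.decode_stream(stream)
# print(recognizer.get_result(stream))
```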

Fangjun @csukuangfj is more familiar with Sherpa-related questions; could you have a look?

@csukuangfj (Collaborator)

The two-pass example consists of two models:

  • a streaming model, used in the first pass
  • a non-streaming model, used in the second pass

Usually, the first-pass model is very small and runs very fast. It has two goals:

  • give recognition results to the user as quickly as possible, even though these results may not be very accurate
  • function as an endpointer

As soon as the first-pass model detects an endpoint, the segment between two endpoints is sent to the second-pass model for recognition. The final result is determined by the second-pass model, which is more accurate.


To give you a concrete example, consider the Whisper model, which does not support streaming speech recognition.
You can use the Whisper model as the second-pass model and a very small streaming model as the first-pass model.
In this way, you can use the Whisper model in a streaming fashion.
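
To make the control flow concrete, here is a conceptual sketch of the two-pass loop described above. The first_pass and second_pass objects are hypothetical stand-ins for a small streaming recognizer (which also does the endpointing) and a large offline recognizer; see two-pass-speech-recognition-from-microphone.py for the real implementation:

```python
# Conceptual sketch of two-pass recognition. first_pass / second_pass are
# hypothetical stand-ins for a small streaming model (also used as an endpointer)
# and a large offline model; see the sherpa example script for the actual code.
from typing import Iterable, List

def two_pass_recognize(audio_chunks: Iterable, first_pass, second_pass) -> List[str]:
    buffered = []   # audio collected since the last endpoint
    finals = []     # second-pass results, one per segment
    for chunk in audio_chunks:
        buffered.append(chunk)
        partial = first_pass.decode_chunk(chunk)        # fast, shown to the user right away
        print("partial:", partial)
        if first_pass.is_endpoint():                    # endpoint detected by the first pass
            finals.append(second_pass.decode(buffered)) # slower but more accurate
            buffered = []
            first_pass.reset()
    return finals
```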

@donn-omar (Author)

Thanks guys, that's really helpful. I'll let you know if anything else comes up :)

@JinZr closed this as completed Mar 22, 2024