Transcribing long audio files using Zipformer #923
Comments
Have you tried the non-streaming zipformer? |
You can use https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition |
@csukuangfj As you probably know, the memory requirements for transcribing long audio files with the offline recognizer are enormous - for example, a 14-minute audio file requires more than 100 GB of RAM. The Hugging Face interface fails with an error if you attempt a longer audio file. That is why I'm exploring how the streaming Zipformer can be modified for better performance at the cost of latency (with the added bonus of built-in endpointing). |
@maltium I'm curious: how significant is the performance hit? What are the characteristics of the training corpus in terms of hours, acoustic conditions, etc., and do they match your test data? |
It might not really be necessary to use VAD; you could just chop the data up evenly with a small overlap, like 30 sec chunks with 1 sec overlap, and then splice the output together using the times associated with the tokens to split in the middle of the overlapped region. I am asking the guys whether we already have scripts that are public for (i) getting the times of symbols in RNN-T decoding, and/or (ii) the whole pipeline of decoding in chunks and then splicing the transcript back together. |
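A rough sketch of this chunk-and-splice idea, assuming a hypothetical transcribe() wrapper around an offline recognizer that returns (token, time-in-seconds) pairs for a chunk of samples; the chunk and overlap sizes are just the values suggested above, not values from any existing script.

```python
SAMPLE_RATE = 16000
CHUNK_SEC = 30.0
OVERLAP_SEC = 1.0


def transcribe_long(samples, transcribe):
    """Decode `samples` in fixed-size overlapping chunks and splice the output."""
    chunk = int(CHUNK_SEC * SAMPLE_RATE)
    overlap = int(OVERLAP_SEC * SAMPLE_RATE)
    step = chunk - overlap

    pieces = []  # one list of (token, absolute_time) per chunk
    for start in range(0, len(samples), step):
        seg = samples[start : start + chunk]
        offset = start / SAMPLE_RATE
        # Token times come back relative to the chunk, so shift them to
        # absolute positions in the original file.
        pieces.append([(tok, t + offset) for tok, t in transcribe(seg)])
        if start + chunk >= len(samples):
            break

    # Splice: cut each pair of neighbouring chunks in the middle of their
    # overlapped region and keep every token on its "owning" side.
    spliced = []
    for i, piece in enumerate(pieces):
        lo = 0.0 if i == 0 else (i * step + overlap / 2) / SAMPLE_RATE
        hi = (
            float("inf")
            if i == len(pieces) - 1
            else ((i + 1) * step + overlap / 2) / SAMPLE_RATE
        )
        spliced.extend(tok for tok, t in piece if lo <= t < hi)
    return spliced
```

The splice rule only needs a per-token start time: any token whose time falls past the midpoint of an overlap is taken from the later chunk, which keeps exactly one copy of the doubly decoded region.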
For RNN-T models, we have supported getting timestamps for each word in some recipes. |
@danpovey your suggestion works quite well with a larger overlap - with just a 1 sec overlap I often get some garbage, though. Regarding timings, I'm getting them through sherpa, but I notice the timestamp of the first token in the clip is always 0.0, even if it's actually uttered seconds into the audio clip. Additionally, in Kaldi you could get the start and end timestamp for each token, which doesn't seem to be the case here. |
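A minimal sketch of one workaround when a decoder only reports a start time per token (relative to the chunk): approximate each token's end with the next token's start, and shift everything by the chunk's offset in the original file. The function name and arguments are illustrative, not an existing sherpa API.

```python
def token_intervals(tokens, token_starts, chunk_offset, chunk_duration):
    """Approximate (token, start, end) triples from per-token start times only."""
    intervals = []
    for i, (tok, start) in enumerate(zip(tokens, token_starts)):
        # Use the next token's start time as this token's end time; fall back
        # to the chunk duration for the last token.
        end = token_starts[i + 1] if i + 1 < len(token_starts) else chunk_duration
        intervals.append((tok, chunk_offset + start, chunk_offset + end))
    return intervals
```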
Thanks! |
Has anyone tried significantly increasing the […]? In our case, this was with a roughly 3.5K-hour dataset with challenging acoustic conditions. Training was run with the same default values used in the existing examples for […].
Here, at […]. It would be interesting to see whether this is dataset-dependent for others, but we have since confirmed it in two other languages with differently sized datasets.
If I have long audio files that I potentially want to transcribe using a Zipformer, short of using a VAD to chop the audio file into smaller 30 sec pieces and transcribing them individually, I understand that the only option is to use the streaming version of the model. Is this correct?
Unfortunately, the streaming Zipformer takes a significant performance hit even with a 640 ms chunk. Are there things I can do to bring its performance closer to the non-streaming version, even at the cost of higher latency, since I don't care about that?
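For illustration only, a crude energy-based segmenter for the "VAD + short pieces" option mentioned above, capping each piece at roughly 30 s. A real setup would more likely use a proper VAD model, the thresholds here are arbitrary, and each returned piece would then be passed to the offline recognizer on its own; `samples` is assumed to be a float NumPy array.

```python
import numpy as np


def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     energy_thresh=1e-4, min_piece_sec=5.0, max_piece_sec=30.0):
    """Return (begin, end) sample indices of pieces split at low-energy frames."""
    frame = int(sample_rate * frame_ms / 1000)
    min_len = int(sample_rate * min_piece_sec)
    max_len = int(sample_rate * max_piece_sec)
    pieces, start, pos = [], 0, 0
    while pos + frame <= len(samples):
        length = pos + frame - start
        # Treat a low-energy frame as a candidate cut point, but only cut once
        # the current piece is long enough (or has hit the hard length cap).
        silent = float(np.mean(samples[pos:pos + frame] ** 2)) < energy_thresh
        if (silent and length >= min_len) or length >= max_len:
            pieces.append((start, pos + frame))
            start = pos + frame
        pos += frame
    if start < len(samples):
        pieces.append((start, len(samples)))
    return pieces
```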