
Transcribing long audio files using Zipformer #923

Open
maltium opened this issue Feb 22, 2023 · 9 comments


@maltium

maltium commented Feb 22, 2023

If I have long audio files that I may want to transcribe using a Zipformer, then, short of using a VAD to chop the audio into smaller ~30 sec pieces and transcribing them individually, I understand that the only option is to use the streaming version of the model. Is this correct?

Unfortunately, the streaming Zipformer takes a significant hit in performance even with a 640 ms chunk. Are there things I can do to bring its performance close to the non-streaming version, even at the cost of higher latency? I don't care about latency.

@csukuangfj
Collaborator

Have you tried the non-streaming zipformer?

@csukuangfj
Collaborator

You can use https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
to transcribe your long audio files and see if it works.

[Screenshot of the Hugging Face automatic-speech-recognition space]

@maltium
Author

maltium commented Feb 23, 2023

@csukuangfj As you probably know, the memory requirements for transcribing long audio files with the offline recognizer are enormous; for example, a 14-minute audio file requires more than 100 GB of RAM. The Hugging Face interface fails with an error if you attempt a longer audio file.

This is why I'm exploring how the streaming Zipformer can be modified for better performance at the cost of latency (with the added bonus of built-in endpointing).

@AdolfVonKleist

@maltium I'm curious: how significant is the performance hit? What are the characteristics of the training corpus in terms of hours, acoustic conditions, etc., and do they match your test data?

@danpovey
Collaborator

It might not really be necessary to use a VAD; you could just chop the data up evenly with a small overlap, e.g. 30 sec chunks with 1 sec overlap, and then splice the output together, using the times associated with the tokens to split in the middle of the overlapped region. I am asking the guys whether we already have public scripts for (i) getting the times of symbols in RNN-T decoding, and/or (ii) the whole pipeline of decoding in chunks and then splicing the transcript back together.
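
A minimal sketch of that chunk-and-splice idea is below. The `transcribe_chunk` helper is hypothetical and stands in for whatever offline decoder you use; it is assumed to return (token, time) pairs with times relative to the start of the chunk it was given.

```python
# Sketch of offline decoding in fixed-size chunks with a small overlap,
# then splicing the per-chunk tokens together at the middle of each
# overlapped region using the token times.

def transcribe_long(samples, sample_rate, transcribe_chunk,
                    chunk_sec=30.0, overlap_sec=1.0):
    """`transcribe_chunk(chunk, sample_rate)` is a hypothetical callable that
    decodes one chunk and returns a list of (token, time_sec) pairs, with
    times relative to the start of that chunk."""
    step = chunk_sec - overlap_sec
    total_sec = len(samples) / sample_rate

    per_chunk = []  # (chunk_start_sec, [(token, absolute_time_sec), ...])
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        toks = [(tok, start + t) for tok, t in transcribe_chunk(chunk, sample_rate)]
        per_chunk.append((start, toks))
        if end >= total_sec:
            break
        start += step

    # Keep tokens from each chunk only up to / from the midpoint of the
    # overlap it shares with its neighbours.
    merged = []
    for i, (start, toks) in enumerate(per_chunk):
        lo = 0.0 if i == 0 else start + overlap_sec / 2
        hi = total_sec if i == len(per_chunk) - 1 else start + step + overlap_sec / 2
        merged.extend(tok for tok, t in toks if lo <= t < hi)

    # Assuming sentencepiece-style BPE tokens ("▁" marks a word boundary).
    return "".join(merged).replace("▁", " ").strip()
```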

@maltium
Author

maltium commented Feb 24, 2023

@danpovey your suggestion works quite well with a larger overlap; with just a 1 sec overlap I often get some garbage, though.

Regarding timings, I'm getting them through sherpa, but I notice the timestamp of the first token in the clip is always 0.0, even if it's actually uttered several seconds into the clip. Additionally, in Kaldi you could get the start and end timestamps for each token, which doesn't seem to be the case here.

@danpovey
Collaborator

Thanks!
Regarding the first-timestamp problem, we are trying to find a way to get rid of it; it seems not to be a bug but something the model decides to do for some reason. We can't have both begin and end timestamps because of how RNN-T works: each token is emitted at a single frame, so there is no separate end time for it.

@AdolfVonKleist

AdolfVonKleist commented Mar 20, 2023

Has anyone tried significantly increasing --decode-chunk-len for this purpose? We've observed continued, fairly substantial gains with pruned_transducer_stateless7_streaming when applying it to longer audio in an offline context.

In our case, this was using a roughly 3.5K hr dataset with challenging acoustic conditions. Training was run with the same default values used in existing examples for pruned_transducer_stateless7_streaming, but during decoding (greedy_search) we looked at the impact of increasing --decode-chunk-len up to 1024. Beyond 1024 the %WER started to rise again:

--decode-chunk-len   %WER    ERR / TOT        INS    DEL     SUB
               128   12.95   45786 / 353488   8990   14977   21819
               256    9.70   34272 / 353488   7052   10469   16751
               512    7.40   26168 / 353488   5058    7460   13650
              1024    6.77   23931 / 353488   4508    6649   12774

Here, at 1024 the %WER is close to our best offline result, but we can decode 2+ hour recordings without any problem; the session uses < 5 GB of RAM, with a steady RTF of around 0.06. Obviously this is not particularly useful for live streaming, but for long-audio or conversation processing it seems to strike a nice balance, and it obviates the need for any additional pre- or post-processing.

It would be interesting to see if this is dataset dependent for others, but we have since confirmed it in two other languages with different sized data sets.
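
For reference, the decoding invocation would look roughly like the following (sketched from the icefall pruned_transducer_stateless7_streaming recipe; the epoch/avg values and paths are placeholders for your own experiment, and only --decode-chunk-len differs from the defaults):

```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --decoding-method greedy_search \
  --decode-chunk-len 1024 \
  --max-duration 600
```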
