
Transcribing long audio files using Zipformer #923

Open
maltium opened this issue Feb 22, 2023 · 9 comments


@maltium

maltium commented Feb 22, 2023

If I have long audio files that I may want to transcribe using a Zipformer, then, short of using a VAD to chop the audio into smaller ~30 sec pieces and transcribing them individually, I understand that the only option is to use the streaming version of the model. Is this correct?

Unfortunately, the streaming Zipformer takes a significant hit in performance even with a 640 ms chunk. Are there things I can do to bring its performance close to the non-streaming version, even at the cost of higher latency? I don't care about latency.

@csukuangfj
Collaborator

Have you tried the non-streaming zipformer?

@csukuangfj
Collaborator

You can use https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
to transcribe your long audio files and see if it works.

[Screenshot of the Hugging Face automatic-speech-recognition space]

@maltium
Author

maltium commented Feb 23, 2023

@csukuangfj As you probably know, the memory requirements for transcribing long audio files with the offline recognizer are enormous; for example, a 14-minute audio file requires more than 100 GB of RAM. The Hugging Face interface fails with an error if you attempt a longer audio file.

This is why I'm exploring how the streaming Zipformer can be modified for better performance at the cost of latency (with the added bonus of built-in endpointing).

@AdolfVonKleist

@maltium I'm curious: how significant is the performance hit? What are the characteristics of the training corpus in terms of hours, acoustic conditions, etc., and do they match your test data?

@danpovey
Collaborator

It might not really be necessary to use a VAD; you could just chop the data up evenly with a small overlap, e.g. 30 sec chunks with 1 sec overlap, and then splice the output together, using the times associated with the tokens to split in the middle of the overlapped region. I am asking the guys whether we already have public scripts for (i) getting the times of symbols in RNN-T decoding, and/or (ii) the whole pipeline of decoding in chunks and then splicing the transcript back together.
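
A minimal sketch of that chunk-and-splice idea is below. The `transcribe_chunk` helper is hypothetical and stands in for whatever offline decoder you use; it is assumed to return (token, time) pairs with times relative to the start of the chunk it was given.

```python
# Sketch of offline decoding in fixed-size chunks with a small overlap,
# then splicing the per-chunk tokens together at the middle of each
# overlapped region using the token times.

def transcribe_long(samples, sample_rate, transcribe_chunk,
                    chunk_sec=30.0, overlap_sec=1.0):
    """`transcribe_chunk(chunk, sample_rate)` is a hypothetical callable that
    decodes one chunk and returns a list of (token, time_sec) pairs, with
    times relative to the start of that chunk."""
    step = chunk_sec - overlap_sec
    total_sec = len(samples) / sample_rate

    per_chunk = []  # (chunk_start_sec, [(token, absolute_time_sec), ...])
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        toks = [(tok, start + t) for tok, t in transcribe_chunk(chunk, sample_rate)]
        per_chunk.append((start, toks))
        if end >= total_sec:
            break
        start += step

    # Keep tokens from each chunk only up to / from the midpoint of the
    # overlap it shares with its neighbours.
    merged = []
    for i, (start, toks) in enumerate(per_chunk):
        lo = 0.0 if i == 0 else start + overlap_sec / 2
        hi = total_sec if i == len(per_chunk) - 1 else start + step + overlap_sec / 2
        merged.extend(tok for tok, t in toks if lo <= t < hi)

    # Assuming sentencepiece-style BPE tokens ("▁" marks a word boundary).
    return "".join(merged).replace("▁", " ").strip()
```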

@maltium
Author

maltium commented Feb 24, 2023

@danpovey your suggestion works quite well with a larger overlap; with just a 1 sec overlap I often get some garbage, though.

Regarding timings, I'm getting them through sherpa, but I notice the timestamp of the first token in the clip is always 0.0, even if it's actually uttered several seconds into the clip. Additionally, in Kaldi you could get the start and end timestamps for each token, which doesn't seem to be the case here.

@danpovey
Collaborator

Thanks!
Regarding the first-timestamp problem, we are trying to find a way to get rid of it; it seems not to be a bug but something the model decides to do for some reason. We can't have both begin and end timestamps because of how RNN-T works: each token is emitted at a single frame, so there is no separate end time for it.

@AdolfVonKleist

AdolfVonKleist commented Mar 20, 2023

Has anyone tried significantly increasing --decode-chunk-len for this purpose? We've observed continued, fairly substantial gains with pruned_transducer_stateless7_streaming when applying it to longer audio in an offline context.

In our case, this was using a roughly 3.5K hr dataset with challenging acoustic conditions. Training was run with the same default values used in existing examples for pruned_transducer_stateless7_streaming, but during decoding (greedy_search) we looked at the impact of increasing --decode-chunk-len up to 1024. Beyond 1024 the %WER started to rise again:

--decode-chunk-len   %WER    ERR / TOT        INS    DEL     SUB
               128   12.95   45786 / 353488   8990   14977   21819
               256    9.70   34272 / 353488   7052   10469   16751
               512    7.40   26168 / 353488   5058    7460   13650
              1024    6.77   23931 / 353488   4508    6649   12774

Here, at 1024 the %WER is close to our best offline result, but we can decode 2+ hour recordings without any problem; the session uses < 5 GB of RAM, with a steady RTF of around 0.06. Obviously this is not particularly useful for live streaming, but for long-audio or conversation processing it seems to strike a nice balance, and it obviates the need for any additional pre- or post-processing.

It would be interesting to see if this is dataset dependent for others, but we have since confirmed it in two other languages with different sized data sets.
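
For reference, the decoding invocation would look roughly like the following (sketched from the icefall pruned_transducer_stateless7_streaming recipe; the epoch/avg values and paths are placeholders for your own experiment, and only --decode-chunk-len differs from the defaults):

```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --decoding-method greedy_search \
  --decode-chunk-len 1024 \
  --max-duration 600
```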
