Need ability to send multiple files in one go #915
Since #856 got merged, I was wondering if we could have support for sending multiple files in one go into faster-whisper, something like a transcribe call that accepts a list of files.

This would help use cases where you have a lot of small files. I have a use case where I want to transcribe multiple files of up to 30 s of audio each (they will never be longer than 30 s), so I was wondering if I could stitch them together and pass them in as one file to BatchedInferencePipeline. In my limited tests this seems to work, but will each segment always be exactly 30 s? Basically, if I pad my audio to exactly 30 s, can I be guaranteed that each segment corresponds to one audio file and that no segment will contain any transcription from two different audios? Thank you for all your work!
@Jiltseb
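A minimal sketch of the stitch-and-pad idea described above. decode_audio, WhisperModel, and BatchedInferencePipeline are real faster-whisper names, but the padding logic and the clip-to-segment mapping are assumptions on my part, not guaranteed behavior:

```python
# Hypothetical sketch: zero-pad each clip to exactly 30 s and stitch them
# into one array, so clip i occupies the window [i * 30 s, (i + 1) * 30 s).
import numpy as np
from faster_whisper import WhisperModel, BatchedInferencePipeline
from faster_whisper.audio import decode_audio

SAMPLE_RATE = 16000
WINDOW = 30 * SAMPLE_RATE  # pad every clip to exactly 30 s of samples

def stitch(paths):
    """Zero-pad each clip to 30 s and concatenate into one array."""
    padded = []
    for path in paths:
        audio = decode_audio(path, sampling_rate=SAMPLE_RATE)
        if len(audio) > WINDOW:
            raise ValueError(f"{path} is longer than 30 s")
        padded.append(np.pad(audio, (0, WINDOW - len(audio))))
    return np.concatenate(padded)

model = WhisperModel("large-v3", device="cuda")
pipeline = BatchedInferencePipeline(model=model)

audio = stitch(["a.wav", "b.wav", "c.wav"])
segments, info = pipeline.transcribe(audio, batch_size=8)
for segment in segments:
    # Assumption: since each clip sits in its own 30 s slot, the source
    # clip can be recovered from the segment's start time.
    clip_index = int(segment.start // 30)
    print(clip_index, segment.start, segment.end, segment.text)
```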
Yes, I agree that the ability to send multiple files at once would be awesome, and it's on the TODO list. Basically, we need some additional bookkeeping. For example: in the example above, the second and third entries together are less than 30 sec, but they are split across two dictionaries, making sure each is processed separately in parallel.
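A hypothetical illustration of the bookkeeping being described, using the 13 s and 16 s segment lengths mentioned later in the thread; the dict shape mirrors faster-whisper's VAD output, but the exact format is version dependent:

```python
# Illustration only: vad segments for three stitched 30 s slots.
# Times are in seconds, offset by each clip's slot in the stitched file.
vad_segments = [
    {"start": 0.0, "end": 28.0},   # file 1
    {"start": 30.0, "end": 43.0},  # file 2: 13 s of speech
    {"start": 60.0, "end": 76.0},  # file 3: 16 s of speech
]
# 13 s + 16 s < 30 s, so a naive packer could merge files 2 and 3 into one
# decoding window; keeping them in separate dictionaries is the bookkeeping
# that guarantees each file is decoded as its own batch item.
```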
Oh great! Thank you for your reply!
In this example the second segment is 13 s and the third is 16 s. So if I provide the vad segments myself, I am guessing VAD will not run? That means I can't just combine the audio chunks; I have to run them through VAD, get the speech segments, and then send those in, right? My point is that if the second segment is 13 s but contains 10 s of silence at the end, it can cause Whisper to hallucinate, and since I am manually sending VAD segments, VAD will be skipped in faster-whisper? So my flow should be: run VAD on each chunk myself, collect the speech segments, and then pass both the stitched audio and those segments in.
Right? Thank you for all your help!
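A sketch of that flow, assuming faster_whisper.vad's get_speech_timestamps and VadOptions. Their output is in samples, and the exact vad segments format the batched pipeline expects is version dependent, so treat this as pseudocode to adapt rather than a drop-in recipe:

```python
# Run Silero VAD on each clip first, then shift the resulting speech
# segments by the clip's offset inside the stitched file, so trailing
# silence never reaches the decoder.
from faster_whisper.audio import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

SAMPLE_RATE = 16000

def speech_segments(path, offset_s):
    """Return speech (start, end) times in seconds, shifted by the clip's
    offset inside the stitched file."""
    audio = decode_audio(path, sampling_rate=SAMPLE_RATE)
    timestamps = get_speech_timestamps(audio, VadOptions())
    return [
        {
            "start": offset_s + ts["start"] / SAMPLE_RATE,
            "end": offset_s + ts["end"] / SAMPLE_RATE,
        }
        for ts in timestamps
    ]

# Each input clip was padded to a 30 s slot, so clip i starts at i * 30 s.
all_segments = []
for i, path in enumerate(["a.wav", "b.wav", "c.wav"]):
    all_segments.extend(speech_segments(path, offset_s=i * 30))
```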
If you already provide the vad segments, VAD will not run; in that case the segments used will be exactly the ones you pass in. It is a bit of a hacky implementation that doesn't utilize the GPU fully, but once we have multiple files as input it should be easier for you.
Hey @Jiltseb, thank you for the detailed reply! Also, batched_model with batch_size = 1 seems to give much more consistent performance than model.transcribe. Why is that? model.transcribe sometimes spikes to 1 s to process 30 s of audio on my L40S, while batched_model with batch_size = 1 always takes around 270 ms. I am curious: are there other performance improvements in batched_model?
It looks like #919 is related to this. There are several reasons for it.
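For anyone wanting to reproduce the comparison above, a rough timing harness; this is my own construction, not from the thread. Note that both transcribe calls return lazy generators, so they must be exhausted for decoding to actually run:

```python
# Time model.transcribe vs. BatchedInferencePipeline with batch_size=1
# on the same 30 s clip, repeating to surface the spikes described above.
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda")
batched = BatchedInferencePipeline(model=model)

def timed(fn):
    start = time.perf_counter()
    segments, _ = fn()
    list(segments)  # exhaust the generator so decoding actually runs
    return time.perf_counter() - start

for _ in range(5):
    t_seq = timed(lambda: model.transcribe("clip_30s.wav"))
    t_bat = timed(lambda: batched.transcribe("clip_30s.wav", batch_size=1))
    print(f"sequential: {t_seq * 1000:.0f} ms  batched: {t_bat * 1000:.0f} ms")
```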
Hey @Jiltseb, if I were to try to open a PR to add the ability to send multiple files, how would I go about it? Can you give me a rough guide?
Have a look at WhisperS2T: https://github.com/shashikg/WhisperS2T. It provides support for multiple files.
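For reference, WhisperS2T's multi-file call looks roughly like this; argument names are taken from its README and may have changed, so verify against the current version:

```python
# Rough shape of WhisperS2T's multi-file API: one entry per input file
# for languages, tasks, and prompts, and one result per file in order.
import whisper_s2t

model = whisper_s2t.load_model(
    model_identifier="large-v2", backend="CTranslate2"
)

files = ["a.wav", "b.wav", "c.wav"]
out = model.transcribe_with_vad(
    files,
    lang_codes=["en"] * len(files),
    tasks=["transcribe"] * len(files),
    initial_prompts=[None] * len(files),
    batch_size=16,
)
```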