
Stream audio chunk by chunk to Whisper #261

Closed
mat-hek opened this issue Oct 9, 2023 · 22 comments · Fixed by #361
Labels
kind:chore Internal improvements

Comments

@mat-hek commented Oct 9, 2023

Hey, it's already possible to make the serving provide data to the model in chunks; however, it seems that the whole audio still has to be available at once, which is impossible for live streaming. Would it be possible to support streaming the audio to the serving chunk by chunk?

@josevalim (Contributor)

We do support streaming in the latest Bumblebee, but we only support streaming files. We would need to improve the API so you can pass your own stream to Whisper and we then transform it. :) So it is not possible yet, but we may be 90% there.

@mat-hek (Author) commented Oct 9, 2023

Hmm, correct me if I'm wrong, but it seems you load the entire file into memory upfront here :D

@josevalim (Contributor)

We do, but that could be worked out by doing multiple ffmpeg calls. My point is that the complexity now is in the stream composition/processing, not in Nx/Axon/etc. And the former is much easier!

@mat-hek (Author) commented Oct 9, 2023

So it's just about rewriting client_preprocessing here from Enum to Stream? How would batching be handled then?

@josevalim (Contributor)

The serving does the batching, although ideally you want to chunk the stream to match the server batch size too.

@jonatanklosko (Member) commented Oct 9, 2023

Yeah, reading chunks separately from disk is on my list, but it's just an optimisation so we released without it.

As for accepting a stream as serving input, it's a bit different, but definitely doable. Note that for large audio when we split into multiple chunks, we make the chunks overlap. So for a 50s audio we would transcribe like 0-30 and 20-50, this way each transcription has some context and we merge the overlaps accordingly. So if we are given a stream, we need to accumulate until the right size and emit overlapping chunks.
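
For illustration, a rough sketch of that accumulation step (the OverlappingChunks module and the byte sizes are made up here, not part of Bumblebee), assuming the input is a stream of raw PCM binaries where chunk_bytes/overlap_bytes correspond to e.g. 30s and 10s of audio:

```elixir
defmodule OverlappingChunks do
  # Re-chunk a stream of PCM binaries into fixed-size chunks that overlap,
  # e.g. 30s chunks with a 10s overlap -> 0-30s, 20-50s, 40-70s, ...
  def stream(pcm_stream, chunk_bytes, overlap_bytes) do
    Stream.transform(pcm_stream, <<>>, fn bin, buffer ->
      emit(buffer <> bin, chunk_bytes, overlap_bytes, [])
    end)
  end

  defp emit(buffer, chunk_bytes, overlap_bytes, acc) when byte_size(buffer) >= chunk_bytes do
    <<chunk::binary-size(chunk_bytes), _::binary>> = buffer
    # keep the trailing overlap so the next chunk shares context with this one
    keep_from = chunk_bytes - overlap_bytes
    rest = binary_part(buffer, keep_from, byte_size(buffer) - keep_from)
    emit(rest, chunk_bytes, overlap_bytes, [chunk | acc])
  end

  defp emit(buffer, _chunk_bytes, _overlap_bytes, acc), do: {Enum.reverse(acc), buffer}
end
```

(At the end of the stream the leftover buffer would still need to be flushed as a final, shorter chunk.)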

@mat-hek (Author) commented Oct 9, 2023

> this way each transcription has some context and we merge the overlaps accordingly

Yeah, that's actually the reason I'd like to stream to the serving instead of running it for each chunk (as I do now). To make it 'live', I'd need chunks that are at most a few seconds long, but from the docs I see that the default is 5 seconds ;) Hmm, I don't know where I found these 5 seconds; it seems it's just what I set 🤔

@jonatanklosko (Member)

> Hmm, I don't know where I found these 5 seconds; it seems it's just what I set

The default context is 1/6 of the chunk length; for Whisper the chunk is 30s, so the context is 5s (on both sides, so it's a 10s overlap).

I'm not sure if we can reasonably handle arbitrarily small chunks (especially as we do context, because then the context would be very small). So I would imagine we accumulate first 30s, then next 20s, next 20s.

@mat-hek (Author) commented Oct 10, 2023

Small chunks still work pretty well IMO; check Lars's talk, where he has a live transcription on the slides. From my experience, the accuracy drops for sentences longer than a chunk, so I guess that context could help here. We can actually provide a lot of 'previous' / 'left side' context without sacrificing latency. The right-side context would impact latency, but maybe even 1 or 2 seconds could help, as we wouldn't break words apart.

@jonatanklosko (Member)

> I would imagine we accumulate first 30s, then next 20s, next 20s.

Ah, we should accumulate whatever the chunk_length is, so yeah, it could just as well be smaller.

@josevalim (Contributor)

Yeah, we can probably transform the stream to either split or accumulate to the batch size. We can also just do nothing and tell the user that whatever audio size they pass will be sent as is, so the buffering is on them. The latter is the most flexible and likely the simplest too.

@jonastemplestein

Amazing! I have a little toy project that could really use this (literally, a toy for my daughter that she can speak to).

For my use case, it is important to minimise the latency after somebody has finished speaking.

Once I detect silence on my end, I'd like to tell Bumblebee to "force a chunk", even if it's only been a short time since the last chunk was transcribed.

It would also be really useful to send not just the transcribed words to the caller, but also whether or not those words have been "confirmed" by later context (or perhaps the "confidence" in the transcribed words). Whisper is quite good at creating a best-guess transcription from a short chunk, and often that is good enough to use speculatively.

For example, in the context of my voice agent, I might detect silence, force Whisper to do what I assume to be a final chunk, and send the resulting preliminary transcription onwards to an LLM. But it may turn out the speaker was just making a short pause and resumes speaking. I'd then keep transcribing, and if that further context changes the words I already sent to the LLM, I'll abort the LLM call (provided it hasn't been read to the user yet) and redo it with the new, more correct transcription.

For this to work well, it's valuable to think about how transcripts from overlapping chunks are merged (and how the chunk boundaries are chosen). A good example in the Python ecosystem is here: https://github.com/ufal/whisper_streaming

Lots of companies are trying to build low-latency voice agents at the moment, and I think Elixir would be a great choice for building them if it had a great realtime transcription implementation. Ideally this would eventually include word-level timestamps and multi-speaker diarization. @jonatanklosko do you know of any efforts in the Elixir community to do this?

BTW, regarding the multiple ffmpeg calls per chunk: I think you can probably have a single ffmpeg process that you stream in and out of using stdin and stdout. That would also slightly reduce the latency cost of "booting" an ffmpeg process for each chunk.
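
For what it's worth, a rough sketch of that idea using an Erlang port (the FfmpegPipe module is hypothetical, not anything Bumblebee or ffmpeg provides), letting ffmpeg probe the input format and emit 16kHz mono f32 PCM:

```elixir
defmodule FfmpegPipe do
  # One long-lived ffmpeg process that reads encoded audio on stdin and writes
  # 16kHz mono f32 PCM on stdout, instead of booting ffmpeg once per chunk.
  def open do
    ffmpeg = System.find_executable("ffmpeg")

    Port.open({:spawn_executable, ffmpeg}, [
      :binary,
      :exit_status,
      args: ~w(-loglevel quiet -i pipe:0 -ac 1 -ar 16000 -f f32le pipe:1)
    ])
  end

  # Push an encoded chunk into ffmpeg's stdin.
  def push(port, encoded_chunk), do: Port.command(port, encoded_chunk)

  # Pull whatever decoded PCM ffmpeg has produced so far.
  def pull(port, timeout \\ 100) do
    receive do
      {^port, {:data, pcm}} -> {:ok, pcm}
    after
      timeout -> :empty
    end
  end
end
```

Note that ffmpeg may buffer output internally, so the decoded PCM can lag slightly behind what has been pushed in.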

@josevalim (Contributor)

If we stream, we will likely expect PCM chunks, so the ffmpeg conversion would be up to you (which you can do with a live process or even a NIF). @mat-hek and the Membrane folks will likely have better ideas here.

@mat-hek (Author) commented Oct 20, 2023

> If we stream, we will likely expect PCM chunks

Seems very reasonable

> conversion would be up to you

You can use Membrane for that too 😄 here's a PR with a Livebook example: membraneframework/membrane_demo#249

@lawik commented Oct 24, 2023

Is there a difference between it accepting a real stream and repeatedly calling it with the chunk size you want processed?

I guess the current functionality for improving the edges of chunks with overlap and so on suffers when I just send it exactly-sized chunks?

@mat-hek, as you would know, it is not particularly hard to get an appropriate slice of PCM to send out of Membrane :D.

@jonatanklosko (Member)

@lawik the idea is that we get a stream of continuous chunks, but we would still do the overlapping as part of preprocessing and then merge the overlaps in postprocessing to improve the output.

@lawik commented Oct 24, 2023

Awesome!

@linusdm (Contributor) commented Oct 25, 2023

Is this discussion targeted at enabling Whisper specifically? Or will these improvements also allow other, more general audio processing models (e.g. audio classification models) to benefit from this streaming solution?

@jonatanklosko (Member)

@linusdm Whisper is currently the only audio model we support. I'm not sure how relevant input streaming is for classification models, since they predict a single label rather than a streamed transcription.

@jonatanklosko (Member) commented Mar 11, 2024

#361 enables input streaming.

Thinking more about this, I'm not entirely sure if the context overlapping algorithm is going to be very effective with small chunks (as needed for live transcription). The way the algorithm works is that we transcribe two consecutive, overlapping chunks of audio, and they should result in two sentences that overlap to some extent at the edges. Then we merge the overlaps to hopefully get the right transcription from the left chunk and from the right chunk. The issue with small chunks is that the sentences are short and there may be very few, if any, overlapping words. Also note that this means an additional delay, because in order to finish a chunk, we need the transcription from the subsequent overlapping chunk.
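
Just to illustrate the merging idea (this is not the actual Bumblebee implementation), a naive version could drop the longest word-level overlap between the two transcriptions:

```elixir
defmodule OverlapMerge do
  # Naive merge of two transcriptions from overlapping chunks: find the longest
  # suffix of `left` that equals a prefix of `right` and drop it from `right`.
  def merge(left, right) do
    left_words = String.split(left)
    right_words = String.split(right)

    overlap =
      Enum.find(length(left_words)..0//-1, 0, fn n ->
        Enum.take(left_words, -n) == Enum.take(right_words, n)
      end)

    Enum.join(left_words ++ Enum.drop(right_words, overlap), " ")
  end
end

OverlapMerge.merge("the quick brown fox", "brown fox jumps over")
#=> "the quick brown fox jumps over"
```

With very short chunks the two transcriptions may share no words at all, in which case this degrades to plain concatenation.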

So for short chunks it may be better to not use the overlapping chunking and have some other logic, such as splitting input at low amplitude points to avoid cutting mid-word.
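
Something along these lines, for instance (a purely hypothetical SilenceSplit module, assuming 16 kHz mono samples in an Nx tensor and 25 ms frames), could pick a quiet point to cut at:

```elixir
defmodule SilenceSplit do
  @frame_size 400  # 25 ms at 16 kHz

  # Return the sample index of the quietest frame in the window, as a rough
  # "safe" place to cut the audio without slicing through a word.
  def split_index(samples) do
    n_frames = div(Nx.size(samples), @frame_size)

    quietest_frame =
      samples
      |> Nx.slice([0], [n_frames * @frame_size])
      |> Nx.abs()
      |> Nx.reshape({n_frames, @frame_size})
      |> Nx.mean(axes: [1])
      |> Nx.argmin()
      |> Nx.to_number()

    quietest_frame * @frame_size
  end
end
```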

These are just high-level thoughts though!

@samrat commented Mar 13, 2024

Hello,

I'm trying to use this in a Livebook using kino_live_audio: https://gist.github.com/samrat/fc5792bfc870ad887f29d4a944cafd7d . I'm passing a Stream to the serving, but I'm not seeing any output. Could you help me figure out what I'm doing wrong?

@jonatanklosko (Member) commented Mar 13, 2024

@samrat the main issue is that you are doing Enum.map instead of Stream.map, so the stream starts being consumed at that point and blocks further execution :) Here's a more minimal example:

.livemd
<!-- livebook:{"app_settings":{"access_type":"public","output_type":"rich","show_source":true,"slug":"vad"}} -->

# Streaming whisper

```elixir
Mix.install(
  [
    {:kino_live_audio, "~> 0.1"},
    {:nx, "~> 0.7.1"},
    {:bumblebee, github: "elixir-nx/bumblebee"},
    {:exla, ">= 0.0.0"},
    {:kino, github: "livebook-dev/kino", override: true}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
```

## Section

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 1],
    chunk_num_seconds: 6,
    context_num_seconds: 2,
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, serving: serving, name: WhisperServing})
```

```elixir
live_audio = KinoLiveAudio.new(chunk_size: 1, unit: :s, sample_rate: featurizer.sampling_rate)
```

```elixir
audio_stream =
  live_audio
  |> Kino.Control.stream()
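  # each audio event carries a list of raw samples; convert it to a mono Nx tensor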
  |> Stream.map(fn %{chunk: data} ->
    Nx.tensor(data)
    |> Nx.stack()
    |> Nx.reshape({:auto, 1})
    |> Nx.mean(axes: [1])
  end)

frame = Kino.Frame.new() |> Kino.render()

for chunk <- Nx.Serving.batched_run(WhisperServing, audio_stream) do
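  # with stream: true, results arrive incrementally as each transcribed chunk is ready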
  Kino.Frame.append(frame, Kino.Text.new(chunk.text, chunk: true))
end
```

Sidenote: if you look at the console logs and the chunks are not being produced, it may be because the page was denied microphone access.
