
Stream audio chunk by chunk to Whisper #261

Closed
mat-hek opened this issue Oct 9, 2023 · 22 comments · Fixed by #361
Labels
kind:chore Internal improvements

Comments

@mat-hek commented Oct 9, 2023

Hey, it's already possible to make the serving provide data to the model in chunks; however, it seems that the whole audio still has to be available at once, which is impossible for live streaming. Would it be possible to support streaming the audio to the serving chunk by chunk?

@josevalim (Contributor)

We do support streaming in the latest Bumblebee, but we only support streaming files. We would need to improve the API so you can pass your own stream to Whisper and we then transform it. :) So it is not possible yet, but we may be 90% there.

@mat-hek (Author) commented Oct 9, 2023

Hmm, correct me if I'm wrong, but it seems you load the entire file into memory upfront here :D

@josevalim (Contributor)

We do, but that could be worked out by doing multiple ffmpeg calls. My point is that the complexity now is in the stream composition/processing, not in Nx/Axon/etc. And the former is much easier!

@mat-hek (Author) commented Oct 9, 2023

So it's just about rewriting client_preprocessing here from Enum to Stream? How would batching be handled then?

@josevalim (Contributor)

The serving does the batching, although ideally you want to chunk the stream to match the server batch size too.

@jonatanklosko (Member) commented Oct 9, 2023

Yeah, reading chunks separately from disk is on my list, but it's just an optimisation so we released without it.

As for accepting a stream as serving input, it's a bit different, but definitely doable. Note that for large audio when we split into multiple chunks, we make the chunks overlap. So for a 50s audio we would transcribe like 0-30 and 20-50, this way each transcription has some context and we merge the overlaps accordingly. So if we are given a stream, we need to accumulate until the right size and emit overlapping chunks.
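
For illustration, a rough sketch of that accumulation step (the OverlappingChunks module and the byte sizes are made up here, not part of Bumblebee), assuming the input is a stream of raw PCM binaries where chunk_bytes/overlap_bytes correspond to e.g. 30s and 10s of audio:

```elixir
defmodule OverlappingChunks do
  # Re-chunk a stream of PCM binaries into fixed-size chunks that overlap,
  # e.g. 30s chunks with a 10s overlap -> 0-30s, 20-50s, 40-70s, ...
  def stream(pcm_stream, chunk_bytes, overlap_bytes) do
    Stream.transform(pcm_stream, <<>>, fn bin, buffer ->
      emit(buffer <> bin, chunk_bytes, overlap_bytes, [])
    end)
  end

  defp emit(buffer, chunk_bytes, overlap_bytes, acc) when byte_size(buffer) >= chunk_bytes do
    <<chunk::binary-size(chunk_bytes), _::binary>> = buffer
    # keep the trailing overlap so the next chunk shares context with this one
    keep_from = chunk_bytes - overlap_bytes
    rest = binary_part(buffer, keep_from, byte_size(buffer) - keep_from)
    emit(rest, chunk_bytes, overlap_bytes, [chunk | acc])
  end

  defp emit(buffer, _chunk_bytes, _overlap_bytes, acc), do: {Enum.reverse(acc), buffer}
end
```

(At the end of the stream the leftover buffer would still need to be flushed as a final, shorter chunk.)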

@mat-hek (Author) commented Oct 9, 2023

> this way each transcription has some context and we merge the overlaps accordingly

Yeah, that's actually the reason I'd like to stream to the serving instead of running it for each chunk (as I do now). To make it 'live', I'd need chunks that are at most a few seconds long, but from the docs I see that the default is 5 seconds ;) Hmm, I don't know where I found these 5 seconds; it seems it's just what I set 🤔

@jonatanklosko (Member)

> Hmm, I don't know where I found these 5 seconds; it seems it's just what I set

The default context is 1/6 of the chunk length; for Whisper the chunk is 30s, so the context is 5s (on both sides, so it's a 10s overlap).

I'm not sure if we can reasonably handle arbitrarily small chunks (especially as we do context, because then the context would be very small). So I would imagine we accumulate first 30s, then next 20s, next 20s.

@mat-hek (Author) commented Oct 10, 2023

Small chunks still work pretty well IMO; check Lars's talk, where he has a live transcription on the slides. From my experience, the accuracy drops for sentences longer than a chunk, so I guess that context could help here. We can actually provide a lot of 'previous' / 'left side' context without sacrificing latency. The right-side context would impact latency, but maybe even 1 or 2 seconds could help, as we wouldn't break words apart.

@jonatanklosko (Member)

> I would imagine we accumulate first 30s, then next 20s, next 20s.

Ah, we should accumulate whatever the chunk_length is, so yeah, it could just as well be smaller.

@josevalim (Contributor)

Yeah, we can probably transform the stream to either split or accumulate to the batch size. We can also just do nothing and tell the user that whatever audio size they pass will be sent as is, so the buffering is on them. The latter is the most flexible and likely the simplest too.

@jonastemplestein

Amazing! I have a little toy project that could really use this (literally, a toy for my daughter that she can speak to).

For my use case, it is important to minimise the latency after somebody has finished speaking.

Once I detect silence on my end, I'd like to tell Bumblebee to "force a chunk", even if it's only been a short time since the last chunk was transcribed.

It would also be really useful to send not just the transcribed words to the caller, but also whether or not those words have been "confirmed" by later context (or perhaps the "confidence" in the transcribed words). Whisper is quite good at creating a best-guess transcription from a short chunk, and often that is good enough to use speculatively.

For example, in the context of my voice agent, I might detect silence, force Whisper to do what I assume to be a final chunk, and send the resulting preliminary transcription onwards to an LLM. But it may turn out the speaker was just making a short pause and resumes speaking. I'd then keep transcribing, and if that further context changes the words I already sent to the LLM, I'll abort the LLM call (provided it hasn't been read to the user yet) and redo it with the new, more correct transcription.

For this to work well, it's valuable to think about how transcripts from overlapping chunks are merged (and how the chunk boundaries are chosen). A good example in the Python ecosystem is here: https://github.com/ufal/whisper_streaming

Lots of companies are trying to build low-latency voice agents at the moment, and I think Elixir would be a great choice for building them if it had a great realtime transcription implementation. Ideally this would eventually include word-level timestamps and multi-speaker diarization. @jonatanklosko do you know of any efforts in the Elixir community to do this?

BTW, regarding the multiple ffmpeg calls per chunk: I think you can probably have a single ffmpeg process that you stream in and out of using stdin and stdout. That would also slightly reduce the latency cost of "booting" an ffmpeg process for each chunk.
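
For what it's worth, a rough sketch of that idea using an Erlang port (the FfmpegPipe module is hypothetical, not anything Bumblebee or ffmpeg provides), letting ffmpeg probe the input format and emit 16kHz mono f32 PCM:

```elixir
defmodule FfmpegPipe do
  # One long-lived ffmpeg process that reads encoded audio on stdin and writes
  # 16kHz mono f32 PCM on stdout, instead of booting ffmpeg once per chunk.
  def open do
    ffmpeg = System.find_executable("ffmpeg")

    Port.open({:spawn_executable, ffmpeg}, [
      :binary,
      :exit_status,
      args: ~w(-loglevel quiet -i pipe:0 -ac 1 -ar 16000 -f f32le pipe:1)
    ])
  end

  # Push an encoded chunk into ffmpeg's stdin.
  def push(port, encoded_chunk), do: Port.command(port, encoded_chunk)

  # Pull whatever decoded PCM ffmpeg has produced so far.
  def pull(port, timeout \\ 100) do
    receive do
      {^port, {:data, pcm}} -> {:ok, pcm}
    after
      timeout -> :empty
    end
  end
end
```

Note that ffmpeg may buffer output internally, so the decoded PCM can lag slightly behind what has been pushed in.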

@josevalim (Contributor)

If we stream, we will likely expect PCM chunks, so the ffmpeg conversion would be up to you (which you can do with a live process or even a NIF). @mat-hek and the Membrane folks will likely have better ideas here.

@mat-hek (Author) commented Oct 20, 2023

> If we stream, we will likely expect PCM chunks

Seems very reasonable

> conversion would be up to you

You can use Membrane for that too 😄 here's a PR with a Livebook example: membraneframework/membrane_demo#249

@lawik commented Oct 24, 2023

Is there a difference between it accepting a real stream and repeatedly calling it with the chunk size you want processed?

I guess the current functionality for improving the edges of chunks with overlap and so on suffers when I just send it exactly-sized chunks?

@mat-hek, as you would know, it is not particularly hard to get an appropriate slice of PCM to send out of Membrane :D.

@jonatanklosko (Member)

@lawik the idea is that we get a stream of continuous chunks, but we would still do the overlapping as part of preprocessing and then merge the overlaps in postprocessing to improve the output.

@lawik commented Oct 24, 2023

Awesome!

@linusdm (Contributor) commented Oct 25, 2023

Is this discussion targeted at enabling Whisper specifically? Or will these improvements also allow other, more general audio processing models (e.g. audio classification models) to benefit from this streaming solution?

@jonatanklosko (Member)

@linusdm Whisper is currently the only audio model we support. I'm not sure how relevant input streaming is for classification models, since they predict a single label rather than a streamed transcription.

@jonatanklosko (Member) commented Mar 11, 2024

#361 enables input streaming.

Thinking more about this, I'm not entirely sure if the context overlapping algorithm is going to be very effective with small chunks (as needed for live transcription). The way the algorithm works is that we transcribe two consecutive, overlapping chunks of audio, and they should result in two sentences that overlap to some extent at the edges. Then we merge the overlaps to hopefully get the right transcription from the left chunk and from the right chunk. The issue with small chunks is that the sentences are short and there may be very few, if any, overlapping words. Also note that this means an additional delay, because in order to finish a chunk, we need the transcription from the subsequent overlapping chunk.
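
Just to illustrate the merging idea (this is not the actual Bumblebee implementation), a naive version could drop the longest word-level overlap between the two transcriptions:

```elixir
defmodule OverlapMerge do
  # Naive merge of two transcriptions from overlapping chunks: find the longest
  # suffix of `left` that equals a prefix of `right` and drop it from `right`.
  def merge(left, right) do
    left_words = String.split(left)
    right_words = String.split(right)

    overlap =
      Enum.find(length(left_words)..0//-1, 0, fn n ->
        Enum.take(left_words, -n) == Enum.take(right_words, n)
      end)

    Enum.join(left_words ++ Enum.drop(right_words, overlap), " ")
  end
end

OverlapMerge.merge("the quick brown fox", "brown fox jumps over")
#=> "the quick brown fox jumps over"
```

With very short chunks the two transcriptions may share no words at all, in which case this degrades to plain concatenation.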

So for short chunks it may be better to not use the overlapping chunking and have some other logic, such as splitting input at low amplitude points to avoid cutting mid-word.
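
Something along these lines, for instance (a purely hypothetical SilenceSplit module, assuming 16 kHz mono samples in an Nx tensor and 25 ms frames), could pick a quiet point to cut at:

```elixir
defmodule SilenceSplit do
  @frame_size 400  # 25 ms at 16 kHz

  # Return the sample index of the quietest frame in the window, as a rough
  # "safe" place to cut the audio without slicing through a word.
  def split_index(samples) do
    n_frames = div(Nx.size(samples), @frame_size)

    quietest_frame =
      samples
      |> Nx.slice([0], [n_frames * @frame_size])
      |> Nx.abs()
      |> Nx.reshape({n_frames, @frame_size})
      |> Nx.mean(axes: [1])
      |> Nx.argmin()
      |> Nx.to_number()

    quietest_frame * @frame_size
  end
end
```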

These are just high-level thoughts though!

@samrat commented Mar 13, 2024

Hello,

I'm trying to use this in a Livebook using kino_live_audio: https://gist.github.com/samrat/fc5792bfc870ad887f29d4a944cafd7d . I'm passing a Stream to the serving, but I'm not seeing any output. Could you help me figure out what I'm doing wrong?

@jonatanklosko (Member) commented Mar 13, 2024

@samrat the main issue is that you are doing Enum.map instead of Stream.map, so the stream starts being consumed at that point and blocks further execution :) Here's a more minimal example:

.livemd
<!-- livebook:{"app_settings":{"access_type":"public","output_type":"rich","show_source":true,"slug":"vad"}} -->

# Streaming whisper

```elixir
Mix.install(
  [
    {:kino_live_audio, "~> 0.1"},
    {:nx, "~> 0.7.1"},
    {:bumblebee, github: "elixir-nx/bumblebee"},
    {:exla, ">= 0.0.0"},
    {:kino, github: "livebook-dev/kino", override: true}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
```

## Section

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 1],
    chunk_num_seconds: 6,
    context_num_seconds: 2,
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, serving: serving, name: WhisperServing})
```

```elixir
live_audio = KinoLiveAudio.new(chunk_size: 1, unit: :s, sample_rate: featurizer.sampling_rate)
```

```elixir
audio_stream =
  live_audio
  |> Kino.Control.stream()
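  # each audio event carries a list of raw samples; convert it to a mono Nx tensor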
  |> Stream.map(fn %{chunk: data} ->
    Nx.tensor(data)
    |> Nx.stack()
    |> Nx.reshape({:auto, 1})
    |> Nx.mean(axes: [1])
  end)

frame = Kino.Frame.new() |> Kino.render()

for chunk <- Nx.Serving.batched_run(WhisperServing, audio_stream) do
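  # with stream: true, results arrive incrementally as each transcribed chunk is ready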
  Kino.Frame.append(frame, Kino.Text.new(chunk.text, chunk: true))
end
```

Sidenote: if you look at the console logs and the chunks are not being produced, it may be because the page was denied microphone access.
