Really Real Time Speech To Text #608
Replies: 22 comments 29 replies
-
OMG, could this be used for my request from a few days ago? Only instead of transcribing what I say into my microphone, it would be great if the audio from my PC (what my teammates say) were turned into text.
-
When I test a 5-second audio clip, Whisper takes 10 seconds to return the ASR result. Is it really that slow?
-
I decided to take this idea a little further and made a GUI app with various settings and features. You can check it out here: https://github.com/davabase/transcriber_app/
-
What hardware setup would you suggest to run this with the large model as fast as reasonably possible?
-
Not sure if Whisper can be "real-time"... but it can be fast!
-
I run the stream example, but there is no output!

whisper.cpp-master$ ./stream -m models/ggml-large.bin -t 8 --step 1000 --length 5000 -kc -ac 512
main: processing 16000 samples (step = 1.0 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
[Buzzing]
main: WARNING: cannot process audio fast enough, dropping audio ...
[ Silence ]
main: WARNING: cannot process audio fast enough, dropping audio ...
[ Silence ]
[ Buzzing ]
main: WARNING: cannot process audio fast enough, dropping audio ..
-
Please see my project below, which uses the Whisper Tiny TFLite model to implement audio streaming.
-
https://github.com/FR33TR1ST/whisper_realtime/blob/5e046b16a9ae32ba6e8aa5d595cffb9cbf221a6d/Voice_Asistant.py
-
@davabase is it possible for a web application (in the browser) to stream audio to Whisper the way yours does (say we have a Docker environment), with the string output returned to the web application?
-
I actually did a simple time measurement test, run on a V100 GPU. One confusing thing I encountered is that when I disabled FP16 and ran at FP32, it ran faster. I shared my test code here:

===========================================
Audio Length (sec): 627.637
===========================================

Not sure whether this would be useful to someone. Any suggestions on this? Thank you all!
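For reference, a minimal FP16-vs-FP32 timing harness along these lines could look like the sketch below, assuming the standard openai-whisper Python package; the audio path and model size are placeholders, not necessarily the setup used above:

```python
import time

import whisper  # pip install openai-whisper

AUDIO_PATH = "long_recording.wav"     # placeholder; ~627 s of audio in the test above
model = whisper.load_model("medium")  # model size is an assumption

for use_fp16 in (True, False):
    start = time.perf_counter()
    result = model.transcribe(AUDIO_PATH, fp16=use_fp16)
    elapsed = time.perf_counter() - start
    # End timestamp of the last segment approximates the audio length.
    audio_len = result["segments"][-1]["end"] if result["segments"] else 0.0
    print(f"fp16={use_fp16}  audio={audio_len:.1f}s  wall={elapsed:.1f}s  "
          f"RTF={elapsed / max(audio_len, 1e-6):.2f}")
```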
-
Thanks!
…On Tue, 17 Jan 2023, 6:00 pm, Oliver Renner wrote:
Hi Nathan,
I'm using http://nlpcloud.com
-
Has this been implemented into the master repo yet? I'd be excited to try it.
-
You can send chunks of speech for streaming by using Redis to handle the queue into the Whisper engine. Before transcribing, a Silero VAD pass estimates the speech probability of each chunk; if the probability is higher than a threshold, the chunk goes into a buffer, and the buffer is passed through VAD again to estimate the probability of one complete audio segment. If that probability is also higher than the threshold, the segment is transcribed. I use this method, and the results show fast processing with the large model while reducing random phrases caused by background noise.
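A rough sketch of that kind of VAD-gated pipeline, assuming Silero VAD loaded via torch.hub, the openai-whisper package, and a Redis list named audio_chunks carrying raw 16 kHz float32 mono PCM; the names, thresholds, and the 0.5 s end-of-segment pause are illustrative assumptions, not the poster's actual implementation:

```python
import numpy as np
import redis
import torch
import whisper  # pip install openai-whisper redis

SAMPLE_RATE = 16000
VAD_THRESHOLD = 0.5                    # illustrative speech-probability threshold
END_SILENCE = int(0.5 * SAMPLE_RATE)   # ~0.5 s of trailing silence closes a segment

# Silero VAD via torch.hub; the first util is get_speech_timestamps.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

asr = whisper.load_model("large")
queue = redis.Redis()

def speech_regions(audio: np.ndarray):
    """Regions whose speech probability exceeds the threshold, in samples."""
    return get_speech_timestamps(
        torch.from_numpy(audio.copy()), vad_model,
        sampling_rate=SAMPLE_RATE, threshold=VAD_THRESHOLD)

buffer = np.zeros(0, dtype=np.float32)

while True:
    # Block until a chunk of raw 16 kHz float32 mono PCM arrives on the Redis list.
    _, payload = queue.blpop("audio_chunks")
    chunk = np.frombuffer(payload, dtype=np.float32)

    # Pass 1: drop noise-only chunks before they ever reach the buffer.
    if not speech_regions(chunk):
        continue
    buffer = np.concatenate([buffer, chunk])

    # Pass 2: run VAD over the whole buffer and transcribe once the last
    # speech region ends well before the end of the buffer (speaker paused).
    regions = speech_regions(buffer)
    if regions and regions[-1]["end"] < len(buffer) - END_SILENCE:
        result = asr.transcribe(buffer, fp16=torch.cuda.is_available())
        print(result["text"].strip())
        buffer = np.zeros(0, dtype=np.float32)
```

The two VAD passes mirror the gating described above: one keeps noise-only chunks out of the buffer, and one over the whole buffer decides when a complete segment is ready to transcribe.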
-
This is great stuff! I was looking into utilizing OpenAI Whisper with serverless GPUs for the computing power. However, just running the math, it gets super expensive if you are, say, transcribing 80 hours of conversations. Most serverless GPUs cost between $0.003 and $0.004 per minute, which doesn't seem feasible if you are transcribing, say, 160 minutes of audio per day. Are there alternative solutions other than using your own hardware to transcribe? Are there other cloud solutions?
-
I recently developed a project that is somewhat related to the contents of this discussion. The project is available at: https://github.com/voyagingstar/able
-
Amazing!
-
I am also working on Whisper AI for real-time transcription when a record button is clicked in Django. My blog is:
-
Hi, all. I recommend Whisper-Streaming for really real-time speech-to-text. It has a self-adaptive latency policy based on the actual complexity of the source.
-
Thanks for this, davabase! I adapted it into a Discord bot that can be voice controlled, like you might use an Alexa.
-
Has anyone worked on deploying Whisper for 1000+ concurrent users? Batching requests efficiently would be the main challenge, along with the real-time infra. Are there any good open-source projects for setting up this infra?
-
I'm looking to have this implemented as real-time speech-to-text in NodeJS. Can you assist me with this? @davabase
-
I've seen some of the examples that do real-time transcription and they're great, but they all record short snippets of audio and then transcribe them one after the other. This has two problems: audio gets lost in the gaps between one recording ending and the next one starting, and a snippet that cuts off mid-phrase gets transcribed poorly with no way to correct it later.
I tackled these problems by always recording audio in a thread, so there are no gaps, and by concatenating the previous audio data with the latest recording. This allows you to rerun transcriptions on previously incomplete audio snippets. The result is that the model can correct issues from when it transcribed a recording that was cut off.
Here's a demo of me reading The Last Question by Isaac Asimov. I really like how you can see the progress of the transcription quality: it first transcribes small recordings to give that real-time feel, and then the transcription gets better as more audio data is added.
If you had a UI that showed transcribed text, you could update the text in real time as it is corrected with new audio data.
You can check out the demo here: https://github.com/davabase/whisper_real_time
The demo has features to detect when speech stops and start a new audio buffer; in theory you could just string together an endless audio buffer and keep feeding it to the model, though that would make each transcription take longer.
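In case it helps to see the shape of the approach, here is a rough, stripped-down sketch (not the repo's exact code), built on the SpeechRecognition package for background capture and the openai-whisper package for inference; the model size, phrase timeout, and chunk length are placeholder values:

```python
import queue
import time
from datetime import datetime, timedelta

import numpy as np
import speech_recognition as sr
import whisper  # pip install openai-whisper SpeechRecognition pyaudio

PHRASE_TIMEOUT = timedelta(seconds=3)  # silence gap that starts a new phrase
CHUNK_SECONDS = 2                      # max length of each background recording

model = whisper.load_model("base.en")
recorder = sr.Recognizer()
recorder.dynamic_energy_threshold = False
source = sr.Microphone(sample_rate=16000)

audio_queue: "queue.Queue[bytes]" = queue.Queue()

def record_callback(_, audio: sr.AudioData) -> None:
    # Runs on the background recording thread: just enqueue raw 16-bit PCM.
    audio_queue.put(audio.get_raw_data())

with source:
    recorder.adjust_for_ambient_noise(source)
recorder.listen_in_background(source, record_callback,
                              phrase_time_limit=CHUNK_SECONDS)

phrase_bytes = b""
phrase_time = None
transcript = [""]

while True:
    if audio_queue.empty():
        time.sleep(0.25)
        continue

    now = datetime.now()
    if phrase_time and now - phrase_time > PHRASE_TIMEOUT:
        # Speech stopped long enough: keep the finished text, start a new buffer.
        phrase_bytes = b""
        transcript.append("")
    phrase_time = now

    # Concatenate the previous audio with everything captured since the last
    # pass, then re-transcribe the whole phrase so earlier cut-off words get
    # corrected by the fuller context.
    while not audio_queue.empty():
        phrase_bytes += audio_queue.get()
    audio_np = (np.frombuffer(phrase_bytes, dtype=np.int16)
                .astype(np.float32) / 32768.0)
    transcript[-1] = model.transcribe(audio_np, fp16=False)["text"].strip()

    print("\n".join(transcript), end="\n\n", flush=True)
```

The key point is that listen_in_background keeps capturing while the main loop is busy transcribing, so nothing is dropped, and each pass re-feeds the whole phrase buffer to the model so earlier guesses get overwritten by better ones.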