Really Real Time Speech To Text #608
Replies: 22 comments 29 replies
-
OMG, could this be used for my request from a few days ago? Only instead of transcribing what I say into my microphone, it would be great if the audio from my PC (what my teammates say) were turned into text.
-
When I test a 5-second audio clip, Whisper takes 10 seconds to return the ASR result. Is it really that slow?
-
I decided to take this idea a little further and made a GUI app with various settings and features. You can check it out here: https://github.com/davabase/transcriber_app/
-
What hardware setup would you suggest to run this with the large model as fast as reasonably possible?
-
Not sure if Whisper can be "real-time"... but it can be fast!
-
I run the stream example, but there is no output!

whisper.cpp-master$ ./stream -m models/ggml-large.bin -t 8 --step 1000 --length 5000 -kc -ac 512
main: processing 16000 samples (step = 1.0 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
[Buzzing]
main: WARNING: cannot process audio fast enough, dropping audio ...
[ Silence ]
main: WARNING: cannot process audio fast enough, dropping audio ...
[ Silence ]
[ Buzzing ]
main: WARNING: cannot process audio fast enough, dropping audio ..
-
Please see my project below, which uses the Whisper Tiny TFLite model to implement audio streaming.
-
https://github.com/FR33TR1ST/whisper_realtime/blob/5e046b16a9ae32ba6e8aa5d595cffb9cbf221a6d/Voice_Asistant.py
-
@davabase is it possible for a web application (in the browser) to stream audio to Whisper the way yours does (say we have a Docker environment), with the string output returned to the web application?
-
I actually did a simple time measurement test, run on a V100 GPU. One confusing thing I encountered is that when I disabled FP16 and ran at FP32, it ran faster. I shared my test code here:

===========================================
Audio Length (sec): 627.637
===========================================

Not sure whether this would be useful to someone. Any suggestions on this? Thank you all!
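For reference, a minimal FP16-vs-FP32 timing harness along these lines could look like the sketch below, assuming the standard openai-whisper Python package; the audio path and model size are placeholders, not necessarily the setup used above:

```python
import time

import whisper  # pip install openai-whisper

AUDIO_PATH = "long_recording.wav"     # placeholder; ~627 s of audio in the test above
model = whisper.load_model("medium")  # model size is an assumption

for use_fp16 in (True, False):
    start = time.perf_counter()
    result = model.transcribe(AUDIO_PATH, fp16=use_fp16)
    elapsed = time.perf_counter() - start
    # End timestamp of the last segment approximates the audio length.
    audio_len = result["segments"][-1]["end"] if result["segments"] else 0.0
    print(f"fp16={use_fp16}  audio={audio_len:.1f}s  wall={elapsed:.1f}s  "
          f"RTF={elapsed / max(audio_len, 1e-6):.2f}")
```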
-
Thanks!
…On Tue, 17 Jan 2023, 6:00 pm, Oliver Renner wrote:
Hi Nathan,
I'm using http://nlpcloud.com
-
Has this been implemented into the master repo yet? I'd be excited to try it.
-
You can send chunks of speech for streaming by using Redis to handle the queue into the Whisper engine. Before transcribing, a Silero VAD pass estimates the speech probability of each chunk; if the probability is higher than a threshold, the chunk goes into a buffer, and the buffer is passed through VAD again to estimate the probability of one complete audio segment. If that probability is also higher than the threshold, the segment is transcribed. I use this method, and the results show fast processing with the large model while reducing random phrases caused by background noise.
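A rough sketch of that kind of VAD-gated pipeline, assuming Silero VAD loaded via torch.hub, the openai-whisper package, and a Redis list named audio_chunks carrying raw 16 kHz float32 mono PCM; the names, thresholds, and the 0.5 s end-of-segment pause are illustrative assumptions, not the poster's actual implementation:

```python
import numpy as np
import redis
import torch
import whisper  # pip install openai-whisper redis

SAMPLE_RATE = 16000
VAD_THRESHOLD = 0.5                    # illustrative speech-probability threshold
END_SILENCE = int(0.5 * SAMPLE_RATE)   # ~0.5 s of trailing silence closes a segment

# Silero VAD via torch.hub; the first util is get_speech_timestamps.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

asr = whisper.load_model("large")
queue = redis.Redis()

def speech_regions(audio: np.ndarray):
    """Regions whose speech probability exceeds the threshold, in samples."""
    return get_speech_timestamps(
        torch.from_numpy(audio.copy()), vad_model,
        sampling_rate=SAMPLE_RATE, threshold=VAD_THRESHOLD)

buffer = np.zeros(0, dtype=np.float32)

while True:
    # Block until a chunk of raw 16 kHz float32 mono PCM arrives on the Redis list.
    _, payload = queue.blpop("audio_chunks")
    chunk = np.frombuffer(payload, dtype=np.float32)

    # Pass 1: drop noise-only chunks before they ever reach the buffer.
    if not speech_regions(chunk):
        continue
    buffer = np.concatenate([buffer, chunk])

    # Pass 2: run VAD over the whole buffer and transcribe once the last
    # speech region ends well before the end of the buffer (speaker paused).
    regions = speech_regions(buffer)
    if regions and regions[-1]["end"] < len(buffer) - END_SILENCE:
        result = asr.transcribe(buffer, fp16=torch.cuda.is_available())
        print(result["text"].strip())
        buffer = np.zeros(0, dtype=np.float32)
```

The two VAD passes mirror the gating described above: one keeps noise-only chunks out of the buffer, and one over the whole buffer decides when a complete segment is ready to transcribe.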
-
This is great stuff! I was looking into utilizing OpenAI Whisper with serverless GPUs for the computing power. However, just running the math, it gets super expensive if you are, say, transcribing 80 hours of conversations. Most serverless GPUs cost between $0.003 and $0.004 per minute, which doesn't seem feasible if you are transcribing, say, 160 minutes of audio per day. Are there alternative solutions other than using your own hardware to transcribe? Are there other cloud solutions?
-
I recently developed a project that is somewhat related to the contents of this discussion. The project is available at: https://github.com/voyagingstar/able
-
Amazing!
-
I am also working on Whisper AI for real-time transcription when a record button is clicked in Django. My blog is:
-
Hi, all. I recommend Whisper-Streaming for really real-time speech-to-text. It has a self-adaptive latency policy based on the actual complexity of the source.
-
Thanks for this, davabase! I adapted it into a Discord bot that can be voice controlled, like you might use an Alexa.
-
Has anyone worked on deploying Whisper for 1000+ concurrent users? Batching requests efficiently would be the main challenge, along with the real-time infra. Are there any good open-source projects for setting up this infra?
-
I'm looking to have this implemented as real-time speech-to-text in NodeJS. Can you assist me with this? @davabase
-
I've seen some of the examples that do real-time transcription and they're great, but they all record short snippets of audio and then transcribe them one after the other. This has two problems: audio gets lost in the gaps between one recording ending and the next one starting, and a snippet that cuts off mid-phrase gets transcribed poorly with no way to correct it later.
I tackled these problems by always recording audio in a thread, so there are no gaps, and by concatenating the previous audio data with the latest recording. This allows you to rerun transcriptions on previously incomplete audio snippets. The result is that the model can correct issues from when it transcribed a recording that was cut off.
Here's a demo of me reading The Last Question by Isaac Asimov. I really like how you can see the progress of the transcription quality: it first transcribes small recordings to give that real-time feel, and then the transcription gets better as more audio data is added.
If you had a UI that showed transcribed text, you could update the text in real time as it is corrected with new audio data.
You can check out the demo here: https://github.com/davabase/whisper_real_time
The demo has features to detect when speech stops and start a new audio buffer; in theory you could just string together an endless audio buffer and keep feeding it to the model, though that would make each transcription take longer.
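In case it helps to see the shape of the approach, here is a rough, stripped-down sketch (not the repo's exact code), built on the SpeechRecognition package for background capture and the openai-whisper package for inference; the model size, phrase timeout, and chunk length are placeholder values:

```python
import queue
import time
from datetime import datetime, timedelta

import numpy as np
import speech_recognition as sr
import whisper  # pip install openai-whisper SpeechRecognition pyaudio

PHRASE_TIMEOUT = timedelta(seconds=3)  # silence gap that starts a new phrase
CHUNK_SECONDS = 2                      # max length of each background recording

model = whisper.load_model("base.en")
recorder = sr.Recognizer()
recorder.dynamic_energy_threshold = False
source = sr.Microphone(sample_rate=16000)

audio_queue: "queue.Queue[bytes]" = queue.Queue()

def record_callback(_, audio: sr.AudioData) -> None:
    # Runs on the background recording thread: just enqueue raw 16-bit PCM.
    audio_queue.put(audio.get_raw_data())

with source:
    recorder.adjust_for_ambient_noise(source)
recorder.listen_in_background(source, record_callback,
                              phrase_time_limit=CHUNK_SECONDS)

phrase_bytes = b""
phrase_time = None
transcript = [""]

while True:
    if audio_queue.empty():
        time.sleep(0.25)
        continue

    now = datetime.now()
    if phrase_time and now - phrase_time > PHRASE_TIMEOUT:
        # Speech stopped long enough: keep the finished text, start a new buffer.
        phrase_bytes = b""
        transcript.append("")
    phrase_time = now

    # Concatenate the previous audio with everything captured since the last
    # pass, then re-transcribe the whole phrase so earlier cut-off words get
    # corrected by the fuller context.
    while not audio_queue.empty():
        phrase_bytes += audio_queue.get()
    audio_np = (np.frombuffer(phrase_bytes, dtype=np.int16)
                .astype(np.float32) / 32768.0)
    transcript[-1] = model.transcribe(audio_np, fp16=False)["text"].strip()

    print("\n".join(transcript), end="\n\n", flush=True)
```

The key point is that listen_in_background keeps capturing while the main loop is busy transcribing, so nothing is dropped, and each pass re-feeds the whole phrase buffer to the model so earlier guesses get overwritten by better ones.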