
Server example? #1369

Open
Azeirah opened this issue Oct 16, 2023 · 23 comments · May be fixed by #1375
Labels
enhancement New feature or request

Comments

@Azeirah

Azeirah commented Oct 16, 2023

I'm working on a voice-controlled application and I want to run small .wav files through whisper fairly often.

What I noticed is that it takes almost 50% of the total time just to load the model every single time I run ./main -m ... "my-short-spoken-command.wav"

I think it'd be nice if, like llama.cpp, this project included a server example, so the model only has to be loaded once and stays in memory after loading.
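The pattern being requested can be sketched independently of whisper.cpp; the `load_model` and `transcribe` stubs below are placeholders (not the real whisper.cpp API), but they show how a server amortizes the one-time model load across many requests:

```python
from typing import Callable, Iterable, List

def serve(load_model: Callable[[], object],
          transcribe: Callable[[object, bytes], str],
          requests: Iterable[bytes]) -> List[str]:
    """Pay the model-load cost once, then reuse the model per request."""
    model = load_model()    # expensive: done a single time at startup
    results = []
    for wav in requests:    # stand-in for a socket accept loop
        results.append(transcribe(model, wav))
    return results
```

In the CLI workflow described above, `load_model` runs once per `./main` invocation; in a server it runs once per process lifetime.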

@Azeirah
Author

Azeirah commented Oct 16, 2023

For what it's worth, I already have a very rudimentary server example working. It's a bit of a Frankenstein copy-paste of whisper/examples/main and llama.cpp/examples/server/server.cpp, but it works. I'm not great at C++ whatsoever, so I was happy to be able to copy and paste almost everything from those two examples.

It supports configuring the server in exactly the same way as the llama server, and it supports these (untested) parameters:

    int32_t n_threads = std::min(12, (int32_t) std::thread::hardware_concurrency());
    int32_t n_processors = 1;
    int32_t offset_t_ms = 0;
    int32_t offset_n = 0;
    int32_t duration_ms = 0;
    int32_t progress_step = 5;
    int32_t max_context = -1;
    int32_t max_len = 0;
    int32_t best_of = 2;
    int32_t beam_size = -1;
    std::string model = "models/ggml-base.en.bin";

It does not support diarization, language selection, or any of the output options; my goal was to get a working server for my own application.

Anyone interested in a PR?

@bobqianic bobqianic added the enhancement New feature or request label Oct 16, 2023
@FSSRepo

FSSRepo commented Oct 17, 2023

Maybe, when I finish working on optimizing stable-diffusion.cpp and adding a server to it, I could create a server example for whisper.cpp.

@Azeirah
Author

Azeirah commented Oct 18, 2023

I posted the code as a PR #1375

@bobqianic bobqianic linked a pull request Oct 20, 2023 that will close this issue
@ggerganov
Owner

Hey all, I notice several server examples being proposed. This is super cool!

I'm planning to do a major update to whisper.cpp in the next few days, bringing some new features and performance improvements. This will be the highest priority, so to keep distractions to a minimum, the server examples will have to wait until we finish the new release. Sorry for the delay.

@Azeirah Azeirah linked a pull request Nov 3, 2023 that will close this issue
@ggerganov
Owner

Hi again! I think we should restart the server efforts now that v1.5.0 is released.

I like both #1375 and #1380, so I'm not sure how to decide which one to integrate.
@Azeirah and @felrock (and others): do you have any opinion on this?

Also, I think we should aim to support the OpenAI Audio API for speech to text: https://platform.openai.com/docs/api-reference/audio

The approach in #1418 is also interesting, so it can be merged as an alternative solution to the REST-based server example.
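The OpenAI Audio API mentioned above takes a multipart/form-data POST to /v1/audio/transcriptions with `file` and `model` fields (per the linked OpenAI docs). For illustration, a compatible request body can be built with just the standard library; the filename and field layout here follow that spec, and the model name is a placeholder:

```python
import uuid

def build_transcription_request(wav_bytes: bytes, model: str = "whisper-1"):
    """Build a multipart/form-data body for POST /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    crlf = "\r\n"
    model_part = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="model"{crlf}{crlf}'
        f"{model}{crlf}"
    ).encode()
    file_head = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="file"; filename="audio.wav"{crlf}'
        f"Content-Type: audio/wav{crlf}{crlf}"
    ).encode()
    closing = f"--{boundary}--{crlf}".encode()
    body = model_part + file_head + wav_bytes + crlf.encode() + closing
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers
```

A server that accepts this shape of request could then be driven by any OpenAI-compatible client.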

@felrock
Collaborator

felrock commented Nov 16, 2023

Hello! I'm keen on fixing and merging my changes for the server. I've seen that the server in llama.cpp has enabled projects such as ollama and others, so I think it's an important application to have, making it easy for users to build interfaces against it.

I have also started to create a similar server solution for bark.cpp, because in my use case I would like to have some sort of voice (a bit more granular than espeak). That would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).

@ggerganov
Owner

Yup, I agree that a server can find many interesting applications.

Which would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).

Yes! Great idea - we are getting close :)

@colinator

Also agree. To hawk my proposal #1418 (that fork is a bit messy, but something like it): I think it'd be really great to have the ability to create many types of servers. For instance, I might want a gRPC server, or a REST server, or a ROS pub-sub node. Likewise, many types of encodings for the result: maybe JSON, maybe BSON, maybe protobuf, etc. I think it'd require very little refactoring: basically, just the core stream server as a class with a method infer(audio data*). Happy to help!
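A minimal sketch of that separation (names are illustrative, not taken from any of the linked PRs): the core owns the loaded model and exposes infer(); each transport or result encoding is then a thin adapter around it.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    t0_ms: int
    t1_ms: int
    text: str

class TranscriberCore:
    """Transport-agnostic core: wraps one loaded model behind infer()."""
    def __init__(self, backend: Callable[[bytes], List[Segment]]):
        self._backend = backend  # the actual model call, injected

    def infer(self, audio: bytes) -> List[Segment]:
        return self._backend(audio)

def to_json_like(segments: List[Segment]) -> List[dict]:
    """One possible result encoding; BSON/protobuf would be siblings."""
    return [{"t0": s.t0_ms, "t1": s.t1_ms, "text": s.text} for s in segments]
```

A REST handler, a gRPC service, or a ROS node would each hold one `TranscriberCore` and only translate payloads in and results out.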

@nortekax

Which would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).
The best I found for voice is https://github.com/rhasspy/piper ; it has a nice-sounding voice and it is faster than Bark.

@ggerganov
Owner

First pass of a server example has been merged (#1380).

Looks like streaming and diarization are two of the most requested features for the server. Not sure if we can do something meaningful for diarization, but we should be able to provide a streaming API relatively easily.
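In its simplest form, a streaming API could feed fixed-size chunks into the model as they arrive and yield partial transcripts; the sketch below uses a stubbed `infer` call and a sliding buffer, and is an illustration of the idea rather than a proposal for the actual interface:

```python
from typing import Callable, Iterable, Iterator

def stream_transcribe(chunks: Iterable[bytes],
                      infer: Callable[[bytes], str],
                      window: int = 3) -> Iterator[str]:
    """Re-run inference over a sliding buffer of the last `window`
    chunks, yielding a partial transcript after each new chunk."""
    buf: list = []
    for chunk in chunks:
        buf.append(chunk)
        buf = buf[-window:]          # keep only recent audio context
        yield infer(b"".join(buf))   # partial result for this window
```

A real implementation would additionally need to reconcile overlapping partial results instead of emitting each window independently.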

@felrock
Collaborator

felrock commented Nov 20, 2023

I left the diarization parameters in there, so it might work; I didn't know how it works or how to test it.

@nortekax

The server works well, but when speech is short, like "lights on", "lights off", etc, it doesn't produce any text.

I suspect whisper.cpp needs a long context because the command example asks for a long sentence first, before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful to provide agents that need short commands, like "lights on", "lights off", etc.

@felrock
Collaborator

felrock commented Nov 23, 2023

Ok, I've tried sending single-word .wav files to the server and it responds with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

@ggerganov
Owner

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.

@Azeirah
Author

Azeirah commented Nov 23, 2023

The server works well, but when speech is short, like "lights on", "lights off", etc, it doesn't produce any text.

I suspect whisper.cpp needs a long context because the command example asks for a long sentence first, before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful to provide agents that need short commands, like "lights on", "lights off", etc.

I haven't tested this specific server implementation, but the server implementation I was using previously definitely did work with short commands, I specifically made it for that purpose.

So either

  • There's something wrong with your data
  • There's something wrong with the glue between your data and the server
  • There's something wrong with the server implementation

Have you tried it with longer audio?

@nortekax

nortekax commented Nov 23, 2023

Have you tried it with longer audio?

Longer audio always works well; the same problem that happens with server also happens with main. Only command works well with the short voice commands, but only after you say the first long sentence that command asks you to say.

Edit: I am using sox to make the wav file. @Azeirah, what do you use to make the wav file?

@nortekax

From @felrock

Ok, I've tried sending single-word .wav files to the server and it responds with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

From @ggerganov

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.

I am using the following (based on https://stackoverflow.com/questions/30006609/using-sox-for-voice-detection-and-streaming) to generate the wav:

sox -q -c 1 -r 16000 -d  -b 16 -r 16000 "$outwav"  silence 1 0.3 1% 1 0.3 1%

I play the wav and it is okay. How do you generate your wavs?

@felrock
Collaborator

felrock commented Nov 23, 2023

I'm using a Python script, which uses PyAudio. I record for about three seconds per .wav file.

@nortekax

Thanks, how do you detect voice?

@nortekax

Okay, I solved the problem. Posting it here for those interested: I simply padded the wav file with 500 ms of silence at the beginning and 1 second of silence at the end, and everything works fine now.
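For reference, the same padding workaround can be done without sox; a minimal sketch using only Python's wave module, with the padding amounts described above:

```python
import io
import wave

def pad_wav(src, lead_ms: int = 500, tail_ms: int = 1000) -> io.BytesIO:
    """Pad a PCM WAV file with silence at the start and end."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        frames = r.readframes(r.getnframes())
    bytes_per_frame = params.sampwidth * params.nchannels
    lead = b"\x00" * (params.framerate * lead_ms // 1000 * bytes_per_frame)
    tail = b"\x00" * (params.framerate * tail_ms // 1000 * bytes_per_frame)
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setparams(params)  # nframes is patched on close
        w.writeframes(lead + frames + tail)
    out.seek(0)
    return out
```

This keeps the sample rate and sample width of the input untouched and only prepends/appends zero samples.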

@nortekax

It would be really useful to add the grammar/commands.txt functionality from the command example to the server.

@ggerganov
Owner

Ah, sorry about that - I forgot there is logic to ignore sub-one-second audio:

whisper.cpp/whisper.cpp

Lines 5193 to 5198 in a5881d6

// of only 1 second left, then stop
if (seek + 100 >= seek_end) {
break;
}
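For context (my reading of the snippet, not stated in the thread): `seek` counts 10 ms frames, i.e. 100 frames per second, so the check stops processing whenever less than one second of audio remains, which matches the empty results on sub-second clips. A sketch of the arithmetic:

```python
FRAMES_PER_SECOND = 100  # assumption: seek counts 10 ms frames

def stops_early(seek: int, seek_end: int) -> bool:
    """Mirrors `if (seek + 100 >= seek_end) break;` from the excerpt."""
    return seek + FRAMES_PER_SECOND >= seek_end

# A 900 ms clip (seek_end = 90) is skipped before any decoding happens,
# while a 2 s clip (seek_end = 200) gets processed.
```

This also explains why padding the file past the one-second threshold made short commands work.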

@bobqianic
Collaborator

bobqianic commented Dec 10, 2023

Ah sorry, about that - I forgot there is logic to ignore sub one second audio:

whisper.cpp/whisper.cpp

Lines 5193 to 5198 in a5881d6

// of only 1 second left, then stop
if (seek + 100 >= seek_end) {
break;
}

BTW, why do we need this logic here?

See #1603
