
Server example? #1369

Open
Azeirah opened this issue Oct 16, 2023 · 23 comments · May be fixed by #1375
Labels
enhancement New feature or request

Comments

@Azeirah

Azeirah commented Oct 16, 2023

I'm working on a voice-controlled application and I want to run small .wav files through whisper fairly often.

What I noticed is that it takes almost 50% of the total time just to load the model every single time I run ./main -m ... "my-short-spoken-command.wav"

I think it'd be nice if, like llama.cpp, this project included a server example, so the model only has to be loaded once and stays in memory after loading.
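The pattern being requested can be sketched independently of whisper.cpp; the `load_model` and `transcribe` stubs below are placeholders (not the real whisper.cpp API), but they show how a server amortizes the one-time model load across many requests:

```python
from typing import Callable, Iterable, List

def serve(load_model: Callable[[], object],
          transcribe: Callable[[object, bytes], str],
          requests: Iterable[bytes]) -> List[str]:
    """Pay the model-load cost once, then reuse the model per request."""
    model = load_model()    # expensive: done a single time at startup
    results = []
    for wav in requests:    # stand-in for a socket accept loop
        results.append(transcribe(model, wav))
    return results
```

In the CLI workflow described above, `load_model` runs once per `./main` invocation; in a server it runs once per process lifetime.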

@Azeirah
Author

Azeirah commented Oct 16, 2023

For what it's worth, I already have a very rudimentary server example working. It's a bit of a Frankenstein copy-paste of whisper/examples/main and llama.cpp/examples/server/server.cpp, but it works. I'm not great at C++ whatsoever, so I was happy to be able to copy and paste almost everything from those two examples.

It supports configuring the server in exactly the same way as the llama server, and it supports these (untested) parameters:

    int32_t n_threads = std::min(12, (int32_t) std::thread::hardware_concurrency());
    int32_t n_processors = 1;
    int32_t offset_t_ms = 0;
    int32_t offset_n = 0;
    int32_t duration_ms = 0;
    int32_t progress_step = 5;
    int32_t max_context = -1;
    int32_t max_len = 0;
    int32_t best_of = 2;
    int32_t beam_size = -1;
    std::string model = "models/ggml-base.en.bin";

It does not support diarization, language selection, or any of the output options; my goal was to get a working server for my own application.

Anyone interested in a PR?

@bobqianic bobqianic added the enhancement New feature or request label Oct 16, 2023
@FSSRepo

FSSRepo commented Oct 17, 2023

Maybe, when I finish working on optimizing stable-diffusion.cpp and adding a server to it, I could create a server example for whisper.cpp.

@Azeirah
Author

Azeirah commented Oct 18, 2023

I posted the code as a PR #1375

@bobqianic bobqianic linked a pull request Oct 20, 2023 that will close this issue
@ggerganov
Owner

Hey all, I notice several server examples being proposed. This is super cool!

I'm planning to do a major update to whisper.cpp in the next few days, bringing some new features and performance improvements. This will be the highest priority, so to keep distractions to a minimum, the server examples will have to wait until we finish the new release. Sorry for the delay.

@Azeirah Azeirah linked a pull request Nov 3, 2023 that will close this issue
@ggerganov
Owner

Hi again! I think we should restart the server efforts now that v1.5.0 is released.

I like both #1375 and #1380, so I'm not sure how to decide which one to integrate.
@Azeirah and @felrock (and others): do you have any opinion on this?

Also, I think we should aim to support the OpenAI Audio API for speech to text: https://platform.openai.com/docs/api-reference/audio

The approach in #1418 is also interesting, so it can be merged as an alternative solution to the REST-based server example.
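The OpenAI Audio API mentioned above takes a multipart/form-data POST to /v1/audio/transcriptions with `file` and `model` fields (per the linked OpenAI docs). For illustration, a compatible request body can be built with just the standard library; the filename and field layout here follow that spec, and the model name is a placeholder:

```python
import uuid

def build_transcription_request(wav_bytes: bytes, model: str = "whisper-1"):
    """Build a multipart/form-data body for POST /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    crlf = "\r\n"
    model_part = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="model"{crlf}{crlf}'
        f"{model}{crlf}"
    ).encode()
    file_head = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="file"; filename="audio.wav"{crlf}'
        f"Content-Type: audio/wav{crlf}{crlf}"
    ).encode()
    closing = f"--{boundary}--{crlf}".encode()
    body = model_part + file_head + wav_bytes + crlf.encode() + closing
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers
```

A server that accepts this shape of request could then be driven by any OpenAI-compatible client.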

@felrock
Collaborator

felrock commented Nov 16, 2023

Hello! I'm keen on fixing and merging my changes for the server. I've seen that the server in llama.cpp has enabled projects such as ollama and others, so I think it's an important application to have, making it easy for users to build interfaces against it.

I have also started to create a similar server solution for bark.cpp, because in my use case I would like to have some sort of voice (a bit more granular than espeak). That would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).

@ggerganov
Owner

Yup, I agree that a server can find many interesting applications.

Which would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).

Yes! Great idea - we are getting close :)

@colinator

Also agree. To hawk my proposal #1418 (that fork is a bit messy, but something like it): I think it'd be really great to have the ability to create many types of servers. For instance, I might want a gRPC server, or a REST server, or a ROS pub-sub node. Likewise, many types of encodings for the result: maybe JSON, maybe BSON, maybe protobuf, etc. I think it'd require very little refactoring: basically, just the core stream server as a class with a method infer(audio data*). Happy to help!
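A minimal sketch of that separation (names are illustrative, not taken from any of the linked PRs): the core owns the loaded model and exposes infer(); each transport or result encoding is then a thin adapter around it.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    t0_ms: int
    t1_ms: int
    text: str

class TranscriberCore:
    """Transport-agnostic core: wraps one loaded model behind infer()."""
    def __init__(self, backend: Callable[[bytes], List[Segment]]):
        self._backend = backend  # the actual model call, injected

    def infer(self, audio: bytes) -> List[Segment]:
        return self._backend(audio)

def to_json_like(segments: List[Segment]) -> List[dict]:
    """One possible result encoding; BSON/protobuf would be siblings."""
    return [{"t0": s.t0_ms, "t1": s.t1_ms, "text": s.text} for s in segments]
```

A REST handler, a gRPC service, or a ROS node would each hold one `TranscriberCore` and only translate payloads in and results out.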

@nortekax

Which would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp) and a voice (bark.cpp).
The best I found for voice is https://github.com/rhasspy/piper ; it has a nice-sounding voice and it is faster than Bark.

@ggerganov
Owner

First pass of a server example has been merged (#1380).

Looks like streaming and diarization are two of the most requested features for the server. Not sure if we can do something meaningful for diarization, but we should be able to provide a streaming API relatively easily.
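In its simplest form, a streaming API could feed fixed-size chunks into the model as they arrive and yield partial transcripts; the sketch below uses a stubbed `infer` call and a sliding buffer, and is an illustration of the idea rather than a proposal for the actual interface:

```python
from typing import Callable, Iterable, Iterator

def stream_transcribe(chunks: Iterable[bytes],
                      infer: Callable[[bytes], str],
                      window: int = 3) -> Iterator[str]:
    """Re-run inference over a sliding buffer of the last `window`
    chunks, yielding a partial transcript after each new chunk."""
    buf: list = []
    for chunk in chunks:
        buf.append(chunk)
        buf = buf[-window:]          # keep only recent audio context
        yield infer(b"".join(buf))   # partial result for this window
```

A real implementation would additionally need to reconcile overlapping partial results instead of emitting each window independently.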

@felrock
Collaborator

felrock commented Nov 20, 2023

I left the diarization parameters in there, so it might work; I didn't know how it works or how to test it.

@nortekax

The server works well, but when speech is short, like "lights on", "lights off", etc, it doesn't produce any text.

I suspect whisper.cpp needs a long context because the command example asks for a long sentence first, before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful to provide agents that need short commands, like "lights on", "lights off", etc.

@felrock
Collaborator

felrock commented Nov 23, 2023

Ok, I've tried sending single-word .wav files to the server and it responds with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

@ggerganov
Owner

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.

@Azeirah
Author

Azeirah commented Nov 23, 2023

The server works well, but when speech is short, like "lights on", "lights off", etc, it doesn't produce any text.

I suspect whisper.cpp needs a long context because the command example asks for a long sentence first, before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful to provide agents that need short commands, like "lights on", "lights off", etc.

I haven't tested this specific server implementation, but the server implementation I was using previously definitely did work with short commands, I specifically made it for that purpose.

So either

  • There's something wrong with your data
  • There's something wrong with the glue between your data and the server
  • There's something wrong with the server implementation

Have you tried it with longer audio?

@nortekax

nortekax commented Nov 23, 2023

Have you tried it with longer audio?

Longer audio always works well; the same problem that happens with server also happens with main. Only command works well with the short voice commands, but only after you say the first long sentence that command asks you to say.

Edit: I am using sox to make the wav file. @Azeirah, what do you use to make the wav file?

@nortekax

From @felrock

Ok, I've tried sending single-word .wav files to the server and it responds with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

From @ggerganov

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.

I am using the following (based on https://stackoverflow.com/questions/30006609/using-sox-for-voice-detection-and-streaming) to generate the wav:

sox -q -c 1 -r 16000 -d  -b 16 -r 16000 "$outwav"  silence 1 0.3 1% 1 0.3 1%

I play the wav and it is okay. How do you generate your wavs?

@felrock
Collaborator

felrock commented Nov 23, 2023

I'm using a Python script, which uses PyAudio. I record for about three seconds per .wav file.

@nortekax

Thanks, how do you detect voice?

@nortekax

Okay, I solved the problem. Posting it here for those interested: I simply padded the wav file with 500 ms of silence at the beginning and 1 second of silence at the end, and everything works fine now.
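For reference, the same padding workaround can be done without sox; a minimal sketch using only Python's wave module, with the padding amounts described above:

```python
import io
import wave

def pad_wav(src, lead_ms: int = 500, tail_ms: int = 1000) -> io.BytesIO:
    """Pad a PCM WAV file with silence at the start and end."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        frames = r.readframes(r.getnframes())
    bytes_per_frame = params.sampwidth * params.nchannels
    lead = b"\x00" * (params.framerate * lead_ms // 1000 * bytes_per_frame)
    tail = b"\x00" * (params.framerate * tail_ms // 1000 * bytes_per_frame)
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setparams(params)  # nframes is patched on close
        w.writeframes(lead + frames + tail)
    out.seek(0)
    return out
```

This keeps the sample rate and sample width of the input untouched and only prepends/appends zero samples.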

@nortekax

It would be really useful to add the grammar/commands.txt functionality from the command example to the server.

@ggerganov
Owner

Ah, sorry about that - I forgot there is logic to ignore sub-one-second audio:

whisper.cpp/whisper.cpp

Lines 5193 to 5198 in a5881d6

// of only 1 second left, then stop
if (seek + 100 >= seek_end) {
break;
}
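For context (my reading of the snippet, not stated in the thread): `seek` counts 10 ms frames, i.e. 100 frames per second, so the check stops processing whenever less than one second of audio remains, which matches the empty results on sub-second clips. A sketch of the arithmetic:

```python
FRAMES_PER_SECOND = 100  # assumption: seek counts 10 ms frames

def stops_early(seek: int, seek_end: int) -> bool:
    """Mirrors `if (seek + 100 >= seek_end) break;` from the excerpt."""
    return seek + FRAMES_PER_SECOND >= seek_end

# A 900 ms clip (seek_end = 90) is skipped before any decoding happens,
# while a 2 s clip (seek_end = 200) gets processed.
```

This also explains why padding the file past the one-second threshold made short commands work.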

@bobqianic
Collaborator

bobqianic commented Dec 10, 2023

Ah sorry, about that - I forgot there is logic to ignore sub one second audio:

whisper.cpp/whisper.cpp

Lines 5193 to 5198 in a5881d6

// of only 1 second left, then stop
if (seek + 100 >= seek_end) {
break;
}

BTW, why do we need this logic here?

See #1603
