
Installing on either Mac or Linux #3

Open
chadexplains opened this issue May 23, 2023 · 11 comments

Comments

@chadexplains

A few questions:

  1. Does this need a GPU, or is performance OK on a CPU?
  2. Where do you set the API key? It would be good to add that to the README.
  3. I can't build on Mac or Linux yet - different errors on each, but I'll keep this thread updated as I go.
@nalbion
Owner

nalbion commented May 23, 2023

It works pretty well on a CPU (if you can get it working).

You don't need to use an API key, because you're running the server locally and it doesn't need to connect to OpenAI.

I was able to build on Linux with the Dockerfile, but I had issues mounting my audio devices into the container (even though I can run whisper.cpp in an Ubuntu Docker container through WSL).
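For anyone else hitting the same wall: on Linux the usual way to expose the host's ALSA devices to a container is the --device flag, e.g.

docker run --device /dev/snd <your-image>

(the image name is a placeholder; under WSL this still may not be enough, since audio passthrough there is more complicated).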

@chadexplains
Author

I'll try Ubuntu on my t2.medium EC2 instance.

A little confused about the underlying model being used - you're saying it's served locally, so did OpenAI allow Whisper to be downloaded and used? I had assumed all Whisper access was API-only (like all the GPT models).

To simplify my ask: what's your recommendation for how to spin up this server? Is it Linux + Docker, and should that hopefully just work?

And does it use faster-whisper by default (with the option to switch to whisper.cpp)?

@nalbion
Owner

nalbion commented May 23, 2023

If you plan to run it on an EC2 instance, your needs are different from mine. I run it locally to provide speech-to-text to a non-Python app. It could be on the same server or the same LAN, but I'm not streaming audio from the client to whisper_server - whisper_server owns the mic.

You could adapt this to support streaming audio from the client, but you'd also have to make other changes to support multiple users (rough sketch after this list):

  • you'd need to run a new run_whisper_loop for each user connection.
  • you may need different DecodingOptions for task, language, context
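For example, something along these lines (a rough sketch, not this repo's actual code - `run_whisper_loop` here is a simplified stand-in, and `audio_queue` is a hypothetical per-client queue of 16 kHz mono float32 audio):

```python
import threading
import whisper

model = whisper.load_model("base")  # shared model, loaded once

def run_whisper_loop(audio_queue, task="transcribe", language="en"):
    # each connection gets its own DecodingOptions (task/language can differ per user)
    options = whisper.DecodingOptions(task=task, language=language, fp16=False)
    while True:
        audio = audio_queue.get()        # e.g. a float32 numpy array at 16 kHz
        if audio is None:                # sentinel: client disconnected
            break
        audio = whisper.pad_or_trim(audio)
        mel = whisper.log_mel_spectrogram(audio).to(model.device)
        result = whisper.decode(model, mel, options)
        print(result.text)               # or send the text back to that client

# one loop per connected user:
# threading.Thread(target=run_whisper_loop, args=(user_queue,), daemon=True).start()
```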

@chadexplains
Author

Got it, let me clarify more since it's always helpful to provide as much context as possible to LLMs.

I'm working on a web-based (desktop for now) application that does real-time voice capture (user in the browser) and transcription -> response from an LLM -> back to the browser as text.

Both the web server and the transcription should run on an EC2 instance. Ideally both are in Python (and connect to a React app on the client, though this part is likely not relevant).

It needs to be close to real time (say 5-20s voice clips from the user, with 5-20s end-to-end latency being acceptable).

Happy to follow your guidance on how I should be thinking about this. Here's my current view:

  • Python web server
  • OpenAI-based LLM (for both the generation and transcription parts - rough sketch below)
  • probably using faster-whisper (for latency?)
  • maybe stream transcription, but even waiting the full 20s for the user's clip and transcribing it whole hog would work
  • multiple users are expected
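Roughly, the transcription + generation steps would look something like this (a rough sketch with the pre-1.0 openai Python client; the file path and model names are placeholders, and this assumes the hosted endpoints rather than a local/faster-whisper model):

```python
import openai  # openai<1.0 client, current as of mid-2023

def clip_to_reply(path_to_wav: str) -> str:
    # 1. transcription via the hosted Whisper endpoint
    with open(path_to_wav, "rb") as f:
        transcript = openai.Audio.transcribe("whisper-1", f)["text"]
    # 2. generation: feed the transcript to a chat model
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": transcript}],
    )
    return response["choices"][0]["message"]["content"]
```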

WDYT?

@nalbion
Owner

nalbion commented May 23, 2023

Do you need to implement your own speech-to-text, rather than use the Web Speech API?

https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

@chadexplains
Author

Assume yes :)

@nalbion
Owner

nalbion commented May 23, 2023

Can you get the build to work if you delete requirements.txt and run

python -m piptools compile requirements.in --resolver=backtracking

@chadexplains
Author

chadexplains commented May 24, 2023

I'll try that - but did you have a general recommendation on the design I want here?

For example: I have a Python web server already -- should I be looking into another solution instead?

@nalbion
Owner

nalbion commented May 24, 2023

If you specifically want to use Whisper, this repo would give you a good starting point.

You'd probably want to stream audio using WebRTC and run a separate thread/process for each connection. This code demonstrates that you don't have to wait for the full 20/30 seconds.
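For instance, something like this (an illustrative sketch, not this repo's code - it assumes whatever receives the WebRTC stream pushes decoded 16 kHz mono float32 samples onto a per-connection queue):

```python
import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000
CHUNK_SECONDS = 5  # emit partial text every ~5 s instead of waiting 20-30 s

def transcribe_stream(packet_queue):
    buffer = np.zeros(0, dtype=np.float32)
    while True:
        samples = packet_queue.get()      # next batch of decoded audio samples
        if samples is None:               # client disconnected
            break
        buffer = np.concatenate([buffer, samples])
        if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
            text = model.transcribe(buffer, fp16=False)["text"]
            print(text)                   # push the partial transcript to the client
            buffer = np.zeros(0, dtype=np.float32)
```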

@chadexplains
Author

That makes sense - seems like faster-whisper integration is what I'd want.

One nit: why a separate thread/process for each connection -- is that so audio transcription is non-blocking for the web server? I'm on Flask right now, so it's sync and single-threaded IIRC. Will I get poor performance if I just synchronously try to handle the WebRTC audio -> OpenAI RPC?

It's not clear to me whether Python will automatically release the GIL on the RPC and therefore "just work" without a performance hit.

Super curious how you think about this.

@nalbion
Owner

nalbion commented May 24, 2023

TBH, I'm not sure whether your web server handles each request on a separate thread; from what you say, I'm guessing it's similar to NodeJS?

NodeJS will try to handle other requests while one request is waiting on I/O, and maybe Whisper would similarly allow other requests to use the CPU while GPU operations are being performed. You'd have to do some load testing to ensure that the web server can still handle regular API & static content requests while multiple users are having speech-to-text processed.
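Something like this would be a starting point for that test (a minimal sketch, untested against this repo - the endpoint name and model size are placeholders):

```python
import tempfile

import whisper
from flask import Flask, jsonify, request

app = Flask(__name__)
model = whisper.load_model("base")

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # assumes the browser POSTs a finished 5-20 s clip as a file upload
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        request.files["audio"].save(tmp.name)
        result = model.transcribe(tmp.name)
    return jsonify({"text": result["text"]})

if __name__ == "__main__":
    # threaded=True lets the dev server keep answering other requests while one
    # is transcribing (PyTorch releases the GIL for most of the heavy work);
    # for real load, run behind gunicorn/uwsgi and measure.
    app.run(threaded=True)
```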
