It really depends on how you do the cost-benefit analysis. I assume the server keeps the model hot in RAM, since with small.en loaded it sits at 1.1 GB of idle memory. On the other hand, say you only use it a few times per day: does it make sense to keep it in memory all day long? Maybe, if you have plenty of RAM. But other people probably need that RAM for other interesting things they do.
And really, what exactly are we saving here? Even a garbage-tier SSD can do a 1 GB sequential read to load the model in less than a second. Is that really what you want to optimise for?
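For a rough sanity check, here's a minimal sketch that times a sequential read of the model file (the path is an assumption; point it at wherever your ggml model actually lives):

```python
import time

# Hypothetical path to the ggml small.en model (~1 GB on disk);
# adjust to your setup.
MODEL_PATH = "models/ggml-small.en.bin"

start = time.perf_counter()
with open(MODEL_PATH, "rb") as f:
    # Read in 64 MiB chunks to approximate a bulk sequential read.
    while f.read(64 * 1024 * 1024):
        pass

print(f"sequential read took {time.perf_counter() - start:.2f}s")
```

Two caveats: this measures raw disk I/O only, not whatever tensor setup whisper.cpp does on top of it, and after the first run the OS page cache will make the read near-instant anyway, which further weakens the case for keeping a resident server just to skip the load.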
The program "server" as the name suggests is appropriate for people who self-hosts a transcription service on a dedicated hardware for lots of customers, I don't really see how it makes any sense for a normal desktop user to do it this way.
Also, I don't see what this has to do with whisperlive. That issue is for people who want real-time transcription. The "server" is not real-time; it does a single request/response loop, and the only real difference is the IPC transport (TCP vs. unix pipes).
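For reference, one such request/response cycle looks roughly like this. This is a sketch assuming the HTTP defaults of whisper.cpp's `examples/server` (host, port, and the `/inference` endpoint); check your build's `--help` output for the actual flags:

```python
import requests

# Single request/response cycle against a running whisper.cpp server.
# Host, port, endpoint, and form fields assume the example's defaults.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8080/inference",
        files={"file": f},
        data={"response_format": "json"},
    )

resp.raise_for_status()
print(resp.json())
```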
whisper.cpp ships with a server. Isn't using that faster than loading the model again for each request?
Doing this should be much easier than #22.