Lack of GPU Parallelism for Real-Time Server Using Faster-Whisper #1192

dariopellegrino00 · 2024-12-06T13:48:00Z

Hi, I'm currently working on my thesis, which involves building a real-time transcription server using the whisper-streaming project with faster-whisper as the ASR backend. The server is deployed on an RTX 6000 Ada GPU, but I am struggling to achieve proper GPU parallelism.
I am relatively new to using Whisper and have only recently started using Python. I appreciate your patience and any guidance you can provide!

What I tried

Multiple Models on Multiple Threads:

I instantiated multiple WhisperModel instances (one per thread) and assigned each client to its own model. While this approach works for a few clients, performance degrades significantly beyond ~8 clients, regardless of the model size. From what I can observe, the models appear to be competing with one another for the GPU's resources.

Single Shared Model with num_workers:

I shared a single WhisperModel instance among multiple threads and used the num_workers parameter to enable concurrent processing. This approach also works well at first, but it likewise fails to scale beyond ~8 clients, with the same degradation.
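A minimal sketch of this shared-model setup, for reference. It assumes the `device`, `compute_type`, and `num_workers` arguments of the `WhisperModel` constructor and the `transcribe()` method from faster-whisper; the model size, thread count, and the helper names `load_model`, `transcribe_one`, and `transcribe_many` are illustrative and not the actual server code.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model(model_size="large-v2", num_workers=4):
    # Imported lazily so the helpers below can be exercised without a GPU.
    from faster_whisper import WhisperModel

    # Assumed configuration: one model on the GPU, several workers so that
    # concurrent transcribe() calls from different threads can overlap.
    return WhisperModel(
        model_size,
        device="cuda",
        compute_type="float16",
        num_workers=num_workers,
    )


def transcribe_one(model, audio_path):
    # transcribe() returns a lazy generator of segments plus an info object;
    # the decoding work happens while the generator is consumed.
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)


def transcribe_many(model, audio_paths, max_threads=4):
    # One thread per in-flight request, all sharing the same model instance.
    # pool.map keeps the results in the same order as audio_paths.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda path: transcribe_one(model, path), audio_paths))
```

Usage would be along the lines of `transcribe_many(load_model(), ["a.wav", "b.wav"])`. As far as the faster-whisper docstring describes it, `num_workers` only comes into play when `transcribe()` is called from several Python threads, as it is here.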

Is there a way to achieve true GPU parallelism for multiple audio sources on a single GPU using faster-whisper?
Does the num_workers parameter have any impact on GPU-based inference, or is it exclusively for CPU execution?
Are there recommended configurations or best practices for maximizing GPU utilization in scenarios with multiple concurrent audio streams?

Any advice or clarification would be greatly appreciated. Thank you for your amazing work on this project!


heimoshuiyu · 2024-12-06T15:42:40Z

Hello, I wrote a very simple FastAPI script to run faster-whisper. The script is available here: https://github.com/heimoshuiyu/whisper-fastapi

I use Docker to deploy my service. I have 4 RTX 4070 Ti Super GPUs, and I deploy 2 services on each GPU. So in total, I have 8 services. Each service is mapped to one client, and I set up the Grafana GPU monitor, which indicates all GPUs are utilized at 100%.

I am using the large-v2 model, and almost all of my transcription tasks involve audio longer than 1 to 3 minutes. I don't think the feature-extraction preprocessing wastes much GPU time. In my case, two services per GPU is enough. You might consider running more services per GPU if you are transcribing shorter audio.
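A rough sketch of the layout described above (4 GPUs, 2 services per GPU, 8 services in total). It assumes each service process is pinned to one GPU through the standard `CUDA_VISIBLE_DEVICES` environment variable; the function name `plan_services`, the service names, and the port numbers are hypothetical and are not taken from the linked repository.

```python
def plan_services(num_gpus=4, services_per_gpu=2, base_port=8000):
    """Return one entry per service, each pinned to a single GPU."""
    plan = []
    for gpu in range(num_gpus):
        for slot in range(services_per_gpu):
            index = gpu * services_per_gpu + slot
            plan.append(
                {
                    "name": f"whisper-{index}",  # hypothetical service name
                    "port": base_port + index,   # hypothetical port layout
                    # Each process only sees its assigned GPU.
                    "env": {"CUDA_VISIBLE_DEVICES": str(gpu)},
                }
            )
    return plan
```

Each entry can then be turned into a container or process launch with the given environment. Raising `services_per_gpu` corresponds to the suggestion of running more services per GPU for shorter audio.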

dariopellegrino00 · 2024-12-06T16:28:45Z

Thank you very much for the response. I will take a look at your solution as soon as I can.

MahmoudAshraf97 · 2024-12-06T19:54:27Z

@heimoshuiyu Unfortunately, that script is not utilizing the GPUs correctly, even if it shows 100% utilization.
The correct way to utilize multiple GPUs is to run a single model instance with multiple device indices and call the model.model.generate function asynchronously.
This is not implemented in faster-whisper because it needs batching to actually saturate the GPU, so sequential transcription will not benefit, and batched transcription needs the encoder output, which cannot utilize multiple GPUs efficiently. I'll try to think of a good implementation that uses multiple GPUs efficiently.
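For illustration only, a sketch of the "single model instance with multiple device indices" part. It assumes the `device_index` argument of `WhisperModel`, which accepts a list of GPU IDs; the helper names and default values are hypothetical. It does not show the asynchronous `model.model.generate` path, which, as noted above, is not implemented in faster-whisper.

```python
def multi_gpu_model_kwargs(gpu_ids, num_workers=1, compute_type="float16"):
    # Constructor arguments for one model instance spread over several GPUs.
    return {
        "device": "cuda",
        "device_index": list(gpu_ids),  # e.g. [0, 1, 2, 3]
        "num_workers": num_workers,
        "compute_type": compute_type,
    }


def load_multi_gpu_model(model_size, gpu_ids):
    # Imported lazily so the kwargs helper can be used without a GPU.
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, **multi_gpu_model_kwargs(gpu_ids))
```

With a model loaded this way, transcriptions would still need to be issued from multiple Python threads to run in parallel across the devices.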

dariopellegrino00 · 2024-12-12T09:04:49Z

Hi @MahmoudAshraf97, sorry to bother you. Do you have any suggestions on how to efficiently transcribe multiple audio sources in parallel using Faster Whisper? I’d appreciate any insights or recommendations.
