Hi @ZewiHugo, I am also interested in finding this out. Did you try benchmarking with whisper.cpp or faster-whisper? Also, do you know of any project that enables real-time streaming transcription with Whisper?
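For comparison, a minimal faster-whisper timing loop could look roughly like the sketch below (the model name, file path, and decoding options are placeholders; a distil checkpoint would additionally need a CTranslate2 conversion):

```python
import time

from faster_whisper import WhisperModel

# Placeholder model; swap in a CTranslate2-converted distil checkpoint if desired.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav", beam_size=1)
text = " ".join(seg.text for seg in segments)  # segments is lazy; joining forces decoding
print(f"{info.duration:.1f}s of audio transcribed in {time.perf_counter() - start:.2f}s")
```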
-
A10 & H100 are more suitable for training.
-
Hello everyone, I'm trying to determine the latency I'll get when using a Whisper model to serve hundreds of requests at the same time. Ultimately I want to figure out how many GPUs I need to serve 100 simultaneous requests with < 1 s latency, assuming each request contains < 30 seconds of audio.
I ended up testing on 3 different devices and got unexpected results. For my tests, I'm using Hugging Face's transformers library, as it seems to be one of the fastest implementations I could find (with flash attention and batch processing). I'm using the distil-whisper/distil-medium.en model to further speed up inference.
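For reference, the kind of setup described above might be sketched as follows (this is not the exact test code; the batch size, file glob, and flash-attention flag are illustrative, and flash attention additionally requires the flash-attn package):

```python
import glob
import time

import torch
from transformers import pipeline

# distil-whisper in fp16 on GPU; attn_implementation is optional and needs flash-attn installed.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

files = sorted(glob.glob("audio/*.wav"))  # placeholder: the 50 long + 50 short clips
start = time.perf_counter()
outputs = asr(files, batch_size=16)       # illustrative batch size
elapsed = time.perf_counter() - start
print(f"{len(files)} files in {elapsed:.2f}s ({elapsed / len(files) * 1000:.1f} ms/file)")
```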
The task is to recognize 50 long .wav files (30 seconds each) and 50 short .wav files (3 seconds each).
I tested in 3 different environments.
The above results confused me. Considering that the computation mainly relies on fp16, and the RTX 4080 Super, A10, and H100 offer 52.22 / 125 / 1513 TFLOPS respectively, I would expect the inference speed to rank H100 > A10 > RTX 4080, unless I've misunderstood something.
Do the above results make sense? Could anyone explain to me why this may happen? My code for testing is pasted below.
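As a rough way to turn a measured per-batch latency into a GPU count for the 100-request / < 1 s target, a helper along these lines might work (the numbers in the example call are placeholders, not measured results):

```python
import math

def gpus_needed(batch_latency_s: float, batch_size: int,
                target_latency_s: float = 1.0, concurrent_requests: int = 100) -> int:
    """Rough GPU estimate, assuming requests are served in fixed-size batches
    and each GPU processes one batch at a time, back to back."""
    # Batches a single GPU can finish within the latency budget.
    batches_per_gpu = max(1, math.floor(target_latency_s / batch_latency_s))
    requests_per_gpu = batches_per_gpu * batch_size
    return math.ceil(concurrent_requests / requests_per_gpu)

# Placeholder numbers: if a batch of 16 clips takes 0.4 s, one GPU finishes
# 2 batches (32 requests) within the 1 s budget, so 100 concurrent requests
# would need ceil(100 / 32) = 4 GPUs.
print(gpus_needed(0.4, 16))
```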