Hi @ZewiHugo, I am also interested in finding this out. Did you try benchmarking with whisper.cpp or faster-whisper? Also, do you know of any project that enables real-time streaming transcription with Whisper?
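For comparison, a minimal faster-whisper timing loop could look roughly like the sketch below (the model name, file path, and decoding options are placeholders; a distil checkpoint would additionally need a CTranslate2 conversion):

```python
import time

from faster_whisper import WhisperModel

# Placeholder model; swap in a CTranslate2-converted distil checkpoint if desired.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav", beam_size=1)
text = " ".join(seg.text for seg in segments)  # segments is lazy; joining forces decoding
print(f"{info.duration:.1f}s of audio transcribed in {time.perf_counter() - start:.2f}s")
```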
-
A10 & H100 are more suitable for training.
-
Hello everyone, I'm trying to determine the latency I'll get when using a Whisper model to serve hundreds of requests at the same time. Ultimately I want to figure out how many GPUs I need to serve 100 simultaneous requests with < 1 s latency, assuming each request contains < 30 seconds of audio.
I ended up testing on 3 different devices and got unexpected results. For my tests, I'm using Hugging Face's transformers library, as it seems to be one of the fastest implementations I could find (with flash attention and batch processing). I'm using the distil-whisper/distil-medium.en model to further speed up inference.
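For reference, the kind of setup described above might be sketched as follows (this is not the exact test code; the batch size, file glob, and flash-attention flag are illustrative, and flash attention additionally requires the flash-attn package):

```python
import glob
import time

import torch
from transformers import pipeline

# distil-whisper in fp16 on GPU; attn_implementation is optional and needs flash-attn installed.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

files = sorted(glob.glob("audio/*.wav"))  # placeholder: the 50 long + 50 short clips
start = time.perf_counter()
outputs = asr(files, batch_size=16)       # illustrative batch size
elapsed = time.perf_counter() - start
print(f"{len(files)} files in {elapsed:.2f}s ({elapsed / len(files) * 1000:.1f} ms/file)")
```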
The task is to recognize 50 long .wav files (30 seconds each) and 50 short .wav files (3 seconds each).
I tested in 3 different environments.
The above results confused me. Considering that the computation mainly relies on fp16, and the RTX 4080 Super, A10, and H100 offer 52.22 / 125 / 1513 TFLOPS respectively, I would expect the inference speed to rank H100 > A10 > RTX 4080, unless I've misunderstood something.
Do the above results make sense? Could anyone explain to me why this may happen? My code for testing is pasted below.
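As a rough way to turn a measured per-batch latency into a GPU count for the 100-request / < 1 s target, a helper along these lines might work (the numbers in the example call are placeholders, not measured results):

```python
import math

def gpus_needed(batch_latency_s: float, batch_size: int,
                target_latency_s: float = 1.0, concurrent_requests: int = 100) -> int:
    """Rough GPU estimate, assuming requests are served in fixed-size batches
    and each GPU processes one batch at a time, back to back."""
    # Batches a single GPU can finish within the latency budget.
    batches_per_gpu = max(1, math.floor(target_latency_s / batch_latency_s))
    requests_per_gpu = batches_per_gpu * batch_size
    return math.ceil(concurrent_requests / requests_per_gpu)

# Placeholder numbers: if a batch of 16 clips takes 0.4 s, one GPU finishes
# 2 batches (32 requests) within the 1 s budget, so 100 concurrent requests
# would need ceil(100 / 32) = 4 GPUs.
print(gpus_needed(0.4, 16))
```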