I ran some performance tests on a 3.5B BLOOM 1-GPU model using perf_analyzer. The results:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 6533769 |
| 2 | 2819328 |
| 4 | 2953732 |
Then I tested a 2-GPU version of the same model:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 1769113 |
| 2 | 3032188 |
| 4 | 3461972 |
The 1-GPU model is much slower at batch size 1 than at batch size 2 or 4, and the 2-GPU model is faster than the 1-GPU model at batch size 1 but slower at batch sizes greater than 1. How can these results be explained? Please help.
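For context, converting the reported average latencies into throughput (batch size / latency) may make the comparison easier to read; a minimal sketch using only the numbers above:

```python
# Throughput (inferences/sec) derived from the measured average latencies above.
latencies_us = {
    "1-gpu": {1: 6_533_769, 2: 2_819_328, 4: 2_953_732},
    "2-gpu": {1: 1_769_113, 2: 3_032_188, 4: 3_461_972},
}

for setup, results in latencies_us.items():
    for batch, lat_us in results.items():
        throughput = batch / (lat_us / 1e6)  # requests completed per second
        print(f"{setup} batch={batch}: {throughput:.3f} infer/s")
```

Viewed this way, batch 2 on the 1-GPU model yields several times the throughput of batch 1, which makes the batch-1 number look like the outlier.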
GPU: Tesla T4
CUDA Version: 11.8
model config: almost the same as all_models/bloom
input data:
perf_analyzer command