-> Using profile results from tests/planner/profiling_results/H200_TP1P_TP1D/
->
-> Interpolating prefill performance ...
-> Estimated TTFT=0.027s <= target TTFT=0.100s. Requests can queue 0.073s maximally while meeting TTFT SLA.
-> Estimated throughput: 110893.48 tokens/s/gpu. Request rate at 36.96 requests/s will saturate one GPU.
+# output:
+ISL=3000, OSL=300
+TTFT=0.1s, ITL=0.01s
+Using profile results from tests/planner/profiling_results/H200_TP1P_TP1D/
+
+Interpolating prefill performance ...
+Estimated TTFT=0.060s <= target TTFT=0.100s. Requests can queue 0.040s maximally while meeting TTFT SLA.
+Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.

 Interpolating decode performance ...
-> Average context length: isl + osl/2 = 3150.
-> Estimated ITL=0.0098s<= target ITL=0.0100s at 36.36% active kv usage.
-> Estimated throughput: 10009.88 token/s/gpu. Request rate at 33.37 requests/s will saturate one GPU.
+Average context length: isl + osl/2 = 3150.
+Estimated ITL=0.0097s<= target ITL=0.0100s at 16.16% active kv usage.
+Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
```

 ## Generating Load Dataset

 We provide a tool to generate load dataset with varying request rate. More details can be found in [sin_load_generator](../../benchmarks/sin_load_generator/README.md).

-From previous interpolator testing, ISL 3000 and OSL 300 can handle ~30 request/s/gpu for both prefill and decode.
-To test planner's performance for different request rates, we can generate a load dataset with request rate varying between 20 to 80 request/s.
+From previous interpolator testing, ISL 3000 and OSL 300 can handle ~15 request/s/gpu for both prefill and decode.
+To test planner's performance for different request rates, we can generate a load dataset with request rate varying between 12 to 36 request/s.
 For TP1 H200 engine, planner should scale between 1P1D and 3P3D.
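
The updated numbers in this diff can be sanity-checked with a little arithmetic. The sketch below is a back-of-envelope check, not the planner's actual scaling logic. It assumes the saturating request rate is the per-GPU token throughput divided by the tokens each request consumes in that phase (ISL for prefill, OSL for decode), which is inferred from the log values rather than taken from the interpolator source. The helper names `saturating_request_rate` and `replicas_needed` are hypothetical.

```python
import math

# Values copied from the updated interpolator output in this diff.
ISL, OSL = 3000, 300
PREFILL_TOKENS_PER_S_PER_GPU = 49481.09
DECODE_TOKENS_PER_S_PER_GPU = 4555.68


def saturating_request_rate(tokens_per_s_per_gpu: float, tokens_per_request: int) -> float:
    """Requests/s that saturate one GPU (assumed: throughput / tokens per request)."""
    return tokens_per_s_per_gpu / tokens_per_request


def replicas_needed(request_rate: float, per_gpu_rate: float) -> int:
    """Smallest replica count whose combined capacity covers the load (assumed rule)."""
    return max(1, math.ceil(request_rate / per_gpu_rate))


prefill_rate = saturating_request_rate(PREFILL_TOKENS_PER_S_PER_GPU, ISL)
decode_rate = saturating_request_rate(DECODE_TOKENS_PER_S_PER_GPU, OSL)
print(f"prefill saturates at {prefill_rate:.2f} req/s, decode at {decode_rate:.2f} req/s")

# Endpoints of the load range used for the generated dataset.
for load in (12, 36):
    p = replicas_needed(load, prefill_rate)
    d = replicas_needed(load, decode_rate)
    print(f"{load} req/s -> {p}P{d}D")
```

Under these assumptions the two rates reproduce the 16.49 and 15.19 requests/s reported in the log, and the 12 to 36 request/s range maps to 1P1D at the low end and 3P3D at the high end, consistent with the scaling range stated above.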