Commit 186aeb0

Update latency test script due to deprecation in vllm (#2973)
Summary:

For evaluating latency we currently use `python benchmarks/benchmark_latency.py`, but it was recently deprecated:

```
DEPRECATED: This script has been moved to the vLLM CLI.

Please use the following command instead:
    vllm bench latency

For help with the new command, run:
    vllm bench latency --help

Alternatively, you can run the new command directly with:
    python -m vllm.entrypoints.cli.main bench latency --help
```

So we updated the script to use `vllm bench latency` instead.

Test Plan:

```
sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B
```

Reviewers:

Subscribers:

Tasks:

Tags:
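For quick reference, a minimal before/after sketch of the migration (the input/output lengths and model below are illustrative placeholders, not values taken from this script; `vllm bench latency --help` lists the full flag set):

```sh
# Old, now-deprecated entry point (standalone benchmark script):
python benchmarks/benchmark_latency.py \
  --input-len 256 --output-len 256 --model Qwen/Qwen3-8B --batch-size 1

# New entry point via the vLLM CLI, same flags:
vllm bench latency \
  --input-len 256 --output-len 256 --model Qwen/Qwen3-8B --batch-size 1

# Equivalent direct module invocation, per the deprecation notice:
python -m vllm.entrypoints.cli.main bench latency --help
```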
1 parent 83e8e60 commit 186aeb0

File tree

1 file changed: +1 −1 lines changed

.github/scripts/torchao_model_releases/eval_latency.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -75,7 +75,7 @@ for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
   for BATCH_SIZE in "${BATCH_SIZE_ARRAY[@]}"; do
     OUTPUT_FILE="$ORIG_DIR/${SAFE_MODEL_ID}_latency_batch${BATCH_SIZE}_in${INPUT_LEN}_out${OUTPUT_LEN}.log"
     echo "Running latency eval for model $MODEL_ID with batch size $BATCH_SIZE with input length: $INPUT_LEN and output length: $OUTPUT_LEN"
-    VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
+    VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
     echo "Latency eval result saved to $OUTPUT_FILE"
   done
   echo "======================== Eval Latency $MODEL_ID End ========================="
```
