
[Bug]: [TPU] Prefix caching + w8a8 + long context results in degraded performance and corrupted output #12371

kiratp opened this issue Jan 23, 2025 · 7 comments

@kiratp

kiratp commented Jan 23, 2025

Your current environment

The environment is the TPU nightly Docker image.

Model Input Dumps

No response

🐛 Describe the bug

Model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
Machine: TPU v6e-8
Image: vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c

I would suggest running this in the TPU VM using tmux.

First, start the server:

docker run --privileged -it --network host --rm -v /dev/shm:/data -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} -e VLLM_XLA_CACHE_PATH=/data/jax --shm-size=10.24gb vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c python3 -m vllm.entrypoints.openai.api_server --host=0.0.0.0 --port=8000 --tensor-parallel-size=8 --max-model-len=65536 --gpu-memory-utilization=0.75 --max-num-seqs=32 --model=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --download-dir /data --disable-log-requests --enable_prefix_caching
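
Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint can confirm it is serving before starting the benchmark (a minimal sketch; the prompt and max_tokens are arbitrary):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8", "prompt": "Say hello.", "max_tokens": 16}'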

Run the benchmark from another container instance (in a second tmux pane):

docker run -it --rm --network host vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c

Then, inside that container:

python3 -m pip install -r requirements-test.txt
cd benchmarks
python3 benchmark_serving.py --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --dataset-name sonnet --dataset-path sonnet.txt --num-prompts 32 --sonnet-input-len 65536 --sonnet-output-len 4096 --sonnet-prefix-len 32768 --port 8000

Performance is really degraded:

[Screenshot: benchmark_serving.py results showing the degraded performance]

Under certain conditions that are difficult to replicate on demand but have occurred twice, the server eventually gets locked into a corrupted state and the client just gets max_tokens worth of garbage output:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<|reserved_special_token_247|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<|reserved_special_token_247|><|reserved_special_token_247|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [... repeated "!" and <|reserved_special_token_247|> tokens until max_tokens is reached ...]

kiratp added the bug label on Jan 23, 2025
@kiratp
Author

kiratp commented Jan 23, 2025

Related: when running these same tests with shorter sequences, quantized throughput is lower than unquantized.

Quantized:

[Screenshot: quantized model benchmark results]

Unquantized:

[Screenshot: unquantized model benchmark results]
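
For reference, a shorter-sequence run only changes the sonnet length flags on the benchmark command above; a minimal sketch (the lengths below are illustrative assumptions, not necessarily the values behind the screenshots):

python3 benchmark_serving.py --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --dataset-name sonnet --dataset-path sonnet.txt --num-prompts 32 --sonnet-input-len 2048 --sonnet-output-len 256 --sonnet-prefix-len 512 --port 8000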

@bvrockwell
Contributor

bvrockwell commented Jan 23, 2025

cc @richardsliu @dyli-google

@robertgshaw2-redhat
Collaborator

Thanks for the report.

@robertgshaw2-redhat
Collaborator

Do you only see this issue with W8A8 compute?

@kiratp
Author

kiratp commented Jan 24, 2025

@robertgshaw2-redhat - IIRC from my notes, yes. However, if you can confirm this test matrix, I can get you a report for each config:

Quantization: on vs off (i.e. neuralmagic vs meta default model)
Prefix caching: on vs off.

Any other config?
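
A minimal sketch of driving that 2x2 matrix with the commands above (assumptions: the unquantized baseline is meta-llama/Meta-Llama-3.1-70B-Instruct, and the server is torn down between configurations):

IMAGE=vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c
for MODEL in neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 meta-llama/Meta-Llama-3.1-70B-Instruct; do
  for PREFIX in "--enable_prefix_caching" ""; do
    # start the server for this configuration (same flags as the server command above)
    docker run -d --name vllm-server --privileged --network host -v /dev/shm:/data \
      -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} -e VLLM_XLA_CACHE_PATH=/data/jax \
      --shm-size=10.24gb $IMAGE python3 -m vllm.entrypoints.openai.api_server \
      --host=0.0.0.0 --port=8000 --tensor-parallel-size=8 --max-model-len=65536 \
      --gpu-memory-utilization=0.75 --max-num-seqs=32 --model=$MODEL \
      --download-dir /data --disable-log-requests $PREFIX
    # wait for the server to come up, run benchmark_serving.py against it with --model $MODEL,
    # and save the report for this configuration
    docker rm -f vllm-server   # tear down before the next configuration
  done
done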

Note that we have a build from December from this branch that works fine for w8a8 (no prefix caching, of course): #10435

@dyli-google

@kiratp So this issue happens only when both quantization and prefix caching are on, right?

This makes sense, because each was tested separately. I don't think we tested them together.

Prefix caching: https://github.com/vllm-project/vllm/pull/10307
Quantization: https://github.com/vllm-project/vllm/pull/11785

Could you please send me and @robertgshaw2-redhat the detailed per-config reports? Thanks.

@miladm
Collaborator

miladm commented Jan 25, 2025

cc @lsy323
