
[Bug]: [TPU] Prefix caching + w8a8 + long context results in degraded performance and corrupted output #12371

kiratp opened this issue Jan 23, 2025 · 7 comments

@kiratp

kiratp commented Jan 23, 2025

Your current environment

The environment is the TPU nightly Docker image.

Model Input Dumps

No response

🐛 Describe the bug

Model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
Machine: TPU v6e-8
Image: vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c

I would suggest running this in the TPU VM using tmux.

First, start the server:

docker run --privileged -it --network host --rm -v /dev/shm:/data -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} -e VLLM_XLA_CACHE_PATH=/data/jax --shm-size=10.24gb vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c python3 -m vllm.entrypoints.openai.api_server --host=0.0.0.0 --port=8000 --tensor-parallel-size=8 --max-model-len=65536 --gpu-memory-utilization=0.75 --max-num-seqs=32 --model=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --download-dir /data --disable-log-requests --enable_prefix_caching
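
Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint can confirm it is serving before starting the benchmark (a minimal sketch; the prompt and max_tokens are arbitrary):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8", "prompt": "Say hello.", "max_tokens": 16}'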

Run the benchmark from another container instance (in a second tmux pane):

docker run -it --rm --network host vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c

Then, inside that container:

python3 -m pip install -r requirements-test.txt
cd benchmarks
python3 benchmark_serving.py --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --dataset-name sonnet --dataset-path sonnet.txt --num-prompts 32 --sonnet-input-len 65536 --sonnet-output-len 4096 --sonnet-prefix-len 32768 --port 8000

Performance is really degraded:

[Screenshot: benchmark_serving.py results showing the degraded performance]

Under certain conditions that are difficult to replicate on demand but have occurred twice, the server eventually gets locked into a corrupted state and the client just gets max_tokens worth of garbage output:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<|reserved_special_token_247|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<|reserved_special_token_247|><|reserved_special_token_247|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [... repeated "!" and <|reserved_special_token_247|> tokens until max_tokens is reached ...]

kiratp added the bug label on Jan 23, 2025
@kiratp
Author

kiratp commented Jan 23, 2025

Related: when running these same tests with shorter sequences, quantized throughput is lower than unquantized.

Quantized:

[Screenshot: quantized model benchmark results]

Unquantized:

[Screenshot: unquantized model benchmark results]
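
For reference, a shorter-sequence run only changes the sonnet length flags on the benchmark command above; a minimal sketch (the lengths below are illustrative assumptions, not necessarily the values behind the screenshots):

python3 benchmark_serving.py --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 --dataset-name sonnet --dataset-path sonnet.txt --num-prompts 32 --sonnet-input-len 2048 --sonnet-output-len 256 --sonnet-prefix-len 512 --port 8000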

@bvrockwell
Contributor

bvrockwell commented Jan 23, 2025

cc @richardsliu @dyli-google

@robertgshaw2-redhat
Collaborator

Thanks for the report.

@robertgshaw2-redhat
Collaborator

Do you only see this issue with W8A8 compute?

@kiratp
Author

kiratp commented Jan 24, 2025

@robertgshaw2-redhat - IIRC from my notes, yes. However, if you can confirm this test matrix, I can get you a report for each config:

Quantization: on vs off (i.e. neuralmagic vs meta default model)
Prefix caching: on vs off.

Any other config?
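
A minimal sketch of driving that 2x2 matrix with the commands above (assumptions: the unquantized baseline is meta-llama/Meta-Llama-3.1-70B-Instruct, and the server is torn down between configurations):

IMAGE=vllm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c
for MODEL in neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 meta-llama/Meta-Llama-3.1-70B-Instruct; do
  for PREFIX in "--enable_prefix_caching" ""; do
    # start the server for this configuration (same flags as the server command above)
    docker run -d --name vllm-server --privileged --network host -v /dev/shm:/data \
      -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} -e VLLM_XLA_CACHE_PATH=/data/jax \
      --shm-size=10.24gb $IMAGE python3 -m vllm.entrypoints.openai.api_server \
      --host=0.0.0.0 --port=8000 --tensor-parallel-size=8 --max-model-len=65536 \
      --gpu-memory-utilization=0.75 --max-num-seqs=32 --model=$MODEL \
      --download-dir /data --disable-log-requests $PREFIX
    # wait for the server to come up, run benchmark_serving.py against it with --model $MODEL,
    # and save the report for this configuration
    docker rm -f vllm-server   # tear down before the next configuration
  done
done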

Note that we have a build from December from this branch that works fine for w8a8 (no prefix caching, of course): #10435

@dyli-google

@kiratp So this issue happens only when both quantization and prefix caching are on, right?

This makes sense, because each was tested separately. I don't think we tested them together.

Prefix caching: https://github.com/vllm-project/vllm/pull/10307
Quantization: https://github.com/vllm-project/vllm/pull/11785

Could you please send me and @robertgshaw2-redhat the detailed per-config reports? Thanks.

@miladm
Collaborator

miladm commented Jan 25, 2025

cc @lsy323
