[Bug]: [TPU] Prefix caching + w8a8 + long context results in degraded performance and corrupted output #12371
Comments
Thanks for the report.

Do you only see this issue with W8A8 compute?
@robertgshaw2-redhat - IIRC from my notes, yes. However, if you can confirm this test matrix, I can get you a report for each config:

- Quantization: on vs. off (i.e. the neuralmagic model vs. the default Meta model)
- Any other config?

Note that we have a build from December from this branch that works fine for W8A8 (no prefix caching, of course): #10435
@kiratp So this issue happens only when both quantization and prefix caching are enabled, right? That makes sense, because each was tested separately; I don't think we tested them together. Prefix caching: #10307. Could you please send me and @robertgshaw2-redhat the more concrete reports? Thanks.
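The test matrix above (quantization on/off × prefix caching on/off) could be driven by a small script. This is only a sketch: the base-model name is an assumption, and the `--enable-prefix-caching` / `--no-enable-prefix-caching` flags are assumed to match the `vllm serve` CLI of the build under test. The actual `vllm serve` invocation is left commented out since each run ties up the TPU.

```shell
# Hypothetical sweep over the two toggles discussed above.
QUANT_MODEL="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8"
BASE_MODEL="meta-llama/Llama-3.1-70B-Instruct"  # assumed "meta default" model

for MODEL in "$QUANT_MODEL" "$BASE_MODEL"; do
  for PREFIX in on off; do
    if [ "$PREFIX" = "on" ]; then
      EXTRA="--enable-prefix-caching"
    else
      EXTRA="--no-enable-prefix-caching"
    fi
    echo "config: model=$MODEL prefix_caching=$PREFIX"
    # vllm serve "$MODEL" $EXTRA   # run one config at a time on the TPU VM
  done
done
```

Each of the four printed configs would then be benchmarked separately and the results attached to this issue.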
cc @lsy323 |
Your current environment
The environment is the TPU nightly Docker image.
Model Input Dumps
No response
🐛 Describe the bug
Model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
Machine: TPU v6e-8
Image: vlm/vllm-tpu:2fc6944c5e69d5d0ce15d09a855452c795d75c3c
I would suggest running this inside the TPU VM under tmux.
First, start the server
Run the benchmark from another container instance (tmux pane)
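The exact commands did not survive the page scrape, so here is a plausible sketch of the two steps, assuming vLLM's standard `vllm serve` entrypoint and the in-tree `benchmarks/benchmark_serving.py` script; the port, dataset, and request counts are illustrative, not from the original report.

```shell
# Pane 1: start the OpenAI-compatible server with the quantized model
# (prefix caching flag assumed; long context to match the report).
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 \
    --enable-prefix-caching \
    --max-model-len 32768 \
    --port 8000

# Pane 2 (second tmux pane / container): run the serving benchmark
# against the server started above.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 \
    --port 8000 \
    --num-prompts 100
```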
Performance is severely degraded:
Under certain conditions that are difficult to reproduce on demand, but which have occurred twice, the server eventually gets locked into a corrupted state and the client just receives `max_tokens` worth of garbage output.