[Bug]: prefix-caching: inconsistent completions #5543
Comments
We have an improved block manager which has better test coverage for prefix caching. We have tests which compare equality of prefix caching vs non-prefix caching, so this case shouldn't happen; if it is happening, we can more easily diagnose the failure. Note the v2 block manager is not yet optimized for performance. Can you see if it occurs with the v2 block manager (`--use-v2-block-manager`)?
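For anyone wanting to run such an equality check quickly, here is a minimal offline sketch. The model name, prompt, and the `enable_prefix_caching` / `use_v2_block_manager` engine arguments are assumptions based on vLLM releases from this period, not part of the original comment:

```python
# Sketch (not from the thread): compare greedy outputs with prefix caching disabled
# vs. enabled on the v2 block manager. With greedy sampling the two runs should agree.
from vllm import LLM, SamplingParams

prompts = ["The quick brown fox jumps over the lazy dog. Explain why this sentence is often used."]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy decoding

def run(**engine_kwargs):
    # If GPU memory is tight, run each configuration in a separate process.
    llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.4, **engine_kwargs)
    return [o.outputs[0].text for o in llm.generate(prompts, params)]

baseline = run(enable_prefix_caching=False)
v2_cached = run(enable_prefix_caching=True, use_v2_block_manager=True)
print("identical" if baseline == v2_cached else f"mismatch:\n{baseline}\nvs\n{v2_cached}")
```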
Thanks for the reply @cadedaniel. Edit: Tried also building current …
I also built the branch from #5188, and it doesn't resolve the issue.
Possible workaround: #5376 (comment)
Thanks @colefranks. On the first iteration, there is a difference in outputs between …
Is this issue solved? I'm hitting the same problem: inconsistent completions.
The same thing happened when I replaced the model with OPT-125m and ran offline inference. However, when I inserted `torch.manual_seed()` (not `random.seed`) before `generate`, the result was correct.
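A rough sketch of that experiment (offline OPT-125m with prefix caching enabled, seeding torch before each `generate` call) might look like the following; the prompt and seed value are illustrative assumptions:

```python
# Sketch of the workaround described above: seed torch's RNG immediately before
# generate() when prefix caching is enabled, then check that repeated runs agree.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=32)
prompt = ["Prefix caching should not change the generated text for identical prompts."]

torch.manual_seed(0)  # per the comment above, torch.manual_seed (not random.seed)
first = llm.generate(prompt, params)[0].outputs[0].text

torch.manual_seed(0)  # re-seed before the second call as well
second = llm.generate(prompt, params)[0].outputs[0].text

print("consistent" if first == second else "inconsistent")
```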
@hibukipanim @kuangdao @SaltFish11 I solved the problem by changing the Triton code.
Thanks @bsll, but I struggle to understand what exactly needs to be changed.
Thanks @bsll for the workaround. @hibukipanim, the location is like …
Thanks @bsll & @LLouice. To be more detailed with what I did, I changed these lines:

```python
if is_hip():
    ret = subprocess.check_call([
        cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
        f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
    ])
else:
    cc_cmd = [
        cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
        "-o", so
    ]
    cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
    ret = subprocess.check_call(cc_cmd)
```

to these lines:

```python
if is_hip():
    ret = subprocess.check_call([
        cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-std=c99",
        f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
    ])
else:
    cc_cmd = [
        cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-std=c99",
        "-o", so
    ]
    cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
    ret = subprocess.check_call(cc_cmd)
```

i.e.:

```
97c97
<         cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
---
>         cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-std=c99",
102c102
<         cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
---
>         cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-std=c99",
```

I also deleted … and must say it's quite surprising that changing the gcc dialect to C99 (`-std=c99`) would change behavior (but cool if it does...).
@SaltFish11 thanks for the comment. However, I tried adding:

```python
import torch
torch.manual_seed(42)
```

at the top of the script, and it didn't help.
Same here: adding "-std=c99" still produces strange output, repeating with 3 to 4 spaces. Looking forward to further solutions.
FYI - I tried it now with the commit that merged your PR (fb2c1c8) and also with the current HEAD (9fadc7b), and unfortunately the snippet from the issue still fails with both. (I double-checked by running the same versions without `--enable-prefix-caching`, where it passes.)
Have you tried meaningful inputs (like real sentences) instead of random numbers? I wonder if it is just caused by a minor kernel execution difference after the cache is hit.
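One way to probe that hypothesis is to send the same realistic prompt to a prefix-caching engine twice, so the second request is served after the cache has been populated, and compare the greedy completions. A minimal sketch (model and prompt are assumptions, not taken from the thread):

```python
# Sketch: with greedy sampling, a cache-cold and a cache-warm request for the same
# realistic prompt should produce identical completions.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)
prompt = ["The Eiffel Tower was completed in 1889 and remains one of the most visited monuments in the world."]

cold = llm.generate(prompt, params)[0].outputs[0].text  # first request: prefix cache is empty
warm = llm.generate(prompt, params)[0].outputs[0].text  # second request: prefix blocks should be reused

print("match" if cold == warm else f"mismatch after cache hit:\n{cold!r}\nvs\n{warm!r}")
```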
@zachzzc Yes, as mentioned in the description, we first noticed this as a significant degradation of our internal eval scores on real prompts; the snippet here is just a minimal reproduction.
If you still see the degradation in the real evals then it would be a true bug. It calls the same kernel with different inputs here: vllm/vllm/attention/backends/flash_attn.py, line 532 (at commit 5223199).
+1, still seeing this in 0.5.2.
Your current environment
🐛 Describe the bug
Hi,
Seems that there is a dirty-cache issue with `--enable-prefix-caching`. We noticed it as we saw internal eval scores significantly degrade when running with `--enable-prefix-caching`, and here I'll show how to reproduce it with a short snippet.
Running 2 vLLM servers with:
without prefix caching:
and another with prefix caching:
Then running this snippet:
prints:
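The exact launch commands, snippet, and output are not reproduced above. As a rough illustration of the kind of comparison being described, the sketch below queries two OpenAI-compatible vLLM servers with an identical prompt and diffs the completions; the model name and ports are assumptions, not the original setup:

```python
# Sketch of the two-server comparison. Assumed setup (not the original commands):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8000
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8001 --enable-prefix-caching
import requests

MODEL = "facebook/opt-125m"  # assumption; use whatever model the servers were started with
PROMPT = "Summarize the plot of Hamlet in three sentences."

def complete(port: int) -> str:
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 128, "temperature": 0},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

no_cache = complete(8000)    # server without prefix caching
complete(8001)               # first call warms the prefix cache on the caching server
with_cache = complete(8001)  # second call should hit cached prefix blocks
print("identical" if no_cache == with_cache else f"MISMATCH:\n{no_cache!r}\nvs\n{with_cache!r}")
```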
This happens also with 0.4.3. With 0.4.2 this snippet crashes the server with prefix-caching enabled.
Hopefully one of these PRs resolves the issue 🤞:
(I will be able to try building these branches and reproducing only in a few days; hope tagging the PRs can help till then.)
Edit: built and tried both PRs and they don't resolve the issue