[Bug]: prefix-caching: inconsistent completions #5543

Open
hibukipanim opened this issue Jun 14, 2024 · 18 comments
Labels
bug Something isn't working

Comments

hibukipanim commented Jun 14, 2024

Your current environment

vLLM version 0.5.0.post1

🐛 Describe the bug

Hi,

It seems there is a dirty-cache issue with --enable-prefix-caching. We noticed it when internal eval scores degraded significantly while running with --enable-prefix-caching, and here I'll show how to reproduce it with a short snippet.

Running 2 vLLM servers with:

without prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8001

and another with prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching

Then running this snippet:

import string 
import random

import openai

vllms = {
    "no-prefix-caching": "http://localhost:8001/v1",
    "with-prefix-caching": "http://localhost:8002/v1",
}

random.seed(0)
prompts = []
for i in range(16):
    prompts.append(''.join(random.choices(string.ascii_lowercase + string.digits, k=512)))

runs = []
for run in range(2):
    print(f"\n🏃 run #{run+1}")

    completions = {k: [] for k in vllms.keys()}
    runs.append(completions)
    for name, endpoint in vllms.items():
        print(f"vLLM {name=}, {endpoint=}")
        client = openai.OpenAI(
            base_url=endpoint,
            api_key="foo"
        )

        for prompt in prompts:
            response = client.completions.create(
                    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    prompt=prompt,
                    temperature=0,
                    max_tokens=4,
            )
            completion = response.choices[0].text
            completions[name].append(completion)

        print(f"completions: {completions[name]}")

        if run > 0 and runs[run][name] != runs[run-1][name]:
            print(f"❌ completions for vLLM {name=} differs from previous run!")
    
    if completions["with-prefix-caching"] != completions["no-prefix-caching"]:
        print("🛑 completions differ between with & without prefix")
        

prints:

🏃 run #1
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']

🏃 run #2
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwma71', '37wk', 'hu5qw', 'jg0m', '1tzkb', '4h7a', '5zq7', 'zxqj', '7k4n', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
❌ completions for vLLM name='with-prefix-caching' differs from previous run!
🛑 completions differ between with & without prefix

This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.

Hopefully one of these PRs resolves the issue 🤞:

@cadedaniel (Collaborator)

We have an improved block manager with better test coverage for prefix caching, including tests that compare prefix-caching outputs against non-prefix-caching outputs for equality -- so this case shouldn't happen, or if it is happening, we can diagnose the failure more easily. Note that the v2 block manager is not yet optimized for performance.

Can you see if it occurs with --use-v2-block-manager?
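For reference, the launch command I have in mind is along these lines (a sketch that reuses the model and port from the report above, with the v2 block manager flag added):

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching --use-v2-block-manager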

hibukipanim commented Jun 17, 2024

Thanks for the reply @cadedaniel.
I tried now with --use-v2-block-manager (version 0.5.0.post1) and it still happens unfortunately.

Edit: I also tried building the current main branch (commit e2b85cf), where #5364 is already merged, and the issue still happens (also with --use-v2-block-manager).

@hibukipanim (Author)

I also built the branch of #5188 and it doesn't resolve the issue.

@colefranks

Possible workaround: #5376 (comment)

@hibukipanim (Author)

Thanks @colefranks.
I tried it, and the workaround doesn't seem to help, but it does change the behavior. I tried several combinations (all with version 0.5.0.post1).

On the first iteration, there is a difference in outputs between running with VLLM_ATTENTION_BACKEND=XFORMERS and without it. Even if we assume that's OK, when --enable-prefix-caching is used, the second iteration with --enable-prefix-caching still differs from the first one.
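For concreteness, the XFORMERS combination looked roughly like this (a sketch; the environment variable is the workaround from the linked comment, and the rest mirrors the command from the report above):

VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching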

@kuangdao

Is this issue solved? I meet the same problem: inconsistent completions.

@SaltFish11

The same thing happened when I replaced the model with OPT-125m and ran offline inference. However, when I inserted torch.manual_seed() (not random.seed) before generate, the result was correct.
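For anyone who wants to try reproducing that, here is a minimal sketch of the offline setup described above (assuming vLLM's offline LLM API, the facebook/opt-125m checkpoint, and that seeding torch before each generate call is the workaround being reported -- details may differ from the exact setup used):

import torch
from vllm import LLM, SamplingParams

# Offline engine with prefix caching enabled.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0, max_tokens=4)

# Hypothetical prompts; in practice they should share a long common prefix.
prompts = ["a long prompt with a shared prefix ..."]

torch.manual_seed(0)  # reported workaround: seed torch before generating
first = [out.outputs[0].text for out in llm.generate(prompts, params)]

torch.manual_seed(0)
second = [out.outputs[0].text for out in llm.generate(prompts, params)]

print(first == second)  # with greedy decoding these should be identical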

bsll commented Jul 12, 2024

@hibukipanim @kuangdao @SaltFish11 I solved the problem by changing the Triton code.
In the file ../triton/common/build.py, find these lines:
cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
and add "-std=c99" after them, like this:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-std=c99",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
        ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-std=c99",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

@hibukipanim (Author)

Thanks @bsll, but I struggle to understand which Triton you mean. There is no such folder in vLLM -- do you mean https://github.com/triton-lang/triton or https://github.com/triton-inference-server/server? I don't see a common/build.py in either.

LLouice commented Jul 15, 2024

Thanks for the workaround, @bsll. @hibukipanim, the location is something like /path/to/miniconda3/envs/vllm/lib/python3.9/site-packages/triton/common/build.py
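If it helps, a quick way to locate the file in whichever environment vLLM runs in (a sketch that assumes Triton is importable there and that the installed Triton version still ships common/build.py):

python -c "import os, triton; print(os.path.join(os.path.dirname(triton.__file__), 'common', 'build.py'))"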

@hibukipanim (Author)

Thanks @bsll & @LLouice
I tried to make the update you suggested in Triton, but unfortunately the issue still reproduces for me (with 0.5.2), with the exact snippet from the first message.

To be more detailed about what I did:
I'm running vLLM in a virtualenv. Inside it, I edited the file at .venv/lib/python3.10/site-packages/triton/common/build.py and changed these lines:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
        ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

to these lines:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC","-std=c99",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
         ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda","-std=c99",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

i.e.:

97c97
<             cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
---
>             cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC","-std=c99",
99c99
<         ])
---
>          ])
102c102
<             cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
---
>             cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda","-std=c99",

I also deleted ~/.triton, which contained a cache (but it wasn't created again after running, so maybe it's not really used in this flow?). Then I re-ran the server.

And I must say it's quite surprising that changing the gcc dialect to c99 would change behavior (but cool if it does...).

@hibukipanim (Author)

@SaltFish11 thanks for the comment. However, I tried adding:

import torch
torch.manual_seed(42)

at the top of vllm/entrypoints/openai/api_server.py and the issue still reproduces

roger0426 commented Jul 19, 2024

Same here: adding "-std=c99" still produces strange output that repeats with 3 to 4 spaces, e.g.:
1. I need to use 2 use 2 use 2 use 2 use 2

Looking forward to further solutions.

hibukipanim commented Aug 4, 2024

@zachzzc Thanks for #7018.

FYI - I tried it now with the commit that merged your PR (fb2c1c8) and also with the current HEAD (9fadc7b), and unfortunately the snippet from this issue still fails with both. (I double-checked by running the same versions without --enable-prefix-caching and they were OK, so prefix caching still has correctness issues.)
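For anyone who wants to check a specific commit, the rough procedure I follow is this (a sketch, assuming a from-source install of vLLM; substitute whichever commit you want to test):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout fb2c1c8
pip install -e .
python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching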

zachzzc (Contributor) commented Aug 5, 2024

Have you tried meaningful inputs (like real sentences) instead of random strings? I wonder if it is just caused by a minor kernel execution difference after the cache is hit.

@hibukipanim (Author)

@zachzzc
Originally I opened this issue after seeing degradation in some internal evals which used real inputs (although it would happen more often under concurrent requests).
Why would kernel execution be different in the case of a cache hit?

zachzzc (Contributor) commented Aug 7, 2024

If you still see the degradation in the real evals, then it would be a true bug. It calls the same kernel with different inputs here, depending on whether the cache hits or not. I will update here if I find anything.
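To illustrate what "different inputs" means, here is a conceptual sketch (my own simplification, not vLLM's actual code): on a prefix-cache hit, tokens whose KV entries are already cached are skipped, so the attention kernel receives only the suffix as the query plus the cached prefix length as context, whereas on a miss it receives the whole prompt:

def split_prompt_for_attention(prompt_token_ids, num_cached_prefix_tokens):
    # Tokens already covered by cached KV blocks are not recomputed.
    query_token_ids = prompt_token_ids[num_cached_prefix_tokens:]
    # The cached prefix is consumed as existing context from the KV cache.
    context_len = num_cached_prefix_tokens
    return query_token_ids, context_len

# cache miss: query = whole prompt, context_len = 0
# cache hit:  query = suffix only,  context_len = length of cached prefix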

@hiyforever

+1 in 0.5.2
