[Bug]: prefix-caching: inconsistent completions #5543

Open
hibukipanim opened this issue Jun 14, 2024 · 18 comments
Labels
bug Something isn't working

Comments

hibukipanim commented Jun 14, 2024

Your current environment

vLLM version 0.5.0.post1

🐛 Describe the bug

Hi,

It seems there is a dirty-cache issue with --enable-prefix-caching. We noticed it when internal eval scores degraded significantly while running with --enable-prefix-caching, and here I'll show how to reproduce it with a short snippet.

Running 2 vLLM servers with:

without prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8001

and another with prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching

Then running this snippet:

import string 
import random

import openai

vllms = {
    "no-prefix-caching": "http://localhost:8001/v1",
    "with-prefix-caching": "http://localhost:8002/v1",
}

random.seed(0)
prompts = []
for i in range(16):
    prompts.append(''.join(random.choices(string.ascii_lowercase + string.digits, k=512)))

runs = []
for run in range(2):
    print(f"\n🏃 run #{run+1}")

    completions = {k: [] for k in vllms.keys()}
    runs.append(completions)
    for name, endpoint in vllms.items():
        print(f"vLLM {name=}, {endpoint=}")
        client = openai.OpenAI(
            base_url=endpoint,
            api_key="foo"
        )

        for prompt in prompts:
            response = client.completions.create(
                    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    prompt=prompt,
                    temperature=0,
                    max_tokens=4,
            )
            completion = response.choices[0].text
            completions[name].append(completion)

        print(f"completions: {completions[name]}")

        if run > 0 and runs[run][name] != runs[run-1][name]:
            print(f"❌ completions for vLLM {name=} differs from previous run!")
    
    if completions["with-prefix-caching"] != completions["no-prefix-caching"]:
        print("🛑 completions differ between with & without prefix")
        

prints:

🏃 run #1
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']

🏃 run #2
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwma71', '37wk', 'hu5qw', 'jg0m', '1tzkb', '4h7a', '5zq7', 'zxqj', '7k4n', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
❌ completions for vLLM name='with-prefix-caching' differs from previous run!
🛑 completions differ between with & without prefix

This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.

Hopefully one of these PRs resolves the issue 🤞:

@cadedaniel (Collaborator)

We have an improved block manager with better test coverage for prefix caching, including tests that compare prefix-caching outputs against non-prefix-caching outputs for equality -- so this case shouldn't happen, or if it is happening, we can diagnose the failure more easily. Note that the v2 block manager is not yet optimized for performance.

Can you see if it occurs with --use-v2-block-manager?
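For reference, the launch command I have in mind is along these lines (a sketch that reuses the model and port from the report above, with the v2 block manager flag added):

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching --use-v2-block-manager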

hibukipanim commented Jun 17, 2024

Thanks for the reply @cadedaniel.
I tried now with --use-v2-block-manager (version 0.5.0.post1) and it still happens unfortunately.

Edit: I also tried building the current main branch (commit e2b85cf), where #5364 is already merged, and the issue still happens (also with --use-v2-block-manager).

@hibukipanim (Author)

I also built the branch of #5188 and it doesn't resolve the issue.

@colefranks

Possible workaround: #5376 (comment)

@hibukipanim (Author)

Thanks @colefranks.
I tried it, and the workaround doesn't seem to help, but it does change the behavior. I tried several combinations (all with version 0.5.0.post1).

On the first iteration, there is a difference in outputs between running with VLLM_ATTENTION_BACKEND=XFORMERS and without it. Even if we assume that's OK, when --enable-prefix-caching is used, the second iteration with --enable-prefix-caching still differs from the first one.
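For concreteness, the XFORMERS combination looked roughly like this (a sketch; the environment variable is the workaround from the linked comment, and the rest mirrors the command from the report above):

VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching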

@kuangdao

Is this issue solved? I meet the same problem: inconsistent completions.

@SaltFish11

The same thing happened when I replaced the model with OPT-125m and ran offline inference. However, when I inserted torch.manual_seed() (not random.seed) before generate, the result was correct.
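For anyone who wants to try reproducing that, here is a minimal sketch of the offline setup described above (assuming vLLM's offline LLM API, the facebook/opt-125m checkpoint, and that seeding torch before each generate call is the workaround being reported -- details may differ from the exact setup used):

import torch
from vllm import LLM, SamplingParams

# Offline engine with prefix caching enabled.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0, max_tokens=4)

# Hypothetical prompts; in practice they should share a long common prefix.
prompts = ["a long prompt with a shared prefix ..."]

torch.manual_seed(0)  # reported workaround: seed torch before generating
first = [out.outputs[0].text for out in llm.generate(prompts, params)]

torch.manual_seed(0)
second = [out.outputs[0].text for out in llm.generate(prompts, params)]

print(first == second)  # with greedy decoding these should be identical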

bsll commented Jul 12, 2024

@hibukipanim @kuangdao @SaltFish11 I solved the problem by changing the Triton code.
In the file ../triton/common/build.py, find these lines:
cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
and add "-std=c99" after them, like this:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-std=c99",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
        ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-std=c99",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

@hibukipanim (Author)

Thanks @bsll, but I struggle to understand which Triton you mean. There is no such folder in vLLM -- do you mean https://github.com/triton-lang/triton or https://github.com/triton-inference-server/server? I don't see a common/build.py in either.

LLouice commented Jul 15, 2024

Thanks for the workaround, @bsll. @hibukipanim, the location is something like /path/to/miniconda3/envs/vllm/lib/python3.9/site-packages/triton/common/build.py
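If it helps, a quick way to locate the file in whichever environment vLLM runs in (a sketch that assumes Triton is importable there and that the installed Triton version still ships common/build.py):

python -c "import os, triton; print(os.path.join(os.path.dirname(triton.__file__), 'common', 'build.py'))"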

@hibukipanim (Author)

Thanks @bsll & @LLouice
I tried to make the update you suggested in Triton, but unfortunately the issue still reproduces for me (with 0.5.2), with the exact snippet from the first message.

To be more detailed about what I did:
I'm running vLLM in a virtualenv. Inside it, I edited the file at .venv/lib/python3.10/site-packages/triton/common/build.py and changed these lines:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
        ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

to these lines:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC","-std=c99",
            f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
         ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda","-std=c99",
            "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)

i.e.:

97c97
<             cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
---
>             cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC","-std=c99",
99c99
<         ])
---
>          ])
102c102
<             cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
---
>             cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda","-std=c99",

I also deleted ~/.triton, which contained a cache (but it wasn't created again after running, so maybe it's not really used in this flow?). Then I re-ran the server.

And I must say it's quite surprising that changing the gcc dialect to c99 would change behavior (but cool if it does...).

@hibukipanim (Author)

@SaltFish11 thanks for the comment. However, I tried adding:

import torch
torch.manual_seed(42)

at the top of vllm/entrypoints/openai/api_server.py and the issue still reproduces

roger0426 commented Jul 19, 2024

Same here: adding "-std=c99" still produces strange output that repeats with 3 to 4 spaces, e.g.:
1. I need to use 2 use 2 use 2 use 2 use 2

Looking forward to further solutions.

hibukipanim commented Aug 4, 2024

@zachzzc Thanks for #7018.

FYI - I tried it now with the commit that merged your PR (fb2c1c8) and also with the current HEAD (9fadc7b), and unfortunately the snippet from this issue still fails with both. (I double-checked by running the same versions without --enable-prefix-caching and they were OK, so prefix caching still has correctness issues.)
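For anyone who wants to check a specific commit, the rough procedure I follow is this (a sketch, assuming a from-source install of vLLM; substitute whichever commit you want to test):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout fb2c1c8
pip install -e .
python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching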

zachzzc (Contributor) commented Aug 5, 2024

Have you tried meaningful inputs (like real sentences) instead of random strings? I wonder if it is just caused by a minor kernel execution difference after the cache is hit.

@hibukipanim (Author)

@zachzzc
Originally I opened this issue after seeing degradation in some internal evals which used real inputs (although it would happen more often under concurrent requests).
Why would kernel execution be different in the case of a cache hit?

zachzzc (Contributor) commented Aug 7, 2024

If you still see the degradation in the real evals, then it would be a true bug. It calls the same kernel with different inputs here, depending on whether the cache hits or not. I will update here if I find anything.
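To illustrate what "different inputs" means, here is a conceptual sketch (my own simplification, not vLLM's actual code): on a prefix-cache hit, tokens whose KV entries are already cached are skipped, so the attention kernel receives only the suffix as the query plus the cached prefix length as context, whereas on a miss it receives the whole prompt:

def split_prompt_for_attention(prompt_token_ids, num_cached_prefix_tokens):
    # Tokens already covered by cached KV blocks are not recomputed.
    query_token_ids = prompt_token_ids[num_cached_prefix_tokens:]
    # The cached prefix is consumed as existing context from the KV cache.
    context_len = num_cached_prefix_tokens
    return query_token_ids, context_len

# cache miss: query = whole prompt, context_len = 0
# cache hit:  query = suffix only,  context_len = length of cached prefix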

@hiyforever

+1 in 0.5.2
