
FEAT: Xavier: Share KV cache between VLLM replicas #2732

Merged: 18 commits merged into xorbitsai:main on Jan 10, 2025

Conversation

@ChengjieLi28 (Contributor) commented Jan 3, 2025:

Xavier: Share KV cache between VLLM replicas

Naming

The name comes from Professor X (Charles Francis Xavier) in Marvel's X-Men series. The project name starts with "X," and, like Professor X, whose powerful mind controls information, the project metaphorically controls the scheduling of KV cache data across vLLM replicas.

Purpose

When vLLM runs with multiple replicas, long prompts incur a lengthy prefill. If another replica has already computed the corresponding KV cache, it can be transferred and reused directly instead of being recomputed.

Usage

Simply add the parameter enable_xavier=True when starting the vLLM model.
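For example, launching through the xinference Python client might look like the sketch below (the endpoint, model name, and sizes are placeholders, not prescriptive values; enable_xavier is simply forwarded when the model starts):

# A minimal sketch, assuming a running xinference cluster at this
# placeholder endpoint; enable_xavier=True enables KV cache sharing.
from xinference.client import Client

client = Client("http://192.168.xx.xx:9997")
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_engine="vllm",
    model_size_in_billions=7,
    replica=2,               # two replicas that can share KV cache
    enable_xavier=True,      # the new parameter introduced by this PR
)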

Test

Use this script to generate a long prompt for the LLM (about 9k+ prompt tokens):

# Requires: pip install faker pandas tabulate  (DataFrame.to_markdown uses tabulate)
from faker import Faker
import pandas as pd


def gen_data(lines: int) -> str:
    """Generate a fake markdown table with `lines` rows."""
    faker = Faker()
    data = {
        "ID": list(range(lines)),  # the questions below look up rows by ID
        "Name": [faker.name() for _ in range(lines)],
        "Age": [faker.random_int(min=15, max=80) for _ in range(lines)],
        "Occupation": [faker.job() for _ in range(lines)],
        "Country": [faker.country() for _ in range(lines)],
        "Email": [faker.email() for _ in range(lines)],
        "Address": [faker.address() for _ in range(lines)],
        "Phone Number": [faker.phone_number() for _ in range(lines)]
    }
    df = pd.DataFrame(data)
    markdown_table = df.to_markdown(index=False)
    return markdown_table

LONG_PROMPT = "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + f"""
{gen_data(100)}
"""
q1 = "Question: What is the name and country of ID 23? Your answer: The name and country of ID 23 are "
q2 = "Question: What is the name and country of ID 96? Your answer: The name and country of ID 96 are "

Use LONG_PROMPT+q1 and LONG_PROMPT+q2 as prompts, sending each query to the model separately.
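A minimal driver for this test could look like the following sketch (assuming the server exposes an OpenAI-compatible endpoint; the URL and model name are placeholders):

# Hypothetical test driver: send the two queries separately and time
# each end-to-end call. Reuses LONG_PROMPT, q1, q2 from the script above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://192.168.xx.xx:9997/v1", api_key="none")

for q in (q1, q2):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen2.5-instruct",
        messages=[{"role": "user", "content": LONG_PROMPT + q}],
    )
    print(resp.choices[0].message.content)
    print(f"E2E time: {time.perf_counter() - start:.2f} s")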

Test Results:

  • Environment: two RTX 3090 Ti GPUs with NVLink
  • Model: Qwen2.5-Instruct 7B with 2 replicas (one replica per card)

First query (no cache, full prefill computation) E2E time:
LONG_PROMPT+q1: ~2.96 s
Second query (KV cache transferred) E2E time:
LONG_PROMPT+q2: ~1.33 s

Limitations

  • Rollback for xinference is not currently supported (it will be supported in the future).
  • Enabling Xavier also enables vLLM's enable_prefix_caching; vLLM >= 0.6.5 is required.
  • Gloo cannot recognize the 0.0.0.0 address, so when starting xinference you need to use the actual IP address, for example: xinference-local -H 192.168.xx.xx.

@XprobeBot XprobeBot added this to the v1.x milestone Jan 3, 2025
@ChengjieLi28 ChengjieLi28 changed the title FEAT: [WIP] Xavier: Share KV cache between VLLM replicas FEAT: Xavier: Share KV cache between VLLM replicas Jan 9, 2025
@ChengjieLi28 ChengjieLi28 marked this pull request as ready for review January 9, 2025 11:12
@qinxuye (Contributor) left a comment:

LGTM

@ChengjieLi28 merged commit 545ee12 into xorbitsai:main on Jan 10, 2025 (12 of 13 checks passed)
@codingl2k1 (Contributor) commented:

A corner case (eviction racing with transfer):

  • The model begins evicting the block.
  • Another replica queries the block from the block tracker.
  • The evicted block is unregistered from the block tracker.
  • The block is evicted.
  • The other replica transfers the block.

When transferring blocks, the block may have been evicted or replaced by a new one in the meantime. It's better to verify a block hash during transfers: if the block was evicted or the block hash mismatches, we can simply handle it as a cache miss.
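A sketch of the hash-checked transfer idea (all names here are hypothetical, not the PR's actual interfaces):

# Illustrative only: verify the block hash at transfer time and treat an
# evicted or replaced block as a cache miss.
from typing import Optional

def fetch_block(tracker, block_id: int, expected_hash: int) -> Optional[bytes]:
    entry = tracker.query(block_id)        # may race with eviction
    if entry is None:
        return None                        # already evicted -> cache miss
    data = entry.transfer()                # copy the KV block contents
    if entry.block_hash != expected_hash:
        return None                        # replaced meanwhile -> cache miss
    return data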

@qinxuye (Contributor) commented Jan 10, 2025:

How can we reproduce the corner case?

@codingl2k1 (Contributor) replied:

How can we reproduce the corner case?

We can add mock logic to reproduce it. For example, call evict (or modify, to simulate block replacement) on the model's block while the block is being queried, as in the sketch below.
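Something like this toy mock (purely illustrative, not the project's real classes) forces the interleaving:

# Toy mock that forces the race: evict the block between the tracker
# query and the transfer.
class FakeTracker:
    def __init__(self):
        self.blocks = {7: b"kv-bytes"}

    def query(self, block_id):
        return block_id if block_id in self.blocks else None

    def evict(self, block_id):
        self.blocks.pop(block_id, None)

    def transfer(self, block_id):
        return self.blocks.get(block_id)   # None once evicted

tracker = FakeTracker()
bid = tracker.query(7)                     # transfer side sees the block
tracker.evict(7)                           # eviction races in between
assert tracker.transfer(bid) is None       # must be handled as a cache miss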

@qinxuye (Contributor) commented Jan 10, 2025:

OK, how about opening a new issue to track this?

@codingl2k1 (Contributor) commented Jan 10, 2025:

OK, how about opening a new issue to track this?

Let me open an issue.

@kexinoh commented Feb 24, 2025:

There are actually some issues here. In previous versions of vLLM, the prefix cache hash could return different values across processes (for Python < 3.12; details in python/cpython#99540). After CVE-2025-25183, it was changed so that the hash is sufficiently random on all Python versions, for users' security.

Therefore, if you want cross-process synchronization, you may need to preset a secret key or an agreed-upon value to keep the hashes from actually diverging. This can be achieved by controlling PYTHONHASHSEED, which lets you do it without touching vLLM's internal code.
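A quick demo of the effect (standalone Python, not vLLM code):

# With PYTHONHASHSEED pinned, two separate interpreter processes compute
# the same hash for the same string; with the seed unset they would differ.
import os, subprocess, sys

cmd = [sys.executable, "-c", "print(hash('prefix-block'))"]
env = {**os.environ, "PYTHONHASHSEED": "0"}
print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout)
print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout)
# Both runs print the same value only because PYTHONHASHSEED is pinned.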

I'm not sure if my understanding is correct, as I am not very familiar with the project itself. If there are any mistakes, please forgive me.

@qinxuye (Contributor) commented Feb 24, 2025:

Not sure about the randomness; if hash(token_ids) were not deterministic, that would break the prefix cache itself.

@kexinoh commented Feb 24, 2025:

The hashes are only equal within the same process; they are not equal across different processes.
