
[Draft][RFC]: KVCache support nonuniform page size #25314

@zzzzwwjj

Description


Motivation.

Currently, to ensure high utilization of the kv_cache in hybrid attention models, vLLM aligns the kv_cache block's page size across different layers and allows layers with different kv_cache_specs to share the same kv_cache_tensor.
However, this approach prevents vLLM from supporting scenarios where the kv_cache blocks of different layers have nonuniform page sizes, e.g. when kv_cache quantization is applied only to some layers of a single-attention-type model.

Proposed Change.

We would like to support the case where the model has a single kv_cache_spec type but different page sizes across layers. This results in the following changes:

  1. Add a branch in get_kv_cache_groups to handle a uniform kv_cache_spec type with different page sizes; the new branch only needs to change how num_blocks is calculated from available_memory.
    has_uniform_page_size = is_kv_cache_page_size_uniform(kv_cache_spec)
    if is_kv_cache_type_attention_free(kv_cache_spec):
        # This returns an empty list to allow for the KVCacheManager to handle
        # attention free models.
        return []
    elif is_kv_cache_spec_uniform(kv_cache_spec):
        # KV cache of all layers are the same, which is true for
        # most models. Allocate the same amount of memory for
        # each layer.
        return _get_kv_cache_groups_uniform_spec(kv_cache_spec)
    elif uniform_spec := UniformTypeKVCacheSpecs.from_specs(kv_cache_spec):
        # All layers need the same number of token slots (e.g., all layers are
        # full attention, or all layers are sliding window attention with the
        # same window size). Put all layers into one group.
        if has_uniform_page_size:
            return _get_kv_cache_groups_uniform_type(uniform_spec)
        else:
            return _get_kv_cache_groups_uniform_type_nonuniform_page_size(kv_cache_spec)    # new branch
    elif has_uniform_page_size:
        # Model contains multiple attention types, but KV cache of all layers
        # have the same physical memory per block per layer. Split the layers
        # into groups with the same number of layers, and thus same total page
        # size.
        return _get_kv_cache_groups_uniform_page_size(kv_cache_spec)
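To illustrate the change to the num_blocks calculation, here is a minimal sketch of what the new branch could compute. The helper name `compute_num_blocks` and the `LayerSpec` class are hypothetical (not vLLM's actual API); the point is that when page sizes differ per layer but every layer still needs the same number of blocks, one logical block costs the sum of the per-layer page sizes rather than `num_layers * page_size`.

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    # Hypothetical stand-in for a per-layer kv_cache_spec entry.
    page_size_bytes: int  # memory needed per kv_cache block for this layer


def compute_num_blocks(kv_cache_spec: dict[str, LayerSpec],
                       available_memory: int) -> int:
    # All layers share the same block indices, so allocating one more block
    # costs the sum of the per-layer page sizes across all layers.
    bytes_per_block = sum(s.page_size_bytes for s in kv_cache_spec.values())
    return available_memory // bytes_per_block


# Example: one fp16 layer and one int8-quantized layer (half the page size).
specs = {
    "layer.0": LayerSpec(page_size_bytes=2 * 16 * 1024),
    "layer.1": LayerSpec(page_size_bytes=1 * 16 * 1024),
}
num_blocks = compute_num_blocks(specs, available_memory=96 * 1024 * 1024)
```

With these example numbers, 96 MiB divided by 48 KiB per logical block yields 2048 blocks shared by both layers, whereas the existing uniform-page-size path would have to assume a single per-layer page size.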

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
