Motivation.
Currently, to ensure high utilization of the KV cache in hybrid attention models, vLLM aligns the KV cache block page size across different layers and allows layers with different `kv_cache_spec`s to share the same `kv_cache_tensor`.
However, this approach prevents vLLM from supporting scenarios where the KV cache blocks of different layers have non-uniform page sizes, such as when KV cache quantization is applied to only some layers of a model with a single attention type.
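To make the mismatch concrete, here is a rough back-of-the-envelope sketch (the layer shapes, block size, and helper name are made up for illustration; the formula is only an approximation of how a block's page size scales with dtype width): quantizing the KV cache of some layers shrinks their per-block page size, so layers that share a spec type no longer share a page size.

```python
# Illustrative only: made-up shapes, not vLLM internals.
def approx_page_size_bytes(block_size: int, num_kv_heads: int,
                           head_size: int, dtype_size: int) -> int:
    # One block stores K and V for `block_size` tokens -> factor of 2.
    return 2 * block_size * num_kv_heads * head_size * dtype_size

# Two full-attention layers of the same model: one keeps a bf16 KV cache,
# the other uses fp8 KV cache quantization.
bf16_page = approx_page_size_bytes(block_size=16, num_kv_heads=8, head_size=128, dtype_size=2)
fp8_page = approx_page_size_bytes(block_size=16, num_kv_heads=8, head_size=128, dtype_size=1)
print(bf16_page, fp8_page)  # 65536 32768 -> same spec type, different page sizes
```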
Proposed Change.
We would like to support the case where the model has only a single type of `kv_cache_spec` but different page sizes across layers. This will result in the following changes:
- Add a branch in `get_kv_cache_groups` to support the case of a uniform `kv_cache_spec` type with different page sizes. The new branch only needs to modify how `num_blocks` is calculated from `available_memory` (see the sketch after the code below).

```python
    has_uniform_page_size = is_kv_cache_page_size_uniform(kv_cache_spec)
    if is_kv_cache_type_attention_free(kv_cache_spec):
        # This returns an empty list to allow for the KVCacheManager to handle
        # attention free models.
        return []
    elif is_kv_cache_spec_uniform(kv_cache_spec):
        # KV cache of all layers are the same, which is true for
        # most models. Allocate the same amount of memory for
        # each layer.
        return _get_kv_cache_groups_uniform_spec(kv_cache_spec)
    elif uniform_spec := UniformTypeKVCacheSpecs.from_specs(kv_cache_spec):
        # All layers need the same number of token slots (e.g., all layers are
        # full attention, or all layers are sliding window attention with the
        # same window size). Put all layers into one group.
        if has_uniform_page_size:
            return _get_kv_cache_groups_uniform_type(uniform_spec)
        else:
            return _get_kv_cache_groups_uniform_type_nonuniform_page_size(kv_cache_spec)    # new branch
    elif has_uniform_page_size:
        # Model contains multiple attention types, but KV cache of all layers
        # have the same physical memory per block per layer. Split the layers
        # into groups with the same number of layers, and thus same total page
        # size.
    return _get_kv_cache_groups_uniform_page_size(kv_cache_spec)
```
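As a rough sketch of the `num_blocks` change this branch implies (illustrative pseudocode only, not the actual vLLM implementation; the helper name and per-layer mapping below are made up): all layers still need the same number of blocks, but one block now costs a different number of bytes in each layer, so `num_blocks` is bounded by `available_memory` divided by the summed per-layer page size rather than by a single shared page size times the number of layers.

```python
# Illustrative sketch only; `available_memory` and the per-layer page-size
# mapping mirror the wording of this RFC, not actual vLLM code.
def num_blocks_for_nonuniform_page_size(
    available_memory: int, page_size_bytes_per_layer: dict[str, int]
) -> int:
    # Uniform spec type: every layer needs the same number of blocks,
    # but each block costs a different number of bytes per layer.
    bytes_per_block_across_layers = sum(page_size_bytes_per_layer.values())
    return available_memory // bytes_per_block_across_layers


# Example: two bf16 layers (64 KiB pages) and two fp8-quantized layers
# (32 KiB pages) sharing 8 GiB of KV cache memory.
layers = {"layer.0": 65536, "layer.1": 65536, "layer.2": 32768, "layer.3": 32768}
print(num_blocks_for_nonuniform_page_size(8 * 1024**3, layers))  # 43690
```

Each layer's `kv_cache_tensor` would then presumably be sized as `num_blocks` times that layer's own page size, rather than every layer reusing one tensor of identical size.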
Feedback Period.

No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.