Motivation.
Currently, to ensure high utilization of the KV cache in hybrid attention models, vLLM aligns the KV cache block page size across different layers and allows layers with different `kv_cache_spec`s to share the same `kv_cache_tensor`.
However, this approach prevents vLLM from supporting scenarios where the KV cache blocks of different layers have non-uniform page sizes, such as when KV cache quantization is applied to only some layers of a model with a single attention type.
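To make the mismatch concrete, here is a rough back-of-the-envelope sketch (the layer shapes, block size, and helper name are made up for illustration; the formula is only an approximation of how a block's page size scales with dtype width): quantizing the KV cache of some layers shrinks their per-block page size, so layers that share a spec type no longer share a page size.

```python
# Illustrative only: made-up shapes, not vLLM internals.
def approx_page_size_bytes(block_size: int, num_kv_heads: int,
                           head_size: int, dtype_size: int) -> int:
    # One block stores K and V for `block_size` tokens -> factor of 2.
    return 2 * block_size * num_kv_heads * head_size * dtype_size

# Two full-attention layers of the same model: one keeps a bf16 KV cache,
# the other uses fp8 KV cache quantization.
bf16_page = approx_page_size_bytes(block_size=16, num_kv_heads=8, head_size=128, dtype_size=2)
fp8_page = approx_page_size_bytes(block_size=16, num_kv_heads=8, head_size=128, dtype_size=1)
print(bf16_page, fp8_page)  # 65536 32768 -> same spec type, different page sizes
```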
Proposed Change.
We would like to support the case where the model has only a single type of `kv_cache_spec` but different page sizes across layers. This will result in the following changes:
- Add a branch in `get_kv_cache_groups` to support the case of a uniform `kv_cache_spec` type with different page sizes. The new branch only needs to modify how `num_blocks` is calculated from `available_memory` (see the sketch after the code below).

```python
    has_uniform_page_size = is_kv_cache_page_size_uniform(kv_cache_spec)
    if is_kv_cache_type_attention_free(kv_cache_spec):
        # This returns an empty list to allow for the KVCacheManager to handle
        # attention free models.
        return []
    elif is_kv_cache_spec_uniform(kv_cache_spec):
        # KV cache of all layers are the same, which is true for
        # most models. Allocate the same amount of memory for
        # each layer.
        return _get_kv_cache_groups_uniform_spec(kv_cache_spec)
    elif uniform_spec := UniformTypeKVCacheSpecs.from_specs(kv_cache_spec):
        # All layers need the same number of token slots (e.g., all layers are
        # full attention, or all layers are sliding window attention with the
        # same window size). Put all layers into one group.
        if has_uniform_page_size:
            return _get_kv_cache_groups_uniform_type(uniform_spec)
        else:
            return _get_kv_cache_groups_uniform_type_nonuniform_page_size(kv_cache_spec)    # new branch
    elif has_uniform_page_size:
        # Model contains multiple attention types, but KV cache of all layers
        # have the same physical memory per block per layer. Split the layers
        # into groups with the same number of layers, and thus same total page
        # size.
    return _get_kv_cache_groups_uniform_page_size(kv_cache_spec)
```
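As a rough sketch of the `num_blocks` change this branch implies (illustrative pseudocode only, not the actual vLLM implementation; the helper name and per-layer mapping below are made up): all layers still need the same number of blocks, but one block now costs a different number of bytes in each layer, so `num_blocks` is bounded by `available_memory` divided by the summed per-layer page size rather than by a single shared page size times the number of layers.

```python
# Illustrative sketch only; `available_memory` and the per-layer page-size
# mapping mirror the wording of this RFC, not actual vLLM code.
def num_blocks_for_nonuniform_page_size(
    available_memory: int, page_size_bytes_per_layer: dict[str, int]
) -> int:
    # Uniform spec type: every layer needs the same number of blocks,
    # but each block costs a different number of bytes per layer.
    bytes_per_block_across_layers = sum(page_size_bytes_per_layer.values())
    return available_memory // bytes_per_block_across_layers


# Example: two bf16 layers (64 KiB pages) and two fp8-quantized layers
# (32 KiB pages) sharing 8 GiB of KV cache memory.
layers = {"layer.0": 65536, "layer.1": 65536, "layer.2": 32768, "layer.3": 32768}
print(num_blocks_for_nonuniform_page_size(8 * 1024**3, layers))  # 43690
```

Each layer's `kv_cache_tensor` would then presumably be sized as `num_blocks` times that layer's own page size, rather than every layer reusing one tensor of identical size.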
Feedback Period.

No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.