[RFC]: Reorganizing ViT Abstraction and Attention Selection Logic

### Motivation.

This RFC aims to address the following issues:
1. The ViT is still tightly coupled with the text backbone attention. This RFC continues the effort to decouple the two.

2. ViT logic overrides are scattered across the codebase. We should avoid overriding ViT logic in model definition classes. The platform class should define which ViT backends are supported and how they are overridden.

3. Since `torch.compile` was introduced into the ViT (so far only for the Qwen VL model, starting with PR https://github.com/vllm-project/vllm/pull/23207 ), the AMD ViT code path has been broken. The new approach will accommodate this feature. `torch.compile` has brought significant performance improvements, so we can now consider replacing Triton kernels with PyTorch-native implementations, since `torch.compile`-generated code may be faster than custom Triton kernels.

4. Ensure ViT changes take the other model definition files (`model.py` files) into account. Current changes only involve `qwen_2_5_vl.py`, but they potentially affect all other model files:

- `vllm/model_executor/models/dots_ocr.py`
- `vllm/model_executor/models/ernie45_vl.py`
- `vllm/model_executor/models/glm4_1v.py`
- `vllm/model_executor/models/qwen2_vl.py`
- `vllm/model_executor/models/siglip2navit.py`







### Proposed Change.

**NOTE:** More details will be added as I write up a version that includes all of these changes.

The first step is a quick reorganization of the ViT attention that retains the current `--mm-encoder-attn-backend` interface, which was introduced in PR https://github.com/vllm-project/vllm/pull/27061 and fixed in the follow-up bugfix PR https://github.com/vllm-project/vllm/pull/27124 .


1. First, we should shrink `_Backend` (see https://github.com/vllm-project/vllm/pull/27061/files#r2443909604 ) by introducing a separate `_MHA_Backend` registry.

2. Make ViT attention platform specific. Backend selection should be determined in the `platform` interface, and any override should also be performed there. We should avoid doing either in the `model.py` files.

  - `get_vit_attn_backend` in the `platform` interface must be able to access the value of `--mm-encoder-attn-backend`.

  - The `platform` interface should return only a `_MHA_Backend` value, never the attention functions. The functions should be returned only through `maybe_get_vit_flash_attn_backend`.

  - Honor `--mm-encoder-attn-backend` so that we can write unit tests that cover every backend. AMD Instinct GPUs can test all backends. Radeon GPUs can only use the `TORCH_SDPA` code path.

  - We need to deprecate this line: `https://github.com/vllm-project/vllm/blob/33a0ea5f3264b5b2f571b8a53357e10efcc94670/vllm/model_executor/models/vision.py#L96` . It uses `VLLM_ATTENTION_BACKEND`, which is meant for the text backbone. The ViT should not use this environment variable.

  - Add a `logger.info_once` call so that users know which `_MHA_Backend` is finally selected.

  - Clean up the CUDA code path. Since `vllm.vllm_flash_attn` is just a wrapper for the `flash_attn` library, on CUDA we should always use `vllm.vllm_flash_attn` instead of `flash_attn`. The code in question is at https://github.com/vllm-project/vllm/blob/ba33e8830dceb32e9b03508bbff435e3082759b8/vllm/attention/layer.py#L120-L125 .

3. Enable unit tests that cover all backends. Because the models are large, the tests will check the available VRAM and run only if it is sufficient. Providing such unit tests lets developers run them locally.




### Feedback Period.


Changes


### CC List.

@ywang96  @DarkLight1337  @Isotr0py 

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

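Excerpt embedded in this issue, which appears to be the `vllm/attention/layer.py#L120-L125` permalink target referenced under Proposed Change:

    elif current_platform.is_cuda():
        if attn_backend != _Backend.FLASH_ATTN and check_upstream_fa_availability(
            torch.get_default_dtype()
        ):
            attn_backend = _Backend.FLASH_ATTN
            use_upstream_fa = True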
