Description
Motivation.
In the current implementation of vLLM_Ascend V0 Engine, the advance_step function in attention.py contains a section of Python-based logic that handles the update of input_tokens, seq_lens, input_positions, and slot_mapping.
This logic was marked with an explicit TODO:

# TODO optimize these codes using ascendc just like flash attention backend using cuda

indicating a clear need for optimization using a custom operator.
Proposed Change.
This RFC proposes to replace the above Python logic with a highly optimized custom operator implemented in AscendC, designed to execute directly on the NPU for improved efficiency in multi-step decoding scenarios.
The logic covered by this operator includes:
- Updating model_input.input_tokens
- Updating model_input.input_positions
- Incrementing and updating seq_lens_tensor
- Computing slot_mapping using block_tables
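For illustration, the per-step bookkeeping listed above can be sketched in pure Python. The function name, argument list, and loop structure below are assumptions for exposition, not the actual vLLM-Ascend code; the real advance_step operates on tensors, which is exactly why an AscendC kernel is proposed.

```python
def advance_step_sketch(sampled_token_ids, input_tokens, input_positions,
                        seq_lens, slot_mapping, block_tables, block_size):
    """Hypothetical sketch of the per-sequence updates between decode steps."""
    for i in range(len(seq_lens)):
        # 1) The next step's input token is the token just sampled.
        input_tokens[i] = sampled_token_ids[i]
        # 2) The new token's position is the current (pre-increment) length.
        pos = seq_lens[i]
        input_positions[i] = pos
        # 3) Each running sequence grows by one token.
        seq_lens[i] = pos + 1
        # 4) Translate the position into a physical KV-cache slot via the
        #    sequence's block table: slot = block_id * block_size + offset.
        block_id = block_tables[i][pos // block_size]
        slot_mapping[i] = block_id * block_size + pos % block_size
```

A single fused NPU kernel would perform these four updates in one launch over the whole batch, avoiding the Python-loop and host-to-device synchronization overhead at every decode step.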
Feedback Period.
This RFC will be open for feedback until 2025-05-18, one week after the initial submission date.
Please leave comments, questions, or suggestions before then; the author will address all feedback and revise the proposal as needed.
CC List.
Any Other Things.
No response