Description
Motivation.
In the current implementation of vLLM_Ascend V0 Engine, the advance_step function in attention.py contains a section of Python-based logic that handles the update of input_tokens, seq_lens, input_positions, and slot_mapping.
This logic was marked with an explicit TODO:

# TODO optimize these codes using ascendc just like flash attention backend using cuda

indicating a clear need for optimization using a custom operator.
Proposed Change.
This RFC proposes to replace the above Python logic with a highly optimized custom operator implemented in AscendC, designed to execute directly on the NPU for improved efficiency in multi-step decoding scenarios.
The logic covered by this operator includes:
- Updating model_input.input_tokens
- Updating model_input.input_positions
- Incrementing and updating seq_lens_tensor
- Computing slot_mapping using block_tables
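For illustration, the per-step bookkeeping listed above can be sketched in pure Python. The function name, argument list, and loop structure below are assumptions for exposition, not the actual vLLM-Ascend code; the real advance_step operates on tensors, which is exactly why an AscendC kernel is proposed.

```python
def advance_step_sketch(sampled_token_ids, input_tokens, input_positions,
                        seq_lens, slot_mapping, block_tables, block_size):
    """Hypothetical sketch of the per-sequence updates between decode steps."""
    for i in range(len(seq_lens)):
        # 1) The next step's input token is the token just sampled.
        input_tokens[i] = sampled_token_ids[i]
        # 2) The new token's position is the current (pre-increment) length.
        pos = seq_lens[i]
        input_positions[i] = pos
        # 3) Each running sequence grows by one token.
        seq_lens[i] = pos + 1
        # 4) Translate the position into a physical KV-cache slot via the
        #    sequence's block table: slot = block_id * block_size + offset.
        block_id = block_tables[i][pos // block_size]
        slot_mapping[i] = block_id * block_size + pos % block_size
```

A single fused NPU kernel would perform these four updates in one launch over the whole batch, avoiding the Python-loop and host-to-device synchronization overhead at every decode step.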
Feedback Period.
This RFC will be open for feedback until 2025-05-18, one week after the initial submission date.
Please leave comments, questions, or suggestions before then; the author will address all feedback and revise the proposal as needed.
CC List.
Any Other Things.
No response