
[RFC]: Custom AscendC Kernel for 'Prepare Input' in the Multi-Step Feature #807

@wonderful199082

Description

Motivation.

In the current implementation of the vLLM Ascend V0 engine, the advance_step function in attention.py contains a section of Python-based logic that updates input_tokens, seq_lens, input_positions, and slot_mapping.

This logic was marked with a clear TODO:

# TODO optimize these codes using ascendc just like flash attention backend using cuda

indicating an explicit need for optimization using custom operators.
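For context, the per-sequence update performed by that Python logic can be sketched roughly as follows. This is an illustrative reconstruction only: the names, argument order, and use of plain Python lists (standing in for torch tensors) are assumptions, not the actual vLLM Ascend code.

```python
def advance_step_py(sampled_token_ids, input_tokens, input_positions,
                    seq_lens, slot_mapping, block_tables, block_size):
    # Illustrative sketch of the Python-side logic to be replaced;
    # all names and shapes here are assumed for exposition.
    for i in range(len(seq_lens)):
        # The token sampled in the previous step becomes the next input.
        input_tokens[i] = sampled_token_ids[i]
        # The new token's position equals the current sequence length.
        pos = seq_lens[i]
        input_positions[i] = pos
        seq_lens[i] = pos + 1  # each sequence grows by one token per step
        # Translate the logical position into a physical KV-cache slot
        # through the sequence's block table.
        block_idx, block_off = divmod(pos, block_size)
        slot_mapping[i] = block_tables[i][block_idx] * block_size + block_off
```

Because this loop runs on the host once per decode step, its cost scales with batch size, which is why moving it into a single NPU kernel is attractive.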

Proposed Change.

This RFC proposes replacing the above Python logic with a custom operator implemented in AscendC that executes directly on the NPU, improving efficiency in multi-step decoding scenarios.

The logic covered by this operator includes:

  • Updating model_input.input_tokens
  • Updating model_input.input_positions
  • Incrementing and updating seq_lens_tensor
  • Computing slot_mapping using block_tables
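A batched formulation of these four updates, which is roughly what a single fused NPU kernel would compute in one launch, might look like the following. NumPy stands in for on-device tensors here, and every name is an assumption for illustration, not the proposed operator's actual interface.

```python
import numpy as np

def advance_step_batched(sampled_token_ids, seq_lens, block_tables, block_size):
    # Hypothetical, loop-free formulation of the four updates listed above.
    input_tokens = sampled_token_ids.copy()   # next step's input tokens
    input_positions = seq_lens.copy()         # each new token's position
    new_seq_lens = seq_lens + 1               # every sequence grows by one
    # Gather each sequence's physical block id, then compute its slot.
    block_idx = seq_lens // block_size
    block_off = seq_lens % block_size
    rows = np.arange(seq_lens.shape[0])
    slot_mapping = block_tables[rows, block_idx] * block_size + block_off
    return input_tokens, input_positions, new_seq_lens, slot_mapping
```

Expressing the updates this way makes clear that they are embarrassingly parallel across sequences, so a single AscendC kernel can replace the per-sequence Python loop and the associated host-device synchronization.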

Feedback Period.

This RFC will be open for feedback until 2025-05-18, one week from the initial submission date.

Please leave comments, questions, or suggestions before then. The author will address all feedback and revise the proposal as needed.

CC List.

@Yikun @wangxiyuan

Any Other Things.

No response
