
[Performance]: FlashInfer attn backend. Use dynamic AttentionCGSupport #26856

@vadiklyutiy

Description


Proposal to improve performance

Right now the FlashInfer attn backend uses AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE, which prevents using FULL cudagraph when speculative decoding is enabled.

In fact, FlashInfer contains several attention implementations that are chosen dynamically. The TRT-LLM Gen backend (one of the fastest) supports speculative decoding (AttentionCGSupport.UNIFORM_BATCH), so it makes sense to enable cudagraph for the decode phase in that case. A sketch of the idea follows.
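A minimal sketch of what "dynamic AttentionCGSupport" could look like. The enum members mirror the ones named in this issue; the builder class name and the `uses_trtllm_gen_decode` helper are hypothetical placeholders, not the actual vLLM symbols.

```python
# Hedged sketch only: FlashInferMetadataBuilder and uses_trtllm_gen_decode
# are illustrative names, not the exact vLLM identifiers.
from enum import Enum, auto


class AttentionCGSupport(Enum):
    # Mirrors the support levels referenced in this issue.
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    UNIFORM_BATCH = auto()
    ALWAYS = auto()


def uses_trtllm_gen_decode() -> bool:
    """Placeholder for whatever check selects the TRT-LLM Gen kernels."""
    ...


class FlashInferMetadataBuilder:
    # Instead of a fixed class-level attribute:
    #   attn_cudagraph_support = AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
    # report support dynamically, based on the kernel actually chosen.
    @property
    def attn_cudagraph_support(self) -> AttentionCGSupport:
        if uses_trtllm_gen_decode():
            # TRT-LLM Gen handles uniform multi-token decode batches,
            # so FULL cudagraph can cover speculative decoding.
            return AttentionCGSupport.UNIFORM_BATCH
        return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
```

With a dynamic property like this, the cudagraph dispatcher would see UNIFORM_BATCH whenever the TRT-LLM Gen path is active and could capture full graphs for speculative decode, while other FlashInfer kernels keep the current conservative behavior.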

Labels: performance (Performance-related issues)