
[Performance]: FlashInfer attn backend. Use dynamic AttentionCGSupport #26856

@vadiklyutiy

Description


Proposal to improve performance

Right now the FlashInfer attn backend uses AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE, which prevents using FULL cudagraph when speculative decoding is enabled.

In fact, FlashInfer contains several attention implementations that are chosen dynamically. The TRT-LLM Gen backend (one of the fastest) supports speculative decoding (AttentionCGSupport.UNIFORM_BATCH), so it makes sense to enable cudagraph for the decode phase in that case. A sketch of the idea follows.
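A minimal sketch of what "dynamic AttentionCGSupport" could look like. The enum members mirror the ones named in this issue; the builder class name and the `uses_trtllm_gen_decode` helper are hypothetical placeholders, not the actual vLLM symbols.

```python
# Hedged sketch only: FlashInferMetadataBuilder and uses_trtllm_gen_decode
# are illustrative names, not the exact vLLM identifiers.
from enum import Enum, auto


class AttentionCGSupport(Enum):
    # Mirrors the support levels referenced in this issue.
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    UNIFORM_BATCH = auto()
    ALWAYS = auto()


def uses_trtllm_gen_decode() -> bool:
    """Placeholder for whatever check selects the TRT-LLM Gen kernels."""
    ...


class FlashInferMetadataBuilder:
    # Instead of a fixed class-level attribute:
    #   attn_cudagraph_support = AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
    # report support dynamically, based on the kernel actually chosen.
    @property
    def attn_cudagraph_support(self) -> AttentionCGSupport:
        if uses_trtllm_gen_decode():
            # TRT-LLM Gen handles uniform multi-token decode batches,
            # so FULL cudagraph can cover speculative decoding.
            return AttentionCGSupport.UNIFORM_BATCH
        return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
```

With a dynamic property like this, the cudagraph dispatcher would see UNIFORM_BATCH whenever the TRT-LLM Gen path is active and could capture full graphs for speculative decode, while other FlashInfer kernels keep the current conservative behavior.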

Labels: performance (Performance-related issues)