Proposal to improve performance
Right now the FlashInfer attention backend advertises AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE, which prevents using FULL cudagraph mode when speculative decoding is enabled.
In fact, FlashInfer contains several attention kernels that are chosen dynamically. The TRT-LLM Gen backend (one of the fastest) supports speculative decoding (AttentionCGSupport.UNIFORM_BATCH), so it makes sense to enable cudagraphs for the decode phase when that kernel is selected.
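
A minimal, self-contained sketch of the idea (the enum here only mirrors vLLM's AttentionCGSupport, and the helper name `cudagraph_support_for_flashinfer` plus the `uses_trtllm_gen_decode` flag are illustrative assumptions, not existing vLLM API): the backend could report UNIFORM_BATCH when the TRT-LLM Gen decode path is active and keep the current conservative level otherwise.

```python
# Sketch only: the real enum lives inside vLLM's v1 attention backend utils;
# values and names here are an approximation for illustration.
import enum


class AttentionCGSupport(enum.Enum):
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1  # full cudagraph only for 1-token decode
    UNIFORM_BATCH = 2                # full cudagraph for uniform multi-token decode
    ALWAYS = 3


def cudagraph_support_for_flashinfer(uses_trtllm_gen_decode: bool) -> AttentionCGSupport:
    """Pick which cudagraph support level the backend should advertise.

    If the TRT-LLM Gen decode kernel is selected (it handles a uniform number
    of query tokens per request, which is what speculative decoding needs),
    report UNIFORM_BATCH so full cudagraphs can cover the decode phase;
    otherwise fall back to the current single-token-decode level.
    """
    if uses_trtllm_gen_decode:
        return AttentionCGSupport.UNIFORM_BATCH
    return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE


if __name__ == "__main__":
    print(cudagraph_support_for_flashinfer(uses_trtllm_gen_decode=True))
```

The actual change would presumably live in the FlashInfer metadata builder, switching its advertised support level based on which decode kernel it dispatches to.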