Make flashinfer kernels cuda graphs friendly #187

AgrawalAmey · 2024-03-20T00:14:09Z

Thanks for creating these awesome kernels! I am trying to get flashinfer kernels to work with cuda graphs. But it appears that several parallelism decisions (block size, num_q_tiles, etc.) are made on the fly based on the input data in the forward function. This makes it difficult to capture flashinfer kernels in cuda graphs in a generic manner. I think one solution to the problem would be to introduce a launcher kernel which would factor in the input metadata and launch the actual the actual cuda kernel using dynamic parallelism. Towards that, following are the items I have identified --

1. BatchPrefillWithPagedKVCachePyTorchWrapper::Forward -- handle return lse?
2. BatchPrefillWithPagedKVCachePyTorchWrapper::Forward -- paged_kv_t batch_size should not be on cpu side
3. BatchPrefillWithPagedKVCacheWrapperDispatched -- make cuda device function or get rid of it
4. BatchPrefillWithPagedKVCacheWrapperDispatched -- num_frags_x, num_qo_tiles, batch size need to be 
5. BatchPrefillWithPagedKVCacheWrapperDispatched -- do not access handler state directly in the function
6. BatchPrefillWithPagedKVCacheDispatched -- make cuda device function
7. BatchPrefillWithPagedKVCacheDispatched -- put num_qo_tiles on device accessible memory
8. BatchPrefillWithPagedKVCacheDispatched -- Make validations gpu friendly
9. Batch size should be explicit input parameter not be based on length of indptr, so that inputs can be padded.

@yzh119 please let me know what would be the best way to proceed?

The text was updated successfully, but these errors were encountered:

yzh119 · 2024-03-22T23:59:06Z

Hi @AgrawalAmey , thanks for bringing this up, I have some ideas about the CUDA graph integration with flashinfer:

The kernels to be executed can be determined before the a decode/prefill step (for all layers) by analyze the shapes, we can compile the CUDA Graph for all possible combinations (not too many) ahead of time, and dispatch to one of them according to the shapes.

Regarding dynamic parallelism:

introduce a launcher kernel which would factor in the input metadata and launch the actual the actual cuda kernel using dynamic parallelism

It sounds tricky to me because the required shared memory size/grid size varies for different schedules.

AgrawalAmey · 2024-03-23T00:08:31Z

Hi @yzh119!

I have one implementation in sarathi-serve which tries to list different combinations, and capture them. But with increasing batch size and big variance in input sequences, the number of possibilities seemed explode. Plus, prefill + decode requests clubbed together makes it further more challenging. The memory cost of cuda graphs becomes too high as the number of combinations increases.

The child kernel/dynamic parallelism proposal is aimed to solve the challenge with different grid size etc. Essentially, the launcher kernel will be triggered with a single warp. Inside the launcher kernel, we can determine all the launch params and launch the actual attention kernel.

AgrawalAmey · 2024-03-23T00:11:39Z

A sample program to explain what I mean:

#include <cuda_runtime.h>
#include <iostream>


__global__ void subKernel(int *data) {
    printf("Data before sub kernel: %d\n", *data);
    (*data) -= 1;
}

__global__ void addKernel(int *data) {
    printf("Data before add kernel: %d\n", *data);
    (*data) += 1;
}

struct UserData {
    int data;
    bool op;
};

__global__ void launchChildKernelFromDevice(void *_userData) {
    UserData *userData = (UserData *)_userData;
    bool op = userData->op;

    if (op) {
        addKernel<<<1, 1>>>((int*)userData);
    } else {
        subKernel<<<1, 1>>>((int*)userData);
    }
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    UserData *userData;
    cudaMallocHost(&userData, sizeof(UserData));

    userData->data = 10;
    userData->op = true;

    // run add kernel for sanity check

    cudaStreamSynchronize(stream);
    std::cout << "Data before kernel: " << userData->data << std::endl;
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);
    cudaStreamSynchronize(stream);
    std::cout << "Data after kernel: " << userData->data << std::endl;

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Begin graph capture
    cudaStreamSynchronize(stream);
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Use cuda host function to launch child kernel
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);

    // End graph capture
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    
    cudaStreamSynchronize(stream);

    printf("Data after graph: %d\n", userData->data);

    // Run the graph
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    printf("Data after graph replay: %d\n", userData->data);

    userData->op = false;
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    printf("Data after graph replay with different op: %d\n", userData->data);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(userData);

    return 0;
}

yzh119 · 2024-03-23T00:24:14Z

Thanks for your explaination, that's sounds reasonable.

To proceed, I'd love to write some documentations on our dispatching rules and see if we can describe them in dynamic parallelism. Before that I have to make #75 done because it will affect our dispatching strategy.

I'll be glad to follow up next week and we can schedule a meeting on zoom (you can drop me an email at zhye@cs.washington.edu).

AgrawalAmey · 2024-03-23T00:30:43Z

Yes, that would be great, I will send out a when2meet link on email, thank you!

ZSL98 · 2024-04-03T06:01:44Z

Hi, @AgrawalAmey, will your sarathi or sarathi-serve be open-sourced?

AgrawalAmey · 2024-04-03T06:03:08Z

Hey @ZSL98, we are working with the vLLM team to get Sarathi-Serve scheduler support inside vLLM

@AgrawalAmey

As requested in #187 , this PR adds initial support of `CUDAGraph` compatibility of flashinfer batch decode attention kernels. This PR is the first step towards full CUDAGraph support and we will implement CUDAGraph compatible prefill operators in later PRs. # Proposed APIs We add another wrapper `CUDAGraphBatchDecodeWithPagedKVCacheWrapper`, and user need to pre-allocation page data structure buffers to initialize this wrapper class. Once initiated, these buffers are pinned on GPUs in the life cycle of the wrapper class. The behavior of `CUDAGraphBatchDecodeWithPagedKVCacheWrapper` is a little bit different from `BatchDecodeWithPagedKVCacheWrapper`'s: we will only run a fixed set of kernels in CUDAGraph mode, no matter what the input shape is (the original implementation will dispatch to different kernels according to different input shapes). This PR also fix the address of all kernel input pointers to accomodate the constraint of CUDAGraph capturing. # Examples See `test_cuda_graph_batch_decode_with_paged_kv_cache` in unittests. `begin_forward` functions should not be captured as some of the operators are not allowed to be captured. cc @AgrawalAmey @LiuXiaoxuanPKU @comaniac

…ntion (#275) Followup of #187 and #256

yzh119 · 2024-06-20T16:21:42Z

The CUDA graph compatibility was resolved in https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.5

The current strategy is:

If batch size is large enough, we don't use split-k and the kernel can be traced properly (we need to reuse the indptr/indices/last_page_len buffer, this was properly handled in PyTorch wrapper APIs).
If batch size is small, we use split-k, but the grid size must be fixed. We set the grid size to a fix value which is computed from the number of SMs in GPU, and we have a kernel argument block_valid_mask to determine whether to skip one threadblock's computation to achieve runtime dynamism.

AgrawalAmey · 2024-06-20T21:24:58Z

@yzh119 thanks a lot for all the amazing work! I wanted to understand split-k behaves when the sequence length is significantly different between capture and replay time. For instance, if during capture we have seq length of 1k and during replay we have a seq of length 100k, would the parallelization parameters get applied appropriately?

yzh119 · 2024-06-20T22:03:50Z

Yes, they will be properly handled.

When cudagraph is enabled, we decides whether to split-k only on batch size (for decode) and query lengths (for append), not on kv-cache length, that said, so it's safe to capture when kv-cache length is small (we have test cases for capturing for small kv-length and replay with long:

flashinfer/python/tests/test_batch_decode_kernels.py

Lines 136 to 286 in 231b1dc

    
           def test_cuda_graph_batch_decode_with_paged_kv_cache( 
        
               batch_size, 
        
               kv_len, 
        
               page_size, 
        
               num_kv_heads, 
        
               num_qo_heads, 
        
               head_dim, 
        
               kv_layout, 
        
               pos_encoding_mode, 
        
               q_dtype, 
        
               kv_dtype, 
        
           ): 
        
               q = torch.randn(batch_size, num_qo_heads, head_dim).to(0).to(q_dtype) 
        
               num_pages_per_seq = (kv_len + page_size - 1) // page_size 
        
               total_num_pages = num_pages_per_seq * batch_size 
        
               kv_data = ( 
        
                   torch.randn(total_num_pages, 2, num_kv_heads, page_size, head_dim).to(0) 
        
                   if kv_layout == "HND" 
        
                   else torch.randn(total_num_pages, 2, page_size, num_kv_heads, head_dim).to(0) 
        
               ) 
        
               kv_data_dtype = kv_data.to(kv_dtype) 
        
               kv_indptr_host_warmup = torch.arange(0, batch_size + 1).int() 
        
               kv_indices_host_warmup = torch.arange(0, batch_size).int() 
        
               kv_last_page_len_host_warmup = torch.full( 
        
                   (batch_size,), page_size, dtype=torch.int32 
        
               ) 
        
               # NOTE(Zihao): allocate more space than needed for testing 
        
               kv_indptr_device_buffer = torch.empty(batch_size + 1).int().to(0) 
        
               kv_indices_device_buffer = torch.empty(total_num_pages).int().to(0) 
        
               kv_last_page_device_buffer = torch.empty(batch_size).int().to(0) 
        
               workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8).to(0) 
        
               wrapper = flashinfer.CUDAGraphBatchDecodeWithPagedKVCacheWrapper( 
        
                   workspace_buffer, 
        
                   kv_indptr_device_buffer, 
        
                   kv_indices_device_buffer, 
        
                   kv_last_page_device_buffer, 
        
                   kv_layout, 
        
               ) 
        
               wrapper.begin_forward( 
        
                   kv_indptr_host_warmup, 
        
                   kv_indices_host_warmup, 
        
                   kv_last_page_len_host_warmup, 
        
                   num_qo_heads, 
        
                   num_kv_heads, 
        
                   head_dim, 
        
                   page_size, 
        
                   "NONE", 
        
                   data_type=kv_dtype, 
        
                   q_data_type=q_dtype, 
        
               ) 
        
               # warmup 
        
               s = torch.cuda.Stream() 
        
               s.wait_stream(torch.cuda.current_stream()) 
        
               with torch.cuda.stream(s): 
        
                   for _ in range(3): 
        
                       o = wrapper.forward(q, kv_data_dtype, pos_encoding_mode=pos_encoding_mode) 
        
               torch.cuda.current_stream().wait_stream(s) 
        
               # capture 
        
               g = torch.cuda.CUDAGraph() 
        
               with torch.cuda.graph(g): 
        
                   o = wrapper.forward(q, kv_data_dtype, pos_encoding_mode=pos_encoding_mode) 
        
               wrapper.end_forward() 
        
               # replay multiple times 
        
               for i in range(1, min(4, num_pages_per_seq)): 
        
                   kv_indptr_host = torch.arange(0, batch_size + 1).int() * i 
        
                   kv_indices_host = torch.arange(0, i * batch_size).int() 
        
                   kv_last_page_len_host = torch.full((batch_size,), page_size, dtype=torch.int32) 
        
                   wrapper.begin_forward( 
        
                       kv_indptr_host, 
        
                       kv_indices_host, 
        
                       kv_last_page_len_host, 
        
                       num_qo_heads, 
        
                       num_kv_heads, 
        
                       head_dim, 
        
                       page_size, 
        
                       "NONE", 
        
                       data_type=kv_dtype, 
        
                       q_data_type=q_dtype, 
        
                   ) 
        
                   g.replay() 
        
               # replay again 
        
               kv_indptr_host = torch.arange(0, batch_size + 1).int() * num_pages_per_seq 
        
               kv_indices_host = torch.arange(0, total_num_pages).int() 
        
               kv_last_page_len_host = torch.full( 
        
                   (batch_size,), (kv_len - 1) % page_size + 1, dtype=torch.int32 
        
               ) 
        
               wrapper.begin_forward( 
        
                   kv_indptr_host, 
        
                   kv_indices_host, 
        
                   kv_last_page_len_host, 
        
                   num_qo_heads, 
        
                   num_kv_heads, 
        
                   head_dim, 
        
                   page_size, 
        
                   "NONE", 
        
                   data_type=kv_dtype, 
        
                   q_data_type=q_dtype, 
        
               ) 
        
               g.replay() 
        
               # compute ground truth and compare 
        
               kv_indptr = kv_indptr_host.to(0) 
        
               kv_indices = kv_indices_host.to(0) 
        
               kv_last_page_len = kv_last_page_len_host.to(0) 
        
               for i in range(batch_size): 
        
                   perm_dims = [0, 2, 1, 3] if kv_layout == "HND" else [0, 1, 2, 3] 
        
                   perm_dims_last = [1, 0, 2] if kv_layout == "HND" else [0, 1, 2] 
        
                   qi = q[i] 
        
                   ki = torch.cat( 
        
                       [ 
        
                           kv_data[kv_indptr[i] : kv_indptr[i + 1] - 1, 0] 
        
                           .permute(*perm_dims) 
        
                           .reshape(-1, num_kv_heads, head_dim), 
        
                           ( 
        
                               kv_data[kv_indptr[i + 1] - 1, 0, :, : kv_last_page_len[i]] 
        
                               if kv_layout == "HND" 
        
                               else kv_data[kv_indptr[i + 1] - 1, 0, : kv_last_page_len[i], :] 
        
                           ) 
        
                           .permute(*perm_dims_last) 
        
                           .reshape(-1, num_kv_heads, head_dim), 
        
                       ], 
        
                       dim=0, 
        
                   ).to(kv_dtype) 
        
                   vi = torch.cat( 
        
                       [ 
        
                           kv_data[kv_indptr[i] : kv_indptr[i + 1] - 1, 1] 
        
                           .permute(*perm_dims) 
        
                           .reshape(-1, num_kv_heads, head_dim), 
        
                           ( 
        
                               kv_data[kv_indptr[i + 1] - 1, 1, :, : kv_last_page_len[i]] 
        
                               if kv_layout == "HND" 
        
                               else kv_data[kv_indptr[i + 1] - 1, 1, : kv_last_page_len[i], :] 
        
                           ) 
        
                           .permute(*perm_dims_last) 
        
                           .reshape(-1, num_kv_heads, head_dim), 
        
                       ], 
        
                       dim=0, 
        
                   ).to(kv_dtype) 
        
                   o_ref_i = flashinfer.single_decode_with_kv_cache( 
        
                       qi, ki, vi, pos_encoding_mode=pos_encoding_mode 
        
                   ) 
        
                   o_i_np = o[i].cpu().numpy() 
        
                   o_ref_i_np = o_ref_i.cpu().numpy() 
        
                   numpy.testing.assert_allclose(o_i_np, o_ref_i_np, rtol=1e-3, atol=1e-3)

). once batch-size/query lengths are determined, the kernel grid size are fixed and we use block valid mask for dynamic parallelism:

decode

defined:

flashinfer/include/flashinfer/attention/handler.cuh

Lines 389 to 401 in 231b1dc

    
           block_valid_mask_ = allocator.aligned_alloc<bool>(padded_batch_size * sizeof(bool), 16); 
        
           bool* block_valid_mask_h_ = 
        
               (bool*)page_locked_buffer_ + ((bool*)block_valid_mask_ - (bool*)new_indptr_); 
        
           std::fill(block_valid_mask_h_, block_valid_mask_h_ + padded_batch_size, 0); 
        
           size_t num_bytes_to_copy = (char*)allocator.ptr - (char*)new_indptr_; 
        
           FLASHINFER_CUDA_CALL(PartitionPagedKVCacheComputeAuxiliaryInfo( 
        
               max_num_pages_per_batch, batch_size, padded_batch_size, page_size, indptr, 
        
               last_page_len, (IdType*)new_indptr_h_, (IdType*)new_last_page_len_h_, 
        
               (IdType*)chunk_indptr_h_, (IdType*)batch_idx_map_h_, (IdType*)chunk_start_pos_h_, 
        
               (IdType*)seq_lengths_before_partition_h_, block_valid_mask_h_, 
        
               /*device_buffer=*/new_indptr_, 
        
               /*host_buffer=*/page_locked_buffer_, num_bytes_to_copy, stream_));

used

flashinfer/include/flashinfer/attention/decode.cuh

Line 542 in 231b1dc

if (block_valid_mask && !block_valid_mask[batch_idx]) return;

prefill/append

defined:

flashinfer/include/flashinfer/attention/handler.cuh

Lines 749 to 753 in 231b1dc

    
           bool* block_valid_mask_h_ = 
        
               (bool*)page_locked_buffer_ + ((bool*)block_valid_mask_ - (bool*)request_indices_); 
        
           for (uint32_t i = 0; i < padded_batch_size_; ++i) { 
        
             block_valid_mask_h_[i] = i < new_batch_size; 
        
           }

used:

flashinfer/include/flashinfer/attention/prefill.cuh

Lines 1365 to 1367 in 231b1dc

if (block_valid_mask && !block_valid_mask[bx]) {

return;

}

yzh119 · 2024-06-20T22:06:26Z

There is one tricky part about prefill kernels, we pass kv_chunk_size as input arguments and it doesn't work in CUDA graph mode because the input argument is fixed at capture time. So we pass a pointer to a global memory address that stores this value instead:

flashinfer/include/flashinfer/attention/prefill.cuh

Line 1362 in 231b1dc

const uint32_t kv_chunk_size = *kv_chunk_size_ptr;

and we will change its value in BeginForward functions during the generation process.

AgrawalAmey · 2024-06-20T22:42:25Z

This is great! Thanks a lot for the in-depth description. I will go ahead and add cuda graph support in sarathi-serve based on this.

This was referenced Apr 22, 2024

[Feature][Chunked Prefill]: Enable cuda graph for chunked prefill. vllm-project/vllm#4056

Open

[Kernel] Use flashinfer for decoding vllm-project/vllm#4353

Merged

yzh119 mentioned this issue May 24, 2024

perf: initial cuda graph support #256

Merged

yzh119 mentioned this issue Jun 2, 2024

feat: support cuda graph for batched multi-query(prefill/append) attention #275

Merged

yzh119 added a commit that referenced this issue Jun 2, 2024

feat: support cuda graph for batched multi-query(prefill/append) atte…

83ceb67

…ntion (#275) Followup of #187 and #256

yzh119 mentioned this issue Jun 7, 2024

[DO NOT MERGE] cudagraph: use cuda dynamic parallelism to dispatch kernels #288

Closed

yzh119 closed this as completed Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make flashinfer kernels cuda graphs friendly #187

Make flashinfer kernels cuda graphs friendly #187

AgrawalAmey commented Mar 20, 2024 •

edited

Loading

yzh119 commented Mar 22, 2024

AgrawalAmey commented Mar 23, 2024

AgrawalAmey commented Mar 23, 2024

yzh119 commented Mar 23, 2024 •

edited

Loading

AgrawalAmey commented Mar 23, 2024

ZSL98 commented Apr 3, 2024

AgrawalAmey commented Apr 3, 2024

yzh119 commented Jun 20, 2024

AgrawalAmey commented Jun 20, 2024

yzh119 commented Jun 20, 2024

yzh119 commented Jun 20, 2024

AgrawalAmey commented Jun 20, 2024

Make flashinfer kernels cuda graphs friendly #187

Make flashinfer kernels cuda graphs friendly #187

Comments

AgrawalAmey commented Mar 20, 2024 • edited Loading

yzh119 commented Mar 22, 2024

AgrawalAmey commented Mar 23, 2024

AgrawalAmey commented Mar 23, 2024

yzh119 commented Mar 23, 2024 • edited Loading

AgrawalAmey commented Mar 23, 2024

ZSL98 commented Apr 3, 2024

AgrawalAmey commented Apr 3, 2024

yzh119 commented Jun 20, 2024

AgrawalAmey commented Jun 20, 2024

yzh119 commented Jun 20, 2024

yzh119 commented Jun 20, 2024

AgrawalAmey commented Jun 20, 2024

AgrawalAmey commented Mar 20, 2024 •

edited

Loading

yzh119 commented Mar 23, 2024 •

edited

Loading