Downloadable Package in PyPI #153

WoosukKwon · 2024-03-04T20:56:04Z

Thanks again for the nice project! Are you interested in uploading the wheels (for CUDA 12.1) to PyPI? This will help users manage the dependency on the FlashInfer library.

zhyncs · 2024-03-05T05:32:36Z

> Thanks again for the nice project! Are you interested in uploading the wheels (for CUDA 12.1) to PyPI? This will help users manage the dependency on the FlashInfer library.

@WoosukKwon As a temporary workaround, the wheel can be installed directly from the GitHub release with a command such as

pip3 install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.2/flashinfer-0.0.2+cu121torch2.1-cp39-cp39-linux_x86_64.whl

@yzh119 If we want to set up a PyPI publishing workflow, we could use https://github.com/InternLM/lmdeploy/blob/main/.github/workflows/pypi.yml as a reference.
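For other interpreter versions, the install URL can be derived from the naming pattern of the wheel quoted above. The following is a minimal sketch, not an official helper: `wheel_url` and its parameters are hypothetical names, and it assumes that other Python/CUDA/PyTorch combinations follow the same filename pattern as the v0.0.2 `cp39` wheel. It does not check that the asset actually exists on the release page.

```python
import sys

# Base URL copied from the pip command quoted in this comment.
RELEASE_BASE = "https://github.com/flashinfer-ai/flashinfer/releases/download"


def wheel_url(version="0.0.2", cuda="cu121", torch="torch2.1", py=None):
    """Build a release-asset URL following the v0.0.2 wheel naming pattern.

    Assumption: every combination uses the same pattern as the cp39 wheel
    above. Whether a given asset was actually published is not verified.
    """
    if py is None:
        # e.g. "cp39" for CPython 3.9, taken from the running interpreter
        py = f"cp{sys.version_info.major}{sys.version_info.minor}"
    filename = f"flashinfer-{version}+{cuda}{torch}-{py}-{py}-linux_x86_64.whl"
    return f"{RELEASE_BASE}/v{version}/{filename}"


if __name__ == "__main__":
    print("pip3 install", wheel_url())
```

Passing `py="cp39"` reproduces the URL in the command above; other values are only guesses at what the release page offers.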

yzh119 · 2024-03-05T14:38:25Z

Hi @WoosukKwon, thanks for the suggestion. My only concern is the binary size: there are many combinations of Python version + CUDA version + PyTorch version, each wheel is ~500 MB, and I used to get warnings because of the large binary size. Did vLLM upload all of its wheels to PyPI?

@zhyncs thanks for the reference:

> Perhaps we may temporarily resolve this by using the command such as

PyPI has the unique advantage that other packages can declare flashinfer as a dependency, so I do think uploading flashinfer to PyPI is preferable.

WoosukKwon · 2024-03-06T00:41:03Z

@yzh119 I see. What we need at the moment are the Python 3.8-3.11 wheels built for PyTorch 2.1.2 + CUDA 12.1. However, we do agree that maintaining compatibility between the two libraries is quite tricky.

Alternatively, we're currently considering importing FlashInfer as a submodule and building the kernels ourselves. However, we found that FlashInfer takes too long to compile (30+ minutes on our machine). Do you have any ideas for reducing the compilation time?

WoosukKwon · 2024-03-06T07:18:05Z

@yzh119 Also, do you mind if the vLLM team hosts specific PyTorch + CUDA versions of FlashInfer in PyPI under the name of vllm-flashinfer-mirror or something like that? This will give us more control over the compatibility issue.

zhyncs · 2024-03-06T08:36:24Z

> @yzh119 Also, do you mind if the vLLM team hosts specific PyTorch + CUDA versions of FlashInfer in PyPI under the name of vllm-flashinfer-mirror or something like that? This will give us more control over the compatibility issue.

Sounds good.

zhyncs · 2024-08-27T05:55:05Z

FlashInfer currently needs to support Python 3.8 to 3.12, CUDA 11.8, 12.1, and 12.4, and Torch 2.1 to 2.4. The number and size of the resulting wheels exceed PyPI's limits. Please follow the recommended installation method at https://docs.flashinfer.ai/installation.html
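To put a rough number on the support matrix above, here is a back-of-the-envelope sketch. It assumes a full cross-product of the listed versions (in practice not every PyTorch release is built for every CUDA or Python version, so this is an upper bound) and reuses the ~500 MB per-wheel figure mentioned earlier in the thread, which may no longer be accurate. The function names are illustrative only.

```python
from itertools import product

# Versions listed in the comment above (minor versions only).
python_versions = ["3.8", "3.9", "3.10", "3.11", "3.12"]
cuda_versions = ["11.8", "12.1", "12.4"]
torch_versions = ["2.1", "2.2", "2.3", "2.4"]

# Rough per-wheel size quoted earlier in the thread; treat as an assumption.
WHEEL_SIZE_MB = 500


def wheel_matrix(pythons, cudas, torches):
    """Enumerate every (python, cuda, torch) combination (full cross-product)."""
    return list(product(pythons, cudas, torches))


def total_size_gb(num_wheels, wheel_size_mb=WHEEL_SIZE_MB):
    """Total upload size in GB for one release, using 1 GB = 1000 MB."""
    return num_wheels * wheel_size_mb / 1000


combos = wheel_matrix(python_versions, cuda_versions, torch_versions)
print(f"{len(combos)} wheels, ~{total_size_gb(len(combos)):.0f} GB per release")
```

Under these assumptions a single release would need up to 60 wheels, roughly 30 GB in total, and every new release would add the same amount again.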

This PR implements JIT compilation (#170) for flashinfer. After this PR, flashinfer will compile kernels just-in-time for different input data types and shapes and cache the kernels on disk, instead of pre-compiling a set of kernels in the wheel.

# Motivation

The pip wheel size is exploding as we add support for more data types, more head dimensions, more attention variants, and more kernel implementations. Pre-compiling everything is not sustainable and impedes development speed.

This PR refactors the codebase to use torch's [JIT Compiling Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html#jit-compiling-extensions) feature instead of pre-compiling kernels in the wheel.

## Attention Variants

We learned from [FlexAttention](https://pytorch.org/blog/flexattention/) and describe every attention variant as a template class. Each instance of the struct can carry closure variables defined in local memory or shared memory. Below are two examples (logits soft cap and ALiBi attention; the programming interface is tentative and will be updated as we improve the programmability of the JIT template):

```cuda
template <typename ParamsT>
struct LogitsSoftCap {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;

  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ LogitsSoftCap(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  // Scale the query by sm_scale / logits_soft_cap before the QK product.
  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::ptx_rcp(params.logits_soft_cap);
  }

  // Apply the soft cap: logits_soft_cap * log2e * tanh(logits).
  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return params.logits_soft_cap * math::log2e * float(math::tanh(logits));
  }

  // No masking: every (qo_idx, kv_idx) pair is kept.
  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx,
                                             uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return true;
  }
};

template <typename ParamsT>
struct ALIBIAttention {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;
  using IdType = typename ParamsT::IdType;

  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ ALIBIAttention(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  // Scale the query by sm_scale * log2e before the QK product.
  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::log2e;
  }

  // Add the per-head ALiBi bias: slope * (kv_idx - qo_idx).
  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return logits + params.alibi_slopes[qo_head_idx] * float(int(kv_idx) - int(qo_idx));
  }

  // No masking: every (qo_idx, kv_idx) pair is kept.
  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx,
                                             uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return true;
  }
};
```

Users can customize their own `ParamsT` class and variant classes to define their own attention variants. We hope this refactor will make the codebase more concise and extensible.

# Roadmap

After this PR, we will add support for:

1. PyPI wheels #153
2. fp8 tensor cores attention: #502
3. different head dimensions: #142 #454 #455
4. flashattention3 #369
5. multi-head latent attention #237
6. Generate ParamsT and Attention variants description from python dsl

The development of these features has been blocked by the limitation of wheel size (binary size >= 2GB will trigger some linking issues). I hope this PR will make development easier in the future.
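As a reading aid for the `LogitsSoftCap` example in the PR description, here is a scalar Python sketch of what the split between `QueryTransform` and `LogitsTransform` appears to compute. It is not FlashInfer code: the function names are illustrative, `q` and `k` are single floats rather than vectors, and the interpretation of the `log2e` factor (that the kernel later exponentiates in base 2) is an assumption, not something stated in the thread.

```python
import math

LOG2E = math.log2(math.e)  # counterpart of math::log2e in the CUDA snippet


def soft_cap_reference(qk, sm_scale, cap):
    """Textbook logits soft cap: cap * tanh(sm_scale * qk / cap)."""
    return cap * math.tanh(sm_scale * qk / cap)


def query_transform(q, sm_scale, cap):
    # Mirrors LogitsSoftCap::QueryTransform: fold sm_scale / cap into the query.
    return q * sm_scale * (1.0 / cap)


def logits_transform(logits, cap):
    # Mirrors LogitsSoftCap::LogitsTransform: cap * log2e * tanh(logits).
    return cap * LOG2E * math.tanh(logits)


def split_soft_cap(q, k, sm_scale, cap):
    """Apply the two transforms around a scalar stand-in for the QK product."""
    logits = query_transform(q, sm_scale, cap) * k
    return logits_transform(logits, cap)


def alibi_bias(slope, qo_idx, kv_idx):
    # Mirrors the bias term in ALIBIAttention::LogitsTransform.
    return slope * float(kv_idx - qo_idx)


if __name__ == "__main__":
    q, k, sm_scale, cap = 3.0, 2.0, 0.125, 30.0
    split = split_soft_cap(q, k, sm_scale, cap)
    ref = soft_cap_reference(q * k, sm_scale, cap)
    # Assumption: exponentiating in base 2 undoes the log2e factor.
    print(math.isclose(2.0 ** split, math.exp(ref)))
```

Under that assumption, the two-step form gives the same softmax weights as the textbook soft cap, while letting the scaling be applied once to the query instead of to every logit.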

zhyncs closed this as completed Aug 27, 2024

yzh119 mentioned this issue Sep 25, 2024

feat: JIT compilation #507

Merged
