chore(deps): update dependency com_github_nvidia_cutlass to v3.5.1 (#838)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [com_github_nvidia_cutlass](https://togithub.com/NVIDIA/cutlass) | http_archive | patch | `v3.5.0` -> `v3.5.1` |

---

### Release Notes

<details>
<summary>NVIDIA/cutlass (com_github_nvidia_cutlass)</summary>

### [`v3.5.1`](https://togithub.com/NVIDIA/cutlass/releases/tag/v3.5.1): CUTLASS 3.5.1

[Compare Source](https://togithub.com/NVIDIA/cutlass/compare/v3.5.0...v3.5.1)

- [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu).
- [Exposure of L2 `cache_hint`s in TMA copy atoms](./include/cute/arch/copy_sm90_tma.hpp#L48).
- Exposure of raster order and tile swizzle extent in the [CUTLASS library profiler](./media/docs/profiler.md#GEMM) and [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- [TMA-store-based and EVT-supported epilogues](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer-array batched kernels](./test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu).
- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
- [CUDA host adapter](./include/cutlass/cuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs.
- Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in the CUTLASS library and profiler](./python/cutlass_library/generator.py).
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution [epilogue](./examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of [include files throughout CUTLASS core directories](./include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies, and [tests to guard against them](./test/self_contained_includes/CMakeLists.txt).
- [A guide for setting up VSCode to work well with CUTLASS](./media/docs/ide_setup.md) and an [expanded code style guide](./media/docs/programming_guidelines.md).
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
- NOTICE:
  - The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered a beta API.
  - The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer-array batched epilogue in order to support grouped GEMMs.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/secretflow/spu).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
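For reference, a patch bump of an `http_archive` dependency like this one amounts to updating the version string (and checksum) in the repository's Bazel setup. A hypothetical sketch of what the change looks like — the URL layout and `sha256` below are placeholders, not the actual values from this repository:

```python
# WORKSPACE (Bazel Starlark) -- illustrative sketch only.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "com_github_nvidia_cutlass",
    # Renovate bumps the tag v3.5.0 -> v3.5.1 here...
    urls = ["https://github.com/NVIDIA/cutlass/archive/refs/tags/v3.5.1.tar.gz"],
    strip_prefix = "cutlass-3.5.1",
    # ...and the checksum must be updated to match the new tarball.
    sha256 = "<checksum of the v3.5.1 archive>",
)
```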
- Loading branch information
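The `GemmSparseUniversal` item in the release notes above mentions serial and parallel split-k. As a conceptual illustration only (plain NumPy, not the CUTLASS API; function name and shapes are invented for this sketch), split-k partitions the GEMM K dimension into chunks whose partial products are reduced at the end:

```python
import numpy as np

def splitk_gemm(A, B, splits=4):
    """Compute A @ B by splitting the shared K dimension into chunks."""
    K = A.shape[1]
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    # One partial GEMM per K-chunk. In a serial split-k kernel a single
    # workgroup accumulates these chunks in order; in parallel split-k the
    # chunks run concurrently and a separate reduction pass sums them.
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds, bounds[1:])]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
B = rng.standard_normal((64, 8))
assert np.allclose(splitk_gemm(A, B), A @ B)
```

Splitting K raises occupancy when M and N are small (few output tiles), which is why it pairs naturally with the tiny tile sizes mentioned for LLM inference.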