
Releases: NVIDIA/cudnn-frontend

v1.9.0 release

20 Dec 19:22
ee971b1

cudnn frontend v1.9 release notes

New API

Enhancements to flash attention API

  • SDPA_attributes and SDPA_bprop_attributes now accept a score_mod function through the set_score_mod and set_score_mod_bprop APIs. The function accepts a custom chain of pointwise operations that operate on the attention score matrix. Common functors such as causal mask, sliding window mask, and soft capping have been added to the headers as references, and more usage examples have been added to the samples for fprop and bprop (see the hedged sketch after this list).

  • Added support for THD format and sliding window mask.

  • Added support for THD format and bottom-right causal mask.

  • Added support for bottom-right causal masking with sliding window mask.

  • Added new setters set_max_total_seq_len_q/set_max_total_seq_len_kv on the sdpa bprop node. These help reduce the workspace size required when running with the THD format.
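
A hedged sketch of the score_mod hook referenced in the first bullet above; the callback signature (graph plus attention-score tensor in, modified score tensor out), the fe namespace alias, and the pointwise mode names are assumptions, so consult the fprop/bprop samples for the exact form.

// Sketch only: soft-cap the attention scores before softmax.
// Assumes: namespace fe = cudnn_frontend; sdpa_attributes is an SDPA_attributes instance.
auto score_mod = [](std::shared_ptr<fe::graph::Graph> graph,
                    std::shared_ptr<fe::graph::Tensor_attributes> score) {
    auto cap    = graph->tensor(30.0f);  // illustrative soft-cap value
    auto scaled = graph->pointwise(score, cap,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::DIV));
    auto capped = graph->pointwise(scaled,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::TANH_FWD));
    return graph->pointwise(capped, cap,
                            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
};
sdpa_attributes.set_score_mod(score_mod);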

Improvements

  • Allow creation of serialized JSON for dgrad, wgrad, and resample operations.

  • Added a more detailed diagnostic message for the case where the compiled version of cudnn does not match the runtime version of cudnn.

Bug fixes

  • Fixed an issue where log messages contained unparseable data at the end.

  • Fixed an issue where building the python pip wheel would hang.

  • Fixed native cuda graph creation for SDPA with alibi masks.

New samples

  • Added a new Layernorm sample with dynamic shapes and a kernel cache, showcasing the reduced plan build time when the kernel cache is used.

v1.8.0 release

23 Oct 18:44
936021b

cudnn frontend v1.8 release:

New API

Paged Attention API

The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. SDPA_attributes now accepts set_paged_attention_k_table and set_paged_attention_v_table to pass in these descriptors. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory and using page tables to reconstruct them. Please refer to the cpp and python samples for usage and to the docs for more API details; for more information, refer to the cudnn_graph library documentation and the Paged Attention paper.
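
A hedged sketch of wiring the page tables into the forward attributes; the tensor properties shown (names, data type) are illustrative and the container/page-table dims and strides are omitted here, so follow the samples for the exact descriptor setup.

// Illustrative only: declare page-table tensors and hand them to SDPA_attributes.
// Assumes: namespace fe = cudnn_frontend; dims/strides omitted for brevity.
auto page_table_k = graph.tensor(fe::graph::Tensor_attributes()
                                     .set_name("page_table_k")
                                     .set_data_type(fe::DataType_t::INT32));
auto page_table_v = graph.tensor(fe::graph::Tensor_attributes()
                                     .set_name("page_table_v")
                                     .set_data_type(fe::DataType_t::INT32));
sdpa_attributes.set_paged_attention_k_table(page_table_k);
sdpa_attributes.set_paged_attention_v_table(page_table_v);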

CUDA Graph API

The cudnn graph now allows users to directly build a native CUDA graph for a given sub-graph (requires cudnn 9.5.0). There are two APIs:

  • populate_cuda_graph: adds the cudnn nodes to the empty cuda_graph provided as input.
  • update_cuda_graph: updates the populated cuda graph with the necessary data pointers.
    See docs and backend documentation for more details.
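
A minimal sketch of the intended two-step flow; the exact argument lists of populate_cuda_graph/update_cuda_graph are assumptions, so see the docs referenced above.

// Assumed usage: populate once, then refresh data pointers between launches.
// handle, variant_pack, and workspace are placeholders for your cudnnHandle_t,
// UID-to-device-pointer map, and workspace buffer.
cudaGraph_t cudnn_cuda_graph;
cudaGraphCreate(&cudnn_cuda_graph, 0);  // empty CUDA graph to be populated

// Step 1: add the cudnn nodes for this graph into the empty CUDA graph.
graph.populate_cuda_graph(handle, variant_pack, workspace, cudnn_cuda_graph);

// Step 2: later, update the populated CUDA graph with new data pointers.
graph.update_cuda_graph(handle, new_variant_pack, workspace, cudnn_cuda_graph);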

Enhancements

  • The kernel cache for dynamic shapes is now supported in python. Added a sample to showcase usage.

  • graph.deselect_engines(...) now has a python equivalent through pybind11.

  • graph.tensor(...) can now accept int64_t scalars directly (previously limited to int32_t, float, and fp16 data types). See the brief sketch after this list.

  • fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.

  • More enhancements to pointwise output stride inferencing (for broadcast operations). For non-unary operands, the broadcast tensor can now be at either IN_0 or IN_1.

  • The SDPA backward operation now allows head dimension d up to 256 on Hopper. Requires cudnn 9.5.0 and above.
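
As referenced in the third bullet above, a brief sketch of passing an int64_t scalar; the variable name is a placeholder.

// Creates a scalar tensor directly from the int64_t value.
int64_t seed_value = 123456789;
auto seed = graph.tensor(seed_value);  // previously required int32_t, float, or fp16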

Bug fixes

  • Fixed an issue when querying cudnnGetLastErrorString() from the backend. The error_t object will now carry a more meaningful message.

  • Fixed build issues seen with the clang-19 compiler.

  • Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.

cudnn FE 1.7.0 Release

23 Sep 20:53
de355c7

cudnn FE 1.7.0 Release notes:

New API

  • Kernel cache support for dynamic graphs: added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to the documentation for API details.

Added examples (Convolution fprop dynamic shape, CSBR graph dynamic shape, Matmul dynamic shape, and Bias + Matmul dynamic shape) to showcase the use of dynamic shapes and the kernel cache.

  • Introduced two new APIs that describe a plan in terms of its engine number and knobs:
error_t
get_plan_name(std::string &name) const;

error_t
get_plan_name_at_index(int64_t plan_index, std::string &name) const;

Note:
The returned name can be passed later to deselect_plan_by_name if you run into any errors with a particular plan.
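
For example (a sketch; candidate_index is a placeholder and error handling is abbreviated):

int64_t candidate_index = 0;  // placeholder: index of the plan you are inspecting
std::string plan_name;
auto status = graph.get_plan_name_at_index(candidate_index, plan_name);
if (status.is_good()) {
    // If this plan later proves problematic, block it on the next build:
    graph.deselect_plan_by_name(plan_name);
}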

  • Added an API to query a tensor's attributes from its UID in a graph: query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;
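
A brief usage sketch; uid here stands for whatever UID the tensor was assigned when the graph was built.

cudnn_frontend::graph::Tensor_attributes tensor;
auto status = graph.query_tensor_with_uid(uid, tensor);
if (status.is_good()) {
    auto dims = tensor.get_dim();  // inspect the queried attributes
}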

Improvements

  • sdpa fp16 bprop node can now compute dbias when padding mask is enabled (requires cudnn 9.4.0 and above).

  • sdpa fp8 (forward and bprop) nodes now support optional bias, dropout, and padding mask (requires cudnn 9.4.0 and above).

  • Matmul fp8 node can now accept M, N, K overrides.

  • Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cuDNN.

  • Updated benchmark numbers with cudnn 9.4.0 for fp16 and fp8 datatypes.

  • Fixed compilation issues when NV_CUDNN_DISABLE_EXCEPTION is enabled.

Bug fixes

  • Fixed a crash when the output dimension of the dgrad node is not specified. This now returns an error message instead.

  • Fixed incorrect SDPA stats stride inferencing.

  • Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key length (s_kv). This case is now reported as not supported.

cudnn FE 1.6.1 release

20 Aug 04:14
2533f5e

Bug fixes

  • Fixed an issue where custom dropout mask was not correctly applied.
  • Added -fvisibility=hidden for the generated pip wheels to avoid symbol conflicts with other modules that use cudnn frontend.
  • Fixed an issue in the sdpa operation which, when deserialized, would lead to numerical mismatches.
  • Fixed an issue in the sdpa fp8 fprop operation (in inference mode).

Samples

  • Added a new sample to showcase how a custom dropout mask can be applied to an sdpa operation.
  • Added a sample to showcase convolutions on large (c * d * h * w > 2^31) tensors.

v1.6.0 release

12 Aug 23:17
23511ba

Release notes:

New API

  • Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
  • SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
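
For instance, a minimal sketch of the SM carveout property (the count is illustrative):

// Restrict the kernels selected for this graph to 16 SMs, leaving the rest free.
graph.set_sm_count(16);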

Bug Fixes

  • Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
  • SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

Enhancements

  • Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
  • Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
  • Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
  • Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
  • Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
  • JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
  • Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
  • CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

Samples

  • Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.

v1.5.2 release

25 Jun 23:18
98ca4e1

[Enhancement] Allows a stride value of 0, indicating repetition of the tensor in those dimensions.
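
For example, a sketch of describing a per-channel bias as a full [N, C, H, W] tensor backed by only C values; n, c, h, w and the fe alias are placeholders.

// Zero strides repeat the C channel values across the N, H, and W dimensions.
auto bias = graph.tensor(fe::graph::Tensor_attributes()
                             .set_name("bias")
                             .set_dim({n, c, h, w})
                             .set_stride({0, 1, 0, 0})
                             .set_data_type(fe::DataType_t::HALF));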

v1.5.1 release

18 Jun 01:48
aa3abd4

v1.5.1

[Bug fix] Fixed an issue where cudnn-frontend 1.5.0, when built with cudnn version 9.1.1 or below, runs into issues when run with cudnn 9.2.0 or above.

v1.5.0 release

13 Jun 08:48
47d800c

[New feature] With cudnn backend 9.2.0 and above, Graph::check_support can determine support for runtime engines without invoking the nvrtc compiler. This allows users to check the support surface of cudnn without triggering nvrtc compilation.

[New feature] Python pip wheel now contains the necessary c++ development headers.

[New feature] Sliding window attention is now supported as an attribute to the sdpa forward and bprop node. Usage:
sdpa_attributes.set_sliding_window_length(window_length)

[New feature] Bottom right aligned causal masking is now supported as an attribute to the sdpa forward and bprop node. Usage: sdpa_attributes.use_causal_mask_bottom_right(true)

[New feature] SDPA bprop attributes can now select a deterministic algorithm using the use_deterministic_algorithm API.

[New feature] Allow users to filter a graph's candidate execution plans by their shared memory usage in cudnn 9.2.0 and later.

[Bug fix] Fixed a runtime error that occurred when the chosen execution plan candidate was incorrectly set in the backend. This would happen when check_support did not correctly filter by workspace size.

[Bug fix] Selecting/deselecting engines by behavior and numerical notes has been fixed and now works as intended.

[Debugging] A new tool for easily reproducing a failure from the json representation of a graph has been added.

[Samples] Restructured the cpp samples into categories for easier navigation.

[Samples] Added a sample to showcase how different plans can be built in parallel in separate threads.

[Compilation enhancement] Added a new macro, CUDNN_FRONTEND_SKIP_NLOHMANN_JSON, as a compilation flag to remove nlohmann::json as a compilation dependency. Users lose access to certain API functions such as print, key, serialize, and deserialize that depend on that library.
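
For example (a sketch; the guard pattern is illustrative, and the macro is typically passed as a compile definition such as -DCUDNN_FRONTEND_SKIP_NLOHMANN_JSON):

// Code that relies on the JSON-backed APIs (print/key/serialize/deserialize)
// can be guarded on the same macro so it compiles either way.
#if !defined(CUDNN_FRONTEND_SKIP_NLOHMANN_JSON)
    // ... JSON-dependent cudnn frontend calls go here ...
#endif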

[Enhancement] Serialization of resample operation is now supported.

[Enhancement] A bug template has been added for new github issues.

v1.4.0 release

07 May 16:54
b740542

[New] Added a benchmark folder which contains a sample docker file to compare the cudnn implementation of sdpa with the pytorch implementation.

[Enhancement] Once an engine is de-selected by name, it will not be built as part of check support.

[Enhancement] The cudnn backend search order for wheels is as follows: (a) dlopen libcudnn.so.MAJOR_VERSION in the site packages; (b) try to dlopen the unversioned libcudnn.so. This way, the pypi cudnn package nvidia-cudnn-cu* gets priority over the default search path.

[Enhancement] Allow embedding dimension up to 256 (previously limited to 128) in the sdpa fprop operation.

[Bug fix] Updated the scale and bias shapes in the batch norm sample.

v1.3.0 release

10 Apr 17:51
1b0b5ea

[New API] Added new operations sdpa_fp8_forward and sdpa_fp8_backward to perform scaled dot product attention on fp8 tensors. See docs/operations/Attention.md for more details and samples/cpp/mha.cpp for a cpp sample. Pybinds for the fp8 nodes are also added.

[New API] Added a new operation for resample forward. Added a new sample, samples/cpp/resample.cpp, to show its usage.

[New API] Added a new API deselect_engines(std::vector<std::string> const &engine_names) which blocks certain engine configs from running.
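
A brief sketch (the engine config names shown are placeholders; real names can be taken from cudnn diagnostics or logs):

// Prevent these engine configs from being considered during plan building.
graph.deselect_engines({"eng0_k2", "eng35"});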

[New API] Added new APIs select_numeric_notes and select_behavior_notes to allow users to select engine configs that have the specified numeric and behavior notes, respectively.

[Python API] Added a custom exception, cudnnGraphNotSupportedException, to the python API to distinguish graphs that are genuinely not supported from programming errors.

[Python API] Added a new backend_version_string which returns the backend version in canonical form (e.g. 9.1.0) instead of a version number.

[Bug Fix] Fixed issues with compilation on clang-19 and the C++20 standard.

[Bug Fix] Updated the workspace computation for the sdpa fprop node. Previously, workspace was calculated for alibi slopes irrespective of whether the alibi mask was enabled.

[Bug Fix] Fixed deserialization of fused scalars.