cudnn frontend v1.9 release notes
New API
Enhancements to flash attention API
- SDPA_attributes and SDPA_bprop_attributes now accept a score_mod function through the set_score_mod and set_score_mod_bprop APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors such as causal mask, sliding window mask, and soft capping have been added to the headers as a reference. More usage examples have been added to the fprop and bprop samples; a soft-capping sketch is also shown after this list.
- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added support for bottom right causal masking with sliding window mask.
- Added new parameters, set_max_total_seq_len_q/set_max_total_seq_len_kv, on the SDPA bprop node. They help reduce the workspace size required when running with the THD format; a short sketch is shown after this list.
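As an illustration of the score_mod hook, below is a minimal soft-capping sketch (score becomes cap * tanh(score / cap)) built from the graph's pointwise operations. The callback signature, the wrapper function add_softcap_score_mod, and the scalar cap/inv_cap tensors are assumptions made for illustration; the shipped headers and samples are the authoritative reference for the exact types.

```cpp
#include <memory>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Hypothetical helper: register a soft-capping score_mod on an existing
// SDPA_attributes object. The callback is assumed to receive the graph and the
// attention score tensor and to return the modified score tensor.
void add_softcap_score_mod(fe::graph::SDPA_attributes& sdpa_attributes) {
    auto softcap = [](std::shared_ptr<fe::graph::Graph> graph,
                      std::shared_ptr<fe::graph::Tensor_attributes> score)
        -> std::shared_ptr<fe::graph::Tensor_attributes> {
        // Scalar pass-by-value tensors holding 1/cap and cap; their values are
        // supplied through the variant pack at execution time.
        auto inv_cap = graph->tensor(fe::graph::Tensor_attributes()
                                         .set_name("inv_cap")
                                         .set_dim({1, 1, 1, 1})
                                         .set_stride({1, 1, 1, 1})
                                         .set_is_pass_by_value(true)
                                         .set_data_type(fe::DataType_t::FLOAT));
        auto cap = graph->tensor(fe::graph::Tensor_attributes()
                                     .set_name("cap")
                                     .set_dim({1, 1, 1, 1})
                                     .set_stride({1, 1, 1, 1})
                                     .set_is_pass_by_value(true)
                                     .set_data_type(fe::DataType_t::FLOAT));

        // Custom chain of pointwise operations on the attention score matrix.
        auto scaled = graph->pointwise(score, inv_cap,
            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
        auto squashed = graph->pointwise(scaled,
            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::TANH_FWD));
        return graph->pointwise(squashed, cap,
            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
    };

    sdpa_attributes.set_score_mod(softcap);
    // The corresponding backward modifier would be registered with
    // set_score_mod_bprop on the bprop attributes.
}
```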
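And a minimal sketch of the new bprop hint, assuming an already configured SDPA bprop attributes object named sdpa_bprop_attributes; the token counts are illustrative.

```cpp
// With the THD (packed, variable sequence length) layout, passing the true
// total token counts lets the bprop node size its workspace for the packed
// totals rather than the dense worst case (batch * max_seq_len).
int64_t max_total_tokens_q  = 8 * 1024;  // illustrative: sum of Q sequence lengths
int64_t max_total_tokens_kv = 8 * 1024;  // illustrative: sum of K/V sequence lengths

sdpa_bprop_attributes.set_max_total_seq_len_q(max_total_tokens_q);
sdpa_bprop_attributes.set_max_total_seq_len_kv(max_total_tokens_kv);
```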
Improvements
- Allow creation of serialized json for dgrad, wgrad, and resample operations; a round-trip sketch is shown after this list.
- Added more diagnostic messages when the compiled version of cudnn does not match the run-time version of cudnn.
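A rough round-trip sketch for the serialization improvement. It assumes the serialize/deserialize entry points on the graph and pre-existing graph and handle objects; with this release, graphs containing dgrad, wgrad, or resample nodes should also be accepted by this path.

```cpp
#include <vector>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// `graph` is a built fe::graph::Graph that may now contain dgrad, wgrad, or
// resample nodes; `handle` is a valid cudnnHandle_t (both assumed to exist).
std::vector<uint8_t> blob;
if (graph.serialize(blob).is_good()) {
    fe::graph::Graph restored;
    if (restored.deserialize(handle, blob).is_good()) {
        // `restored` can be executed with the usual variant pack and workspace,
        // without rebuilding plans from scratch.
    }
}
```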
Bug fixes
- Fixed an issue where log messages contained unparseable data at the end.
- Fixed an issue where building the python pip wheel would hang.
- Fixed native cuda graph creation for SDPA with alibi masks.
New samples
- Added a new sample for Layernorm with dynamic shapes and a kernel cache to showcase reduced plan build time when using the kernel cache.
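For context, a configuration sketch of what the sample exercises. The set_dynamic_shape_enabled / set_kernel_cache calls and the KernelCache type follow the dynamic-shape support from earlier 1.x releases; the loop and shapes are illustrative, and the LayerNorm graph construction itself is elided.

```cpp
#include <memory>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// One kernel cache shared by graphs built for different input shapes.
auto kernel_cache = std::make_shared<fe::KernelCache>();

for (int64_t batch : {8, 16, 32}) {        // illustrative dynamic dimension
    fe::graph::Graph graph;
    graph.set_dynamic_shape_enabled(true); // mark the graph as dynamic-shape
    graph.set_kernel_cache(kernel_cache);  // reuse kernels across shapes

    // ... declare the LayerNorm inputs with `batch` in their dimensions, add the
    //     layernorm node, then validate/build as in the shipped sample ...
    // After the first shape is built, subsequent builds can hit the kernel cache
    // and skip recompilation, which is the plan-build-time saving the sample
    // measures.
}
```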