[tx] Add cuDNN flash attention #879
Conversation
Code Review
This pull request introduces an excellent optimization by adding support for cuDNN flash attention, which will significantly reduce memory usage during training and prefill on GPUs. The implementation is robust, automatically handling both left- and right-padded sequences by cleverly converting them to the right-padded format required by cuDNN. The fallback to a standard mask-based attention for the decode phase and non-GPU backends is a sound design choice. The new functionality is accompanied by a comprehensive set of unit tests for both CPU and GPU, ensuring numerical correctness across different scenarios. The addition of a dedicated GPU CI workflow is also a great step towards maintaining the stability of this feature. I have one suggestion to further improve the test coverage for an edge case.
/gemini review
Code Review
The pull request introduces cuDNN flash attention support for JAX models, significantly reducing memory usage for causal attention during training and prefill. It includes a new shared attention module that intelligently selects between cuDNN flash attention and a mask-based fallback, handling left-padded sequences by internally converting them to a right-padded format. New GPU CI workflows and comprehensive tests for the new attention mechanism and sequence shifting utility have also been added. The changes are well-structured and integrate smoothly into the existing model architectures.
- Use seq_lengths instead of attention_mask for attention computation
- On GPU: use cuDNN flash attention with query_seq_lengths/key_value_seq_lengths
- On CPU/TPU: fall back to mask-based attention (construct mask from seq_lengths)
- cuDNN flash attention provides O(seq) memory vs O(seq²) for standard attention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
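As a rough illustration of the dispatch described in this commit, the call site might look something like the sketch below. The shapes, seeds, and mask construction are illustrative rather than the PR's actual code, and it assumes a recent JAX where `jax.nn.dot_product_attention` accepts `query_seq_lengths`, `key_value_seq_lengths`, and `implementation`.

```python
import jax
import jax.numpy as jnp

# Illustrative shapes, not taken from the PR.
B, S, N, H = 2, 128, 8, 64
q = jax.random.normal(jax.random.key(0), (B, S, N, H), dtype=jnp.bfloat16)
k = jax.random.normal(jax.random.key(1), (B, S, N, H), dtype=jnp.bfloat16)
v = jax.random.normal(jax.random.key(2), (B, S, N, H), dtype=jnp.bfloat16)
seq_lengths = jnp.array([128, 96], dtype=jnp.int32)  # valid tokens per (right-padded) row

if jax.default_backend() == "gpu":
    # cuDNN flash attention: takes per-row valid lengths directly, O(seq) memory.
    out = jax.nn.dot_product_attention(
        q, k, v,
        is_causal=True,
        query_seq_lengths=seq_lengths,
        key_value_seq_lengths=seq_lengths,
        implementation="cudnn",
    )
else:
    # CPU/TPU fallback: build an explicit causal + padding mask from seq_lengths.
    pos = jnp.arange(S)
    valid = pos[None, :] < seq_lengths[:, None]                       # (B, S)
    mask = valid[:, None, None, :] & (pos[:, None] >= pos[None, :])   # (B, 1, S, S)
    out = jax.nn.dot_product_attention(q, k, v, mask=mask, implementation="xla")
```

The fallback still materializes a (B, 1, S, S) mask, which is where the O(seq²) memory of the standard path comes from.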
- Extract shared attention logic to tx/models/attention.py
- Use cuDNN flash attention only for right-padded sequences on GPU
- Fall back to mask-based attention for left-padded (generation) or CPU/TPU
- Fixes generation bug where cuDNN received wrong valid positions
Shift left-padded sequences to right-padded before applying cuDNN flash attention, then shift the output back. This enables O(S^2) -> O(S) memory savings for inference prefill while keeping mask-based attention for decode (where flash attention provides minimal benefit).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use argmax to find first valid token position (0 for right-padded, >0 for left-padded)
- Always apply shift (no-op when shift=0), avoiding dual-branch compilation
- Document that attention_mask must have at least one valid token per batch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
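A minimal sketch of the shifting described here could look like the following; the function names, signature, and the per-row `jnp.roll` are illustrative assumptions, not the PR's actual `_shift_sequences` implementation.

```python
import jax
import jax.numpy as jnp

def shift_sequences(x: jax.Array, attention_mask: jax.Array) -> tuple[jax.Array, jax.Array]:
    # First valid token per row via argmax: 0 for right-padded rows, > 0 for left-padded
    # ones. Assumes every row has at least one valid token, as documented above.
    shift = jnp.argmax(attention_mask, axis=1)  # (B,)
    # Always roll; rolling by 0 is a no-op, so both paddings share one compiled branch.
    shifted = jax.vmap(lambda row, s: jnp.roll(row, -s, axis=0))(x, shift)
    return shifted, shift

def unshift_sequences(x: jax.Array, shift: jax.Array) -> jax.Array:
    # Undo the shift on the attention output so it lines up with the original layout.
    return jax.vmap(lambda row, s: jnp.roll(row, s, axis=0))(x, shift)
```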
- Add CPU tests for _shift_sequences and basic attention (tests/models/)
- Add GPU tests for cuDNN vs mask-based numerical equivalence (tests/gpu/)
- Add gpu_skyrl_tx.yaml workflow using Anyscale for GPU testing
- Update cpu_skyrl_tx.yaml to exclude tests/gpu/

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This is in preparation for merging #879. Updated version of #918, rebased on top of #919. The reason we do this: while it is slightly less natural for sampling prefill to be left aligned (right padded), it makes things more uniform, since that is the same alignment used during training, and the JAX cuDNN flash attention doesn't currently support left padding as far as we are aware. It is a small change and actually a little simpler, so what's not to like? Plus, the work on this PR uncovered a bug in the prompt logprobs.
Resolve conflicts in llama3.py and qwen3.py by keeping both dot_product_attention and LogitsProcessorMixin imports. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Main branch updated inference to use right padding, so the shifting logic to convert left-padded to right-padded format is no longer needed.

- Remove _shift_sequences() function from attention.py
- Simplify dot_product_attention() to directly use cuDNN
- Remove CPU tests for shift_sequences
- Update GPU tests to only test right-padded sequences

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
skyrl-tx/ci/anyscale_gpu_ci.yaml
    ray_version: "2.51.1"
    compute_config: l4_ci
    working_dir: .
    env_vars:
I'm pretty sure this is not needed any more on modern Ray versions :)
yeah, copied from other files. removed all
In #927, we left aligned the prompts for sampling prefill in preparation for #879. Here we shift them after the prefill so they will again be right aligned for the decoding. This will allow us to use the cudnn attention #927 for both prefill and decoding. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR writes the new decoded token into the kv cache in such a way that the whole sequence is left aligned. This is needed so that the CUDNN attention #879 truly works without attention mask.
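To make the mechanism concrete, a sketch of such a left-aligned cache update is below; the helper name, shapes, and the use of `dynamic_update_slice` are illustrative assumptions rather than the PR's actual code.

```python
import jax
import jax.numpy as jnp

def append_to_kv_cache(cache: jax.Array, new_kv: jax.Array, cache_lens: jax.Array):
    """Write one decode-step key/value right after each row's valid prefix.

    cache: (B, S_max, K, H), new_kv: (B, 1, K, H), cache_lens: (B,) valid entries per row.
    Writing at index cache_lens keeps every sequence left aligned (padding only on the right).
    """
    def write_row(row_cache, row_kv, length):
        return jax.lax.dynamic_update_slice(row_cache, row_kv, (length, 0, 0))

    cache = jax.vmap(write_row)(cache, new_kv, cache_lens)
    return cache, cache_lens + 1
```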
/gemini review
Code Review
This pull request introduces a significant performance improvement by adding support for cuDNN's flash attention on GPUs. The new dot_product_attention utility is well-designed, providing a clean abstraction with a fallback for other backends. The models are correctly updated to use this new utility, and the changes are accompanied by a comprehensive suite of GPU-specific tests that validate the numerical equivalence with the existing XLA implementation.
My review focuses on the new test code, where I've suggested a couple of improvements for maintainability and adherence to modern JAX practices. Overall, this is a solid contribution.
    q = jax.random.normal(jax.random.key(0), (batch, seq_len, num_heads, head_dim), dtype=dtype)
    k = jax.random.normal(jax.random.key(1), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
    v = jax.random.normal(jax.random.key(2), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
jax.random.key is an alias for jax.random.PRNGKey and its usage is discouraged in new code. Please use jax.random.PRNGKey for clarity and future compatibility.
Suggested change:

    - q = jax.random.normal(jax.random.key(0), (batch, seq_len, num_heads, head_dim), dtype=dtype)
    - k = jax.random.normal(jax.random.key(1), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
    - v = jax.random.normal(jax.random.key(2), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
    + q = jax.random.normal(jax.random.PRNGKey(0), (batch, seq_len, num_heads, head_dim), dtype=dtype)
    + k = jax.random.normal(jax.random.PRNGKey(1), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
    + v = jax.random.normal(jax.random.PRNGKey(2), (batch, seq_len, num_kv_heads, head_dim), dtype=dtype)
    q = jax.random.normal(jax.random.key(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    k = jax.random.normal(jax.random.key(1), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
    v = jax.random.normal(jax.random.key(2), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
jax.random.key is an alias for jax.random.PRNGKey and its usage is discouraged in new code. Please use jax.random.PRNGKey for clarity and future compatibility.
Additionally, this block duplicates tensor creation logic from make_qkv and test_gqa_decode. Consider refactoring make_qkv to accept separate q_len and kv_len to centralize this logic and improve maintainability.
Suggested change:

    - q = jax.random.normal(jax.random.key(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    - k = jax.random.normal(jax.random.key(1), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
    - v = jax.random.normal(jax.random.key(2), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
    + q = jax.random.normal(jax.random.PRNGKey(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    + k = jax.random.normal(jax.random.PRNGKey(1), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
    + v = jax.random.normal(jax.random.PRNGKey(2), (batch, kv_len, num_heads, head_dim), dtype=jnp.bfloat16)
    q = jax.random.normal(jax.random.key(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    k = jax.random.normal(jax.random.key(1), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
    v = jax.random.normal(jax.random.key(2), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
jax.random.key is an alias for jax.random.PRNGKey and its usage is discouraged in new code. Please use jax.random.PRNGKey for clarity and future compatibility.
As mentioned in test_decode, this is another instance of duplicated tensor creation logic. Refactoring make_qkv would benefit this test as well.
Suggested change:

    - q = jax.random.normal(jax.random.key(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    - k = jax.random.normal(jax.random.key(1), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
    - v = jax.random.normal(jax.random.key(2), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
    + q = jax.random.normal(jax.random.PRNGKey(0), (batch, 1, num_heads, head_dim), dtype=jnp.bfloat16)
    + k = jax.random.normal(jax.random.PRNGKey(1), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
    + v = jax.random.normal(jax.random.PRNGKey(2), (batch, kv_len, num_kv_heads, head_dim), dtype=jnp.bfloat16)
## Summary

- Add `dot_product_attention()` in `tx/layers/attention.py` that uses cuDNN on GPU for both causal (prefill) and non-causal (decode) attention, with XLA fallback for CPU/TPU
- Update Llama3 and Qwen3 models to use the shared attention function
- Add GPU CI workflow via Anyscale

## Test plan

- [ ] GPU tests verify cuDNN output matches XLA mask-based attention
- [ ] Tests cover: causal with padding, no padding, mixed lengths, GQA, decode (non-causal)
- [ ] CPU CI excludes GPU tests (`--ignore=tests/gpu`)

Benchmark results #891 (comment)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
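As a rough illustration of the equivalence check in the test plan, a GPU test might look like the sketch below; the shapes, tolerance, and mask construction are illustrative and assume a JAX version whose `jax.nn.dot_product_attention` supports `implementation="cudnn"` with `query_seq_lengths`/`key_value_seq_lengths`.

```python
import jax
import jax.numpy as jnp
import pytest

@pytest.mark.skipif(jax.default_backend() != "gpu", reason="requires a CUDA GPU with cuDNN")
def test_cudnn_matches_xla_causal():
    B, S, N, H = 2, 64, 4, 64
    q = jax.random.normal(jax.random.key(0), (B, S, N, H), dtype=jnp.bfloat16)
    k = jax.random.normal(jax.random.key(1), (B, S, N, H), dtype=jnp.bfloat16)
    v = jax.random.normal(jax.random.key(2), (B, S, N, H), dtype=jnp.bfloat16)
    seq_lengths = jnp.array([64, 40], dtype=jnp.int32)  # right-padded valid lengths

    out_cudnn = jax.nn.dot_product_attention(
        q, k, v, is_causal=True,
        query_seq_lengths=seq_lengths, key_value_seq_lengths=seq_lengths,
        implementation="cudnn",
    )

    # Reference: XLA path with an explicit causal + padding mask built from seq_lengths.
    pos = jnp.arange(S)
    valid = pos[None, :] < seq_lengths[:, None]                       # (B, S)
    mask = valid[:, None, None, :] & (pos[:, None] >= pos[None, :])   # (B, 1, S, S)
    out_xla = jax.nn.dot_product_attention(q, k, v, mask=mask, implementation="xla")

    # Compare only the valid (non-padded) query positions.
    err = jnp.abs(out_cudnn - out_xla) * valid[:, :, None, None]
    assert float(err.max()) < 2e-2
```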