[Docs] Add reinforcement learning example illustrating gpu-to-gpu RDT and GRPO. #57961
Conversation
@ray.remote
class ReplayBuffer:
    """Storage for scored trajectory slices."""
Can you expand on the docstring a bit more, for people who are not as familiar with RL? It would be good to say how sampling is done, when samples are kicked out, etc.
updated the docstrings. lmk what you think
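For reviewers without an RL background, here is a minimal sketch of what a self-describing replay buffer along these lines could look like. The uniform sampling with replacement, the FIFO eviction policy, and the `capacity` default are illustrative assumptions, not necessarily what the example ships:

```python
import random
from collections import deque

import ray


@ray.remote
class ReplayBuffer:
    """Storage for scored trajectory slices.

    `add()` appends slices; once `capacity` is reached, the oldest entries
    are evicted first (FIFO). `sample()` draws a batch uniformly at random
    with replacement from whatever is currently stored, so every retained
    slice is equally likely to be picked regardless of age.
    """

    def __init__(self, capacity: int = 10_000) -> None:
        self.buffer = deque(maxlen=capacity)  # deque handles FIFO eviction

    def add(self, slices: list) -> None:
        self.buffer.extend(slices)

    def sample(self, batch_size: int) -> list:
        return random.choices(self.buffer, k=batch_size)

    def size(self) -> int:
        return len(self.buffer)
```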
def __init__(self, replay_buffer) -> None:
    self.model = MLP().to("cuda")

    # Maintain a frozen EMA teacher of the policy for KL computation
I thought the original PPO paper suggested not using the KL divergence term?
I'd prefer to remove it, assuming this particular example still works. Have you tried running without it to see if it converges? If that doesn't work, keeping the reference one weight-update step behind seems simpler.
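For context on the two alternatives being weighed here, a minimal sketch; the names `decay`, `kl_coef`, and `clip_eps` are illustrative, not identifiers from the example:

```python
from typing import Optional

import torch


def update_ema_teacher(teacher: torch.nn.Module, policy: torch.nn.Module,
                       decay: float = 0.99) -> None:
    """Keep a frozen exponential-moving-average copy of the policy's weights."""
    with torch.no_grad():
        for t, p in zip(teacher.parameters(), policy.parameters()):
            t.mul_(decay).add_(p, alpha=1.0 - decay)


def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, logp_ref: Optional[torch.Tensor] = None,
              kl_coef: float = 0.0, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss on group-relative advantages, with optional KL penalty."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    if logp_ref is not None and kl_coef > 0.0:
        # k3 estimator of KL(policy || reference). Setting kl_coef=0 or
        # passing logp_ref=None drops the KL term entirely.
        log_ratio = logp_ref - logp_new
        loss = loss + kl_coef * (torch.exp(log_ratio) - log_ratio - 1.0).mean()
    return loss
```

The one-weight-update-behind alternative would replace the EMA update with a plain `copy.deepcopy(policy)` snapshot taken before each optimizer step, used to compute `logp_ref`.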
# Update the generator with new weights.
weights_updated_ref = generator.update_weights.remote(
    learner.get_weights.remote()
)
Bug: Generator Uses Stale Weights Due to Timing
Race condition in the training loop: `generator.generate.remote(states)` is called at line 385, before `ray.wait([weights_updated_ref])` at line 390. In every iteration except the first, the `generate` call is therefore queued on the single-threaded `Generator` actor before the previous weight update has been confirmed as complete, so the generator can end up sampling with stale weights. The `generate` call should be moved after the `ray.wait` to ensure the generator always uses the most recent weights.
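A sketch of the suggested reordering, as a fragment of the driver loop; `num_steps`, `states`, and `learner.train` are stand-ins, and `learner` and `generator` are assumed to be constructed as in the example:

```python
for step in range(num_steps):
    # Push the learner's latest weights to the generator and block until the
    # update has been applied.
    weights_updated_ref = generator.update_weights.remote(
        learner.get_weights.remote()
    )
    ray.wait([weights_updated_ref])

    # Only now enqueue generation, so it is guaranteed to see the new weights.
    trajectories = generator.generate.remote(states)
    learner.train.remote(trajectories)
```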
# -- Utilities --
def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
    """Sample unit vectors of shape [batch_size, dim] by normalizing Gaussian draws."""
    assert batch_size > 1, "Batch size must be greater than 1"
Bug: Unit Vector Function Restricts Single Batch Usage
The `sample_unit_vector` function's assertion `batch_size > 1` is overly restrictive. The normalization logic works correctly for `batch_size=1`, so the assertion unnecessarily limits the function's utility and can cause failures in valid use cases.
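A sketch of the relaxed utility, assuming the only intended precondition is a positive batch size (`STATE_DIM = 8` is a placeholder for the example's own constant):

```python
import torch

STATE_DIM = 8  # placeholder; the example defines its own value


def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
    """Sample unit vectors of shape [batch_size, dim] by normalizing Gaussian draws."""
    assert batch_size >= 1, "Batch size must be at least 1"
    v = torch.randn(batch_size, dim)
    # clamp_min guards against the (measure-zero) all-zeros draw.
    return v / v.norm(dim=-1, keepdim=True).clamp_min(1e-12)
```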
Hello! I tested the code on a cluster with 1 head and 1 worker node, each equipped with an A800 GPU. After testing different transport backends, I obtained the following results. According to the data, there is no significant difference in execution time among the various transport backends. Is this normal? Thank you!
Description
Example for first blog in the RDT series using NIXL for GPU-GPU tensor transfers.
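To make the moving pieces concrete, here is a heavily simplified sketch of the topology the example builds: a GPU learner, a GPU generator, and the replay buffer shown above. The `MLP` here is a stand-in for the example's model, and the `tensor_transport="nixl"` annotation reflects the RDT API as documented for recent Ray releases; treat the exact spelling as an assumption and check it against your Ray version.

```python
import ray
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Stand-in policy network; the example defines its own."""

    def __init__(self, dim: int = 8) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


@ray.remote(num_gpus=1)
class Learner:
    def __init__(self) -> None:
        self.model = MLP().to("cuda")

    # Returning CUDA tensors from a method annotated for NIXL transport lets
    # Ray move them GPU-to-GPU instead of staging through the object store.
    @ray.method(tensor_transport="nixl")
    def get_weights(self) -> dict:
        return {k: v.detach() for k, v in self.model.state_dict().items()}


@ray.remote(num_gpus=1)
class Generator:
    def __init__(self) -> None:
        self.model = MLP().to("cuda")

    def update_weights(self, weights: dict) -> None:
        self.model.load_state_dict(weights)


if __name__ == "__main__":
    learner = Learner.remote()
    generator = Generator.remote()
    # Weights flow from the learner's GPU to the generator's GPU directly.
    ray.get(generator.update_weights.remote(learner.get_weights.remote()))
```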