Conversation

@crypdick
Contributor

@crypdick crypdick commented Oct 21, 2025

Description

Example for first blog in the RDT series using NIXL for GPU-GPU tensor transfers.

@crypdick crypdick added the do-not-merge Do not merge this PR! label Oct 21, 2025
@stephanie-wang stephanie-wang self-assigned this Oct 21, 2025

@ray.remote
class ReplayBuffer:
"""Storage for scored trajectory slices."""
Contributor

Can you expand on the docstring a bit, for people who are less familiar with RL?

It would be good to say how sampling is done, when samples are evicted, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the docstrings. lmk what you think
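For readers skimming the thread, here is a minimal sketch of what such an expanded docstring could cover, assuming a fixed-capacity FIFO buffer with uniform random sampling. The capacity, method names, and eviction policy are illustrative, not necessarily what the example's final code does:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of scored trajectory slices.

    Behavior sketched here (assumptions, not the example's exact code):
    - `add` appends new slices; once `capacity` is reached, the oldest
      slices are evicted first (FIFO, via `deque(maxlen=...)`).
    - `sample` draws a batch uniformly at random, without replacement
      when enough slices are stored, with replacement otherwise.
    In the example this class is wrapped with `@ray.remote` so that the
    generator and learner actors share a single buffer.
    """

    def __init__(self, capacity: int = 1024) -> None:
        self._slices = deque(maxlen=capacity)  # oldest evicted first

    def add(self, scored_slices) -> None:
        self._slices.extend(scored_slices)

    def sample(self, batch_size: int):
        population = list(self._slices)
        if len(population) >= batch_size:
            return random.sample(population, batch_size)  # no replacement
        return random.choices(population, k=batch_size)   # with replacement
```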

def __init__(self, replay_buffer) -> None:
self.model = MLP().to("cuda")

# Maintain a frozen EMA teacher of the policy for KL computation
Contributor


I thought the original PPO paper suggested not using the KL divergence term?

I'd prefer to remove it, assuming this particular example still works. Have you tried running without it to see if it will converge? If that doesn't work, keeping the teacher one weight update behind seems simpler.
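For context, the "frozen EMA teacher" pattern the comment refers to typically means keeping a copy of the policy whose parameters trail the live ones via an exponential moving average, and computing the KL penalty against that copy. A minimal sketch of the EMA update (the decay value and function name are assumptions, not the example's code):

```python
import torch


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, policy: torch.nn.Module,
               decay: float = 0.99) -> None:
    """Blend policy weights into the frozen teacher: t <- decay*t + (1-decay)*p."""
    for t, p in zip(teacher.parameters(), policy.parameters()):
        t.mul_(decay).add_(p, alpha=1.0 - decay)
```

Copying the policy into the teacher once per learner step instead of blending would give the simpler "one weight update behind" variant suggested above.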

@crypdick crypdick force-pushed the doc/rl-rdt-contextual-bandits branch from 481cb0d to 685082b Compare October 22, 2025 04:26
Ricardo Decal added 24 commits October 24, 2025 18:35
Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Ricardo Decal added 3 commits October 27, 2025 15:13
Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Ricardo Decal added 2 commits October 27, 2025 16:04
Signed-off-by: Ricardo Decal <public@ricardodecal.com>
# Update the generator with new weights.
weights_updated_ref = generator.update_weights.remote(
learner.get_weights.remote()
)

Bug: Generator Uses Stale Weights Due to Timing

Race condition in the training loop: generator.generate.remote(states) is called at line 385, before ray.wait([weights_updated_ref]) at line 390. In every iteration except the first, the generate call is therefore queued on the single-threaded Generator actor before the previous weight update is confirmed complete, so the generator may act on stale weights. The generate call should be moved after the ray.wait to guarantee the generator always uses the most recent weights.
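The fix amounts to reordering two calls in the loop. Here is a toy simulation of the corrected ordering, with a plain class standing in for the single-threaded Ray actor (all names here are hypothetical):

```python
class FakeGenerator:
    """Toy stand-in for the single-threaded Generator actor: like a Ray
    actor, it executes calls strictly in the order they are submitted."""

    def __init__(self) -> None:
        self.weights_version = 0
        self.versions_used = []

    def update_weights(self, version: int) -> None:
        self.weights_version = version

    def generate(self, states):
        # Record which weights this batch was generated with.
        self.versions_used.append(self.weights_version)
        return states


def train_loop(gen: FakeGenerator, steps: int) -> None:
    for step in range(1, steps + 1):
        # Corrected ordering: confirm the weight update has completed
        # (with Ray: `ray.wait([weights_updated_ref])`) *before*
        # submitting the next generate call.
        gen.update_weights(step)
        gen.generate([0.0])
```

With the original ordering, `generate` for step N would be enqueued before the step-N weight update was confirmed, so `versions_used` could lag one version behind.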


Signed-off-by: Ricardo Decal <public@ricardodecal.com>
# -- Utilities --
def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
"""Sample unit vectors of shape [batch_size, dim] by normalizing Gaussian draws."""
assert batch_size > 1, "Batch size must be greater than 1"

Bug: Unit Vector Function Restricts Single Batch Usage

The sample_unit_vector function's assertion batch_size > 1 is overly restrictive. The vector normalization logic works correctly for batch_size=1, unnecessarily limiting the function's utility and potentially causing failures in valid use cases.
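A sketch of the relaxed version, keeping the normalization but allowing batch_size=1 (the STATE_DIM value here is a placeholder, not the example's constant):

```python
import torch

STATE_DIM = 8  # placeholder; the example defines its own constant


def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
    """Sample unit vectors of shape [batch_size, dim] by normalizing
    Gaussian draws. Works for any batch_size >= 1."""
    assert batch_size >= 1, "Batch size must be at least 1"
    draws = torch.randn(batch_size, dim)
    # Normalize each row to unit L2 norm.
    return draws / draws.norm(dim=-1, keepdim=True)
```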


Ricardo Decal added 2 commits October 29, 2025 13:11
Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Signed-off-by: Ricardo Decal <public@ricardodecal.com>

@stephanie-wang stephanie-wang enabled auto-merge (squash) November 2, 2025 19:13
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 2, 2025
@stephanie-wang stephanie-wang merged commit c4f0c24 into master Nov 2, 2025
8 checks passed
@stephanie-wang stephanie-wang deleted the doc/rl-rdt-contextual-bandits branch November 2, 2025 20:24
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
… and GRPO. (ray-project#57961)

## Description
Example for first blog in the RDT series using NIXL for GPU-GPU tensor
transfers.

---------

Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Ricardo Decal <public@ricardodecal.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
@daiping8
Contributor

daiping8 commented Nov 19, 2025

Hello! I tested the code on a cluster with one head node and one worker node, each equipped with an A800 GPU. After testing the different transport backends, I obtained the following results. The data show no significant difference in execution time among the transport backends. Is this normal? Thank you!

================================================================================
SUMMARY RESULTS
================================================================================
steps: 500
Transport       Init (s)     Steps (s)    Avg Step (s)    Total (s)   
--------------------------------------------------------------------------------
nixl            3.6700       145.9446     0.2919          149.6147    
nccl            3.6614       145.8298     0.2917          149.4911    
gloo            3.6215       144.9413     0.2899          148.5628    
object_store    3.5330       147.6611     0.2953          151.1940    
================================================================================
  • Init (s) - Initialization Time (seconds): one-time setup cost before training begins, including creating the Ray actors (ReplayBuffer, Learner, Scorer, Generator), initializing model weights, and pre-filling the replay buffer. Not included in the training-step timing.
  • Steps (s) - Total Training Steps Time (seconds): total time across all training steps, measured from the start of the first step to the end of the last (the number of steps is set by the --steps parameter). This metric reflects the performance of the actual training process.
  • Avg Step (s) - Average Time Per Step (seconds): Steps (s) divided by the number of training steps; useful for comparing per-step efficiency across the different transport backends.
  • Total (s) - Total Time (seconds): end-to-end duration of the whole run, covering everything from actor creation through all training and resource cleanup (Init plus Steps).
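For reference, the four columns relate as Total = Init + Steps and Avg Step = Steps / --steps. A tiny helper that reproduces the summary arithmetic from three wall-clock timestamps (names hypothetical, not from the benchmark script):

```python
def summarize(init_start: float, train_start: float,
              train_end: float, steps: int) -> dict:
    """Derive the summary columns from three wall-clock timestamps."""
    init_s = train_start - init_start    # one-time setup
    steps_s = train_end - train_start    # all training steps
    return {
        "Init (s)": init_s,
        "Steps (s)": steps_s,
        "Avg Step (s)": steps_s / steps,
        "Total (s)": init_s + steps_s,
    }
```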

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
