[Docs] Add reinforcement learning example illustrating gpu-to-gpu RDT and GRPO. #57961
Conversation
@ray.remote
class ReplayBuffer:
    """Storage for scored trajectory slices."""
Can you expand on the docstring a bit more, for people who are not as familiar with RL? It would be good to say how sampling is done, when samples are kicked out, etc.
updated the docstrings. lmk what you think
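For reviewers without an RL background, here is a minimal sketch of what a self-describing replay buffer along these lines could look like. The uniform sampling with replacement, the FIFO eviction policy, and the `capacity` default are illustrative assumptions, not necessarily what the example ships:

```python
import random
from collections import deque

import ray


@ray.remote
class ReplayBuffer:
    """Storage for scored trajectory slices.

    `add()` appends slices; once `capacity` is reached, the oldest entries
    are evicted first (FIFO). `sample()` draws a batch uniformly at random
    with replacement from whatever is currently stored, so every retained
    slice is equally likely to be picked regardless of age.
    """

    def __init__(self, capacity: int = 10_000) -> None:
        self.buffer = deque(maxlen=capacity)  # deque handles FIFO eviction

    def add(self, slices: list) -> None:
        self.buffer.extend(slices)

    def sample(self, batch_size: int) -> list:
        return random.choices(self.buffer, k=batch_size)

    def size(self) -> int:
        return len(self.buffer)
```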
def __init__(self, replay_buffer) -> None:
    self.model = MLP().to("cuda")

    # Maintain a frozen EMA teacher of the policy for KL computation
I thought the original PPO paper suggested not using the KL divergence term?
I'd prefer to remove it, assuming this particular example still works. Have you tried running without it to see if it converges? If that doesn't work, keeping the reference one weight-update step behind seems simpler.
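For context on the two alternatives being weighed here, a minimal sketch; the names `decay`, `kl_coef`, and `clip_eps` are illustrative, not identifiers from the example:

```python
from typing import Optional

import torch


def update_ema_teacher(teacher: torch.nn.Module, policy: torch.nn.Module,
                       decay: float = 0.99) -> None:
    """Keep a frozen exponential-moving-average copy of the policy's weights."""
    with torch.no_grad():
        for t, p in zip(teacher.parameters(), policy.parameters()):
            t.mul_(decay).add_(p, alpha=1.0 - decay)


def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, logp_ref: Optional[torch.Tensor] = None,
              kl_coef: float = 0.0, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss on group-relative advantages, with optional KL penalty."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    if logp_ref is not None and kl_coef > 0.0:
        # k3 estimator of KL(policy || reference). Setting kl_coef=0 or
        # passing logp_ref=None drops the KL term entirely.
        log_ratio = logp_ref - logp_new
        loss = loss + kl_coef * (torch.exp(log_ratio) - log_ratio - 1.0).mean()
    return loss
```

The one-weight-update-behind alternative would replace the EMA update with a plain `copy.deepcopy(policy)` snapshot taken before each optimizer step, used to compute `logp_ref`.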
# Update the generator with new weights.
weights_updated_ref = generator.update_weights.remote(
    learner.get_weights.remote()
)
Bug: Generator Uses Stale Weights Due to Timing
Race condition in the training loop: `generator.generate.remote(states)` is called at line 385, before `ray.wait([weights_updated_ref])` at line 390. In every iteration except the first, the `generate` call is therefore queued on the single-threaded `Generator` actor before the previous weight update has been confirmed as complete, so the generator can end up sampling with stale weights. The `generate` call should be moved after the `ray.wait` to ensure the generator always uses the most recent weights.
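A sketch of the suggested reordering, as a fragment of the driver loop; `num_steps`, `states`, and `learner.train` are stand-ins, and `learner` and `generator` are assumed to be constructed as in the example:

```python
for step in range(num_steps):
    # Push the learner's latest weights to the generator and block until the
    # update has been applied.
    weights_updated_ref = generator.update_weights.remote(
        learner.get_weights.remote()
    )
    ray.wait([weights_updated_ref])

    # Only now enqueue generation, so it is guaranteed to see the new weights.
    trajectories = generator.generate.remote(states)
    learner.train.remote(trajectories)
```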
# -- Utilities --
def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
    """Sample unit vectors of shape [batch_size, dim] by normalizing Gaussian draws."""
    assert batch_size > 1, "Batch size must be greater than 1"
Bug: Unit Vector Function Restricts Single Batch Usage
The `sample_unit_vector` function's assertion `batch_size > 1` is overly restrictive. The normalization logic works correctly for `batch_size=1`, so the assertion unnecessarily limits the function's utility and can cause failures in valid use cases.
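A sketch of the relaxed utility, assuming the only intended precondition is a positive batch size (`STATE_DIM = 8` is a placeholder for the example's own constant):

```python
import torch

STATE_DIM = 8  # placeholder; the example defines its own value


def sample_unit_vector(batch_size: int, dim: int = STATE_DIM) -> torch.Tensor:
    """Sample unit vectors of shape [batch_size, dim] by normalizing Gaussian draws."""
    assert batch_size >= 1, "Batch size must be at least 1"
    v = torch.randn(batch_size, dim)
    # clamp_min guards against the (measure-zero) all-zeros draw.
    return v / v.norm(dim=-1, keepdim=True).clamp_min(1e-12)
```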
Hello! I tested the code on a cluster with 1 head and 1 worker node, each equipped with an A800 GPU. After testing different transport backends, I obtained the following results. According to the data, there is no significant difference in execution time among the various transport backends. Is this normal? Thank you!
Description
Example for first blog in the RDT series using NIXL for GPU-GPU tensor transfers.
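To make the moving pieces concrete, here is a heavily simplified sketch of the topology the example builds: a GPU learner, a GPU generator, and the replay buffer shown above. The `MLP` here is a stand-in for the example's model, and the `tensor_transport="nixl"` annotation reflects the RDT API as documented for recent Ray releases; treat the exact spelling as an assumption and check it against your Ray version.

```python
import ray
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Stand-in policy network; the example defines its own."""

    def __init__(self, dim: int = 8) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


@ray.remote(num_gpus=1)
class Learner:
    def __init__(self) -> None:
        self.model = MLP().to("cuda")

    # Returning CUDA tensors from a method annotated for NIXL transport lets
    # Ray move them GPU-to-GPU instead of staging through the object store.
    @ray.method(tensor_transport="nixl")
    def get_weights(self) -> dict:
        return {k: v.detach() for k, v in self.model.state_dict().items()}


@ray.remote(num_gpus=1)
class Generator:
    def __init__(self) -> None:
        self.model = MLP().to("cuda")

    def update_weights(self, weights: dict) -> None:
        self.model.load_state_dict(weights)


if __name__ == "__main__":
    learner = Learner.remote()
    generator = Generator.remote()
    # Weights flow from the learner's GPU to the generator's GPU directly.
    ray.get(generator.update_weights.remote(learner.get_weights.remote()))
```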