[FlashRL 2/N] Support list of weights during weight sync for colocated training #161
Conversation
/gemini review
Code Review
This pull request introduces an important optimization for weight syncing in colocated training by batching weight updates, which should significantly improve efficiency. The changes to support lists of weights in `NamedWeightsUpdateRequest` and the corresponding updates across the codebase are well-implemented. The PR also includes two valuable bug fixes related to trajectory-based routing and remote server initialization.
I've identified a few issues that need attention:
- A critical typo in `vllm_engine.py` that will likely cause a `TypeError` during weight updates.
- A high-severity bug in `deepspeed_worker.py` where the batching logic for CUDA IPC transfers is not correctly implemented, as the batch size is not being tracked.
- A medium-severity return type mismatch in `sglang_engine.py`.
Addressing these points will ensure the new functionality is robust and works as expected. Overall, this is a great enhancement.
```python
if not success:
    raise RuntimeError(f"Update weight request failed with message: {message}")
return
```
The function `update_named_weights` is type-hinted to return `Tuple[bool, str]`, but this path returns `None` on success, which violates the function's contract. The IPC path correctly returns a tuple. To be consistent and correct, this path should return a tuple on success, for example `(True, "")`.
```diff
 if not success:
     raise RuntimeError(f"Update weight request failed with message: {message}")
-return
+return True, ""
```
What does this PR do?
Supports a list of weights during weight sync for colocated training. During colocated training, we use CUDA IPC for weight syncing. The current implementation syncs weights param by param, which can be pretty inefficient. In this PR, we sync tensors in batches whose size is a configurable parameter (default 1GB). That is, we collect IPC metadata until the total size of the underlying tensors reaches 1GB and forward it to the inference engine. Each TP rank will materialize all tensors in this list (i.e., additional memory usage of 1GB here) and issue a single `load_weights` call.
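As a rough illustration of the batching logic (a sketch, not the actual SkyRL-train code; `sync_weights_in_batches` and `send_batch_to_engine` are hypothetical names standing in for the trainer-side loop and the CUDA IPC forwarding path):

```python
from typing import Callable, Iterable, List, Tuple

import torch

GB = 1024**3


def sync_weights_in_batches(
    named_params: Iterable[Tuple[str, torch.Tensor]],
    send_batch_to_engine: Callable[[List[Tuple[str, torch.Tensor]]], None],
    batch_size_bytes: int = 1 * GB,  # configurable threshold from the description above
) -> None:
    """Accumulate named weights and flush them to the inference engine in ~1GB batches."""
    batch: List[Tuple[str, torch.Tensor]] = []
    batch_bytes = 0
    for name, tensor in named_params:
        batch.append((name, tensor))
        batch_bytes += tensor.numel() * tensor.element_size()
        # Flush once the accumulated tensors reach the threshold, so the receiving
        # TP rank only materializes ~1GB of extra memory per load_weights call.
        if batch_bytes >= batch_size_bytes:
            send_batch_to_engine(batch)
            batch, batch_bytes = [], 0
    if batch:
        send_batch_to_engine(batch)  # flush the remainder
```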
How much faster is it?
Even for a 14B model on an 8xH100 node (TP2), the weight sync time drops from around 4.4s to 1.6s (a ~60% reduction). This will matter much more for larger models.
This PR is needed for the FlashRL integration to work well, because we have a custom load-weights implementation that (long story short) allocates new storage in each call and also issues some `empty_cache` calls. Without batching, the load-weights call will be too slow in such cases. This PR reduces weight sync time for a 1.5B model with FlashRL from 5 mins to < 5s.
I've tested the PR with our E2E tests for colocated and non-colocated training and also tested the remote engine codepath.
This PR also makes the following changes:
- Fixes a bug introduced in #145 for the codepath with trajectory-based routing when `response_ids` is not returned by the engine.
- Fixes a bug introduced in #126 for starting remote servers. Importing `skyrl_train.utils.ppo_utils` triggers registering. In some cases, like with the vLLM server init, we will not call `sync_registries` and there will be an error. The solution is to import-guard `skyrl_train.utils.ppo_utils` unless the user imports it themselves (for custom functions) or goes through the main entrypoint (main -> `initialize_ray` -> sync); a minimal sketch of this pattern is shown after the TODO list.

TODO:
- [x] Verify non-colocated training works
- [x] Run e2e test
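To make the import-guard idea concrete, here is a minimal sketch assuming registration happens at import time of `skyrl_train.utils.ppo_utils`; `initialize_ray` and `sync_registries` are named above, but this body (including the assumption that `sync_registries` lives in `ppo_utils`) is illustrative, not the actual implementation:

```python
def initialize_ray(cfg) -> None:
    # Deferred import: `skyrl_train.utils.ppo_utils` registers functions as an
    # import side effect, so importing it here (instead of at module scope)
    # means registration only happens on the main entrypoint path, which also
    # syncs the registries. Standalone server init (e.g. the vLLM server)
    # never triggers the import.
    from skyrl_train.utils import ppo_utils  # noqa: F401

    # Assumed location of sync_registries, for illustration only.
    ppo_utils.sync_registries()
```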