
@justinvyu (Contributor) commented Dec 31, 2025

Summary

This PR refactors the training loop structure in both PolicyWorkerBase and CriticWorkerBase to use a consistent two-level batching strategy (minibatch → microbatch) with a microbatch_weight used for gradient accumulation (a minimal sketch of the resulting loop shape follows the list below).

Refactored ppo_train to use a two-level loop: iterate over minibatches, then subdivide each minibatch into microbatches

  • Changed forward_backward signature from accumulation_steps: int to microbatch_weight: float
  • Loss is now scaled by microbatch_weight (i.e., 1.0 / num_microbatches) instead of dividing by accumulation_steps
  • Optimizer step is now called once per minibatch (after all microbatches are processed), rather than conditionally based on step count
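To make the new structure concrete, here is a minimal, self-contained sketch of the two-level loop with microbatch_weight scaling. It is not the actual SkyRL code: the model, data, and slicing are toy stand-ins, and only the spirit of the config names policy_mini_batch_size and micro_train_batch_size_per_gpu is borrowed from this PR.

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
data = torch.randn(16, 4)
target = torch.randn(16, 1)

policy_mini_batch_size = 8
micro_train_batch_size_per_gpu = 4

for start in range(0, len(data), policy_mini_batch_size):            # minibatch level
    minibatch = data[start : start + policy_mini_batch_size]
    mini_target = target[start : start + policy_mini_batch_size]

    # Chunk the minibatch into (possibly uneven) microbatches.
    offsets = list(range(0, len(minibatch), micro_train_batch_size_per_gpu))
    microbatch_weight = 1.0 / len(offsets)

    optimizer.zero_grad()
    for off in offsets:                                               # microbatch level
        mb = minibatch[off : off + micro_train_batch_size_per_gpu]
        mt = mini_target[off : off + micro_train_batch_size_per_gpu]
        loss = torch.nn.functional.mse_loss(model(mb), mt)
        (loss * microbatch_weight).backward()                         # scale before backward
    optimizer.step()                                                  # one optimizer step per minibatch
```

Scaling each microbatch loss by microbatch_weight before backward() makes the accumulated gradient equal to the gradient of the minibatch-mean loss (for equal-sized microbatches), which is why the optimizer step can move to the end of the minibatch loop.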

Motivation

The motivation for this PR is batch balancing, which I tried adding in #640.

The problem is that minibatch boundaries are not explicitly defined right now. We have two configurations, policy_mini_batch_size and micro_train_batch_size_per_gpu, and the minibatch is implicitly constructed by doing gradient accumulation over (policy_mini_batch_size // micro_train_batch_size_per_gpu) identically sized microbatches. #640 breaks the "same-sized microbatches" assumption, which is why we need to partition explicitly at the minibatch level first, before chunking into possibly uneven microbatches. That way, it's simpler to tell when to stop accumulating the gradient.

The introduction of a more general microbatch_weight is also motivated by the dynamic microbatch sizes introduced by #640. Each microbatch should contribute N_i / sum(N_j) * loss_i to the accumulated gradient; in the default case of equal-sized microbatches, this is just 1 / accumulation_steps.
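As a quick illustration (the sizes below are made up), the generalized weights reduce to 1 / accumulation_steps when all microbatches are the same size:

```python
# Each microbatch i of size n_i contributes n_i / sum(n_j) of the minibatch loss.
microbatch_sizes = [4, 4, 3]                        # uneven chunking, e.g. token-balanced
weights = [n / sum(microbatch_sizes) for n in microbatch_sizes]
print(weights)                                      # [0.3636..., 0.3636..., 0.2727...]

equal_sizes = [4, 4, 4]                             # the current default case
print([n / sum(equal_sizes) for n in equal_sizes])  # [0.333..., 0.333..., 0.333...] == 1 / accumulation_steps
```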

Testing

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the training loop to use a two-level batching strategy (minibatch → microbatch), which is a good architectural improvement for supporting uneven microbatches in the future. The core logic changes are sound, but I've identified a few critical issues. The most significant one is that in both policy and critic training loops, metrics are not being aggregated correctly across microbatches, leading to inaccurate logging. Only the status from the last microbatch of a minibatch is being recorded. Additionally, the optimizer step has been unintentionally removed from the critic's training_step method, which will affect tests relying on it. I've provided detailed comments and suggestions to address these points.

Comment on lines 732 to 738
```python
microbatch_iterator = BatchIterator(
    minibatch, sample_batch_size=self.cfg.trainer.micro_train_batch_size_per_gpu, drop_last=False
)
num_microbatches = len(microbatch_iterator)
microbatch_weight = 1.0 / num_microbatches

for microbatch in microbatch_iterator:
```
@justinvyu (Contributor, Author) commented:

The next step is basically just to change this microbatch iterator from one that's doing sample-based chunking to one that's token-based.
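For illustration only, a token-based chunker along those lines might look like the sketch below; the function name, signature, and budget parameter are hypothetical and not part of this PR.

```python
def chunk_by_tokens(sample_lengths, max_tokens_per_microbatch):
    """Greedily pack sample indices into microbatches under a token budget."""
    microbatches, current, current_tokens = [], [], 0
    for i, n_tokens in enumerate(sample_lengths):
        if current and current_tokens + n_tokens > max_tokens_per_microbatch:
            microbatches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n_tokens
    if current:
        microbatches.append(current)
    return microbatches

print(chunk_by_tokens([300, 500, 200, 800, 100], max_tokens_per_microbatch=1000))
# -> [[0, 1, 2], [3, 4]]
```

Because the resulting microbatches can contain different numbers of samples (and tokens), the microbatch_weight generalization above is what keeps the accumulated gradient correct.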

@justinvyu changed the title [skyrl-train] Refactor training loop structure to explicitly batch at two-levels (minibatch -> microbatch) [skyrl-train] Refactor training loop structure to explicitly batch at two levels (minibatch -> microbatch) Dec 31, 2025
```diff
         return self

-    def __next__(self) -> Experience:
+    def __next__(self) -> TrainingInputBatch:
```
@erictang000 (Collaborator) commented Dec 31, 2025:

This change also affects the use of BatchIterator for the megatron backend, which implements ppo_train differently from FSDP/DeepSpeed.

Could you make sure the conversion to Experience is also handled correctly for the megatron code path? Making sure one of these tests passes:

async def test_megatron_train(

is probably a good way to check this.

@justinvyu (Contributor, Author) replied:

Sounds good. I added a followup TODO to update the megatron worker's ppo loop as well. Made a minimal change for now to prevent this PR from getting too large.

@erictang000 (Collaborator) left a comment

Left a comment about megatron, and I think the gemini comments are worth a look: essentially we want to make sure that the metrics are aggregated in the same way before and after this PR. Maybe it would be nice to show, by running a test before and after this PR (maybe this one:

def test_ppo_train_basic_execution(ray_init_fixture, cfg, use_entropy_loss, use_kl_loss):

), that this PR doesn't change the metrics.

@justinvyu (Contributor, Author) commented:
To sanity check that this PR didn't introduce any regressions for metrics, I printed the output status from this test: pytest tests/gpu/gpu_ci/test_ppo_train.py::test_gradient_accumulation_scenarios[accumulation_calculation] -s

Printed status on master:

train_status={'final_loss': 0.0018931262311525643, 'policy_loss': -0.0024274957249872386, 'ppo_clip_ratio': 0.0, 'policy_entropy': 5.9921875, 'policy_kl': 10.0, 'policy_lr': 9.999999974752427e-07, 'raw_grad_norm': 0.33137789368629456, 'policy_update_steps': 1.0}, actual_optimizer_steps=1.0

Printed status with this PR:

train_status={'final_loss': 0.0018931262311525643, 'policy_loss': -0.0024274957249872386, 'ppo_clip_ratio': 0.0, 'policy_entropy': 5.9921875, 'policy_kl': 10.0, 'policy_lr': 9.999999974752427e-07, 'raw_grad_norm': 0.33137789368629456, 'policy_update_steps': 1}, actual_optimizer_steps=1

@justinvyu (Contributor, Author) commented:

Tested the Megatron code path as well:
uv run --isolated --extra dev --extra mcore -- pytest tests/gpu/gpu_ci/test_megatron_worker.py::test_megatron_train[tp2_pp2_policy_seq_packing]


megatron results:  {'final_loss': -0.021974159637466073, 'policy_loss': -0.030495601160509977, 'policy_lr': 9.999999974752427e-07, 'ppo_clip_ratio': 0.0, 'policy_entropy': 8.893153190612793, 'policy_kl': 8.521441221237183, 'raw_grad_norm': 4.0175676345825195, 'policy_update_steps': 1}

fsdp results:  {'final_loss': -0.021487861638888717, 'policy_loss': -0.030021082626679796, 'ppo_clip_ratio': 0.0, 'policy_entropy': 0.5429687462747097, 'policy_kl': 8.533220887184143, 'raw_grad_norm': 3.5146021320670116, 'policy_lr': 9.999999974752427e-07, 'policy_update_steps': 4}

@justinvyu requested a review from erictang000 January 6, 2026 00:37
@erictang000 (Collaborator) commented:

/gemini review

@gemini-code-assist bot left a comment

Code Review

This PR refactors the training loops in PolicyWorkerBase and CriticWorkerBase to use a two-level batching structure (minibatch -> microbatch), which is a great improvement for supporting batch balancing and dynamic microbatch sizes. The changes are well-motivated and mostly well-executed.

My feedback focuses on a few key areas:

  • An inconsistency in megatron_worker.py where the new training loop structure has not been applied.
  • A potential bug in the calculation of critic_update_steps.
  • Several opportunities for minor refactoring to improve code clarity and reduce duplication, such as extracting helper methods for status recording and memory snapshotting.
  • Identifying potentially unused code marked with TODO comments that should be cleaned up.

Overall, this is a solid refactoring. Addressing these points will improve the consistency and maintainability of the codebase.

@erictang000 (Collaborator) left a comment

looking good! I think just one super minor gemini comment, and we need to fix the failing cpu test and we should be good to merge this

cpu test:

=========================== short test summary info ============================
FAILED tests/cpu/test_trainer.py::test_ppo_train_batch_calculations - TypeError: test_ppo_train_batch_calculations.<locals>.mock_policy_forward_backward() got an unexpected keyword argument 'microbatch_weight'
============ 1 failed, 353 passed, 83 warnings in 101.86s (0:01:41) ============
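For reference, a minimal sketch of the kind of change the failing mock needs; only the function name and the microbatch_weight keyword come from the error message above, while the parameter name, body, and return value are assumptions.

```python
recorded_microbatch_weights = []

def mock_policy_forward_backward(batch, microbatch_weight: float = 1.0):
    # Accept the keyword that forward_backward now receives, and record it so the
    # test can assert on gradient-accumulation behavior.
    recorded_microbatch_weights.append(microbatch_weight)
    return {"policy_loss": 0.0}
```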


```diff
 status_mean = reduce_metrics(all_metrics)
-status_mean["policy_update_steps"] = policy_update_steps / accumulation_steps
+status_mean["policy_update_steps"] = num_minibatches * self.cfg.trainer.update_epochs_per_batch
```
@erictang000 (Collaborator) commented:

good catch...

@erictang000 (Collaborator) left a comment

🚀🚀🚀

@erictang000 merged commit 2a7a572 into NovaSky-AI:main Jan 6, 2026
3 checks passed
@justinvyu deleted the minibatch_refactor branch January 6, 2026 20:00