
Fix xLAM-70b OOM: reduce gpu_memory_utilization #178

Merged: dzorlu merged 1 commit into main from fix/xlam-70b-oom on Feb 14, 2026

Conversation


dzorlu (Collaborator) commented on Feb 14, 2026

Problem

xLAM-70b training hits an OOM during the backward pass:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.37 GiB.

The OOM occurs in an MLP layer during gradient-checkpointing recomputation; peak activation memory is too high.

Fix

  • Reduce policy_mini_batch_size from 16 → 8 (see the config sketch below)
  • This reduces the number of gradient accumulation steps and the peak activation memory during the backward pass

Note: gradient_checkpointing is already enabled by default in ppo_base_config.yaml.
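The actual xLAM-70b training config is not shown in this PR, so the following is only a minimal sketch of the change: the file path and surrounding structure are hypothetical, and only the policy_mini_batch_size key and the 16 → 8 change come from the PR description.

```yaml
# Hypothetical config path -- the real xLAM-70b training config is not shown in this PR.
# Defaults such as gradient_checkpointing: true are inherited from ppo_base_config.yaml.
policy_mini_batch_size: 8   # was 16; fewer gradient-accumulation steps, lower peak activation memory
```

Per the PR description, halving this value reduces the number of gradient-accumulation steps per optimizer update and lowers the peak activation memory held during the backward pass, which is where the 2.37 GiB allocation was failing.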

Test plan

  • Re-launch xLAM-70b training
  • Verify step 1 completes without OOM

🤖 Generated with Claude Code

Commit message:

v0.1.1: OOM during backward pass due to activation memory.
Reduce policy_mini_batch_size to decrease gradient accumulation steps
and peak activation memory during backward pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dzorlu merged commit 9e2b322 into main on Feb 14, 2026
3 checks passed
