
Fix xLAM-70b OOM: reduce gpu_memory_utilization #178

Merged: dzorlu merged 1 commit into main from fix/xlam-70b-oom on Feb 14, 2026

Conversation


dzorlu (Collaborator) commented on Feb 14, 2026

Problem

xLAM-70b training hits an OOM during the backward pass:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.37 GiB.

The OOM occurs in an MLP layer during gradient-checkpointing recomputation; peak activation memory is too high.

Fix

  • Reduce policy_mini_batch_size from 16 → 8 (see the config sketch below)
  • This reduces the number of gradient accumulation steps and the peak activation memory during the backward pass

Note: gradient_checkpointing is already enabled by default in ppo_base_config.yaml.
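The actual xLAM-70b training config is not shown in this PR, so the following is only a minimal sketch of the change: the file path and surrounding structure are hypothetical, and only the policy_mini_batch_size key and the 16 → 8 change come from the PR description.

```yaml
# Hypothetical config path -- the real xLAM-70b training config is not shown in this PR.
# Defaults such as gradient_checkpointing: true are inherited from ppo_base_config.yaml.
policy_mini_batch_size: 8   # was 16; fewer gradient-accumulation steps, lower peak activation memory
```

Per the PR description, halving this value reduces the number of gradient-accumulation steps per optimizer update and lowers the peak activation memory held during the backward pass, which is where the 2.37 GiB allocation was failing.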

Test plan

  • Re-launch xLAM-70b training
  • Verify step 1 completes without OOM

🤖 Generated with Claude Code

Commit message:

v0.1.1: OOM during backward pass due to activation memory.
Reduce policy_mini_batch_size to decrease gradient accumulation steps
and peak activation memory during backward pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dzorlu merged commit 9e2b322 into main on Feb 14, 2026
3 checks passed
