4 changes: 2 additions & 2 deletions docs/source/grpo_trainer.md
@@ -243,7 +243,7 @@ For more information, see [Speeding up training with vLLM](speeding_up_training#
#### Dealing with the Training-Inference Mismatch
While vLLM greatly accelerates inference, it also decouples the inference engine from the training engine. In theory these engines are mathematically identical; in practice, however, they can produce different outputs due to precision effects and hardware-specific optimizations. This divergence reflects the distinct optimization goals of the two systems. Inference engines aim to maximize sampling throughput, typically measured in tokens per second, while maintaining acceptable sampling fidelity. Training frameworks instead focus on numerical stability and precision for gradient computation, often using higher precision formats like FP32 for master weights and optimizer states. These differing priorities and constraints introduce an inevitable, albeit subtle, mismatch between training and inference.

-This mismatch leads to a biased gradient update which has been observed to destabilize training ([[1]](https://fengyao.notion.site/off-policy-rl)[[2]](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)[[3]](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/#true-on-policy-rl)[[4]](https://arxiv.org/abs/2510.26788)[[5]](https://arxiv.org/abs/2510.18855)). For simplicity, consider the REINFORCE policy gradient:
+This mismatch leads to a biased gradient update which has been observed to destabilize training ([[1]](https://fengyao.notion.site/off-policy-rl)[[2]](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)[[3]](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/#true-on-policy-rl)[[4]](https://huggingface.co/papers/2510.26788)[[5]](https://huggingface.co/papers/2510.18855)). For simplicity, consider the REINFORCE policy gradient:

$$
\nabla_\theta \mathcal{J}(x,\theta)
@@ -269,7 +269,7 @@
$$

Under MIS, ratios larger than `vllm_importance_sampling_cap` are set to zero, so those samples do not contribute to the gradient. In other words, large ratio samples are downweighted under TIS and discarded under MIS. The configuration flag `vllm_importance_sampling_mode` chooses both the IS variant (masking or truncation) and the granularity (token level or sequence level).
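To make the TIS/MIS distinction concrete, here is a minimal sketch of a token-level correction. This is not TRL's implementation; the function name, argument names, and mode strings are illustrative only:

```python
import torch

def correct_ratios(train_logprobs, vllm_logprobs, cap, mode="masked"):
    """Illustrative token-level importance-sampling correction (not TRL's code)."""
    # Per-token ratios pi_train / pi_vllm, computed from log-probs of the same tokens.
    ratios = torch.exp(train_logprobs - vllm_logprobs)
    if mode == "truncated":
        # TIS: ratios above the cap are clipped but still contribute to the gradient.
        return torch.clamp(ratios, max=cap)
    if mode == "masked":
        # MIS: tokens whose ratio exceeds the cap are zeroed out and thus discarded.
        return torch.where(ratios <= cap, ratios, torch.zeros_like(ratios))
    raise ValueError(f"unknown mode: {mode}")
```

A sequence-level variant would sum the log-prob differences over the sequence before exponentiating, then apply the same cap to the single resulting ratio.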

-Importance sampling is the principled algorithmic response to the training–inference mismatch. However, there are also more direct approaches that attempt to reduce the mismatch between the two engines themselves. Most of these are engineering solutions. For example, [MiniMax M1 uses an FP32 language model head](https://arxiv.org/abs/2506.13585) in the inference engine. Thinking Machines has explored [deterministic inference kernels](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), although this comes with a significant efficiency cost. vLLM has shown [bitwise consistent policies](https://blog.vllm.ai/2025/11/10/bitwise-consistent-train-inference.html) by building on the batch invariant deterministic kernels from Thinking Machines, but as of November 2025 there remains a substantial throughput penalty relative to standard vLLM inference.
+Importance sampling is the principled algorithmic response to the training–inference mismatch. However, there are also more direct approaches that attempt to reduce the mismatch between the two engines themselves. Most of these are engineering solutions. For example, [MiniMax M1 uses an FP32 language model head](https://huggingface.co/papers/2506.13585) in the inference engine. Thinking Machines has explored [deterministic inference kernels](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), although this comes with a significant efficiency cost. vLLM has shown [bitwise consistent policies](https://blog.vllm.ai/2025/11/10/bitwise-consistent-train-inference.html) by building on the batch invariant deterministic kernels from Thinking Machines, but as of November 2025 there remains a substantial throughput penalty relative to standard vLLM inference.

### GRPO at scale: train a 70B+ Model on multiple nodes

6 changes: 3 additions & 3 deletions docs/source/peft_integration.md
@@ -822,6 +822,6 @@ model = AutoModelForCausalLM.from_pretrained(

### Research Papers

-- [LoRA Paper](https://arxiv.org/abs/2106.09685) - Original LoRA methodology and results
-- [QLoRA Paper](https://arxiv.org/abs/2305.14314) - Efficient finetuning with 4-bit quantization
-- [Prompt Tuning Paper](https://arxiv.org/abs/2104.08691) - The Power of Scale for Parameter-Efficient Prompt Tuning
+- [LoRA Paper](https://huggingface.co/papers/2106.09685) - Original LoRA methodology and results
+- [QLoRA Paper](https://huggingface.co/papers/2305.14314) - Efficient finetuning with 4-bit quantization
+- [Prompt Tuning Paper](https://huggingface.co/papers/2104.08691) - The Power of Scale for Parameter-Efficient Prompt Tuning
2 changes: 1 addition & 1 deletion trl/data_utils.py
@@ -602,7 +602,7 @@ class _SegmentTree:
A segment tree data structure that, when initialized as `_SegmentTree(maxval)`, efficiently finds the next larger
value for a given input within the range [1, maxval].

-    See [Fewer Truncations Improve Language Modeling](https://arxiv.org/abs/2404.10830) for more details.
+    See [Fewer Truncations Improve Language Modeling](https://huggingface.co/papers/2404.10830) for more details.
"""

def __init__(self, maxval: int):
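For intuition about the data structure documented in the diff above, here is a standalone toy sketch of a max segment tree over `[1, maxval]` that answers the query "smallest stored value >= q", which is the operation the docstring describes. It is not TRL's `_SegmentTree`; the class and method names are made up for illustration:

```python
class NextLargerTree:
    """Toy segment tree over [1, maxval]: add/remove values and find the
    smallest stored value >= query. Not TRL's implementation."""

    def __init__(self, maxval: int):
        size = 1
        while size < maxval:  # round up to a power of two for a clean layout
            size *= 2
        self.size = size
        self.tree = [0] * (2 * size)  # node i holds the max over its subtree

    def _update(self, val: int, stored: int) -> None:
        i = self.size + val - 1  # leaf index for value `val`
        self.tree[i] = stored
        i //= 2
        while i >= 1:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def add(self, val: int) -> None:
        self._update(val, val)

    def remove(self, val: int) -> None:
        self._update(val, 0)

    def search(self, query: int) -> int:
        """Smallest stored value >= query, or 0 if there is none."""
        if self.tree[1] < query:
            return 0
        i = 1
        while i < self.size:
            # Prefer the left child: its values are all smaller, so any hit
            # there is the smallest qualifying value overall.
            i = 2 * i if self.tree[2 * i] >= query else 2 * i + 1
        return self.tree[i]


# Example in the spirit of best-fit packing: find the smallest available
# capacity that still fits a sequence of length 700.
tree = NextLargerTree(1024)
tree.add(512)
tree.add(1024)
print(tree.search(700))  # -> 1024
```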
2 changes: 1 addition & 1 deletion trl/trainer/grpo_config.py
@@ -171,7 +171,7 @@ class GRPOConfig(TrainingArguments):
Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower-bound
specified in argument `epsilon`. Paper [DAPO](https://huggingface.co/papers/2503.14476) recommends `0.28`.
When used with `loss_type='cispo'`, this corresponds to the ε_max param specified in the [ScaleRL
-            paper](https://arxiv.org/pdf/2510.13786) and the recommended value is `5.0`.
+            paper](https://huggingface.co/papers/2510.13786) and the recommended value is `5.0`.
sapo_temperature_neg (`float`, *optional*, defaults to `1.05`):
Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. This parameter
is introduced in the [Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347).
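As a rough illustration of how the fields documented above combine, a minimal configuration sketch follows. It is not an excerpt from the TRL docs; `output_dir` is a placeholder, and only `loss_type="cispo"` with `epsilon_high=5.0` reflects the recommendation quoted in the docstring:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-cispo-run",  # placeholder path
    loss_type="cispo",
    epsilon_high=5.0,  # epsilon_max value recommended for CISPO by the ScaleRL paper
)
```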