Dear Authors,
First, thank you for your excellent work! In the paper, you mentioned using a small amount of data for supervised fine-tuning (SFT) warm-up before the RL phase. Could you kindly clarify:
- Data Scale: What is the specific data size used for the SFT warm-up phase?
- Proportion to RL Data: What percentage does this warm-up data account for relative to the total data used in the RL training stage?
- Impact Analysis: How does varying the SFT-to-RL data ratio (higher or lower) affect the final RL evaluation metrics? Have you observed any notable patterns or thresholds in your experiments?
Clarifying these points would greatly help readers understand the relationship between the warm-up strategy and RL optimization.
Thank you for your time and insights!