### Summary
During my experiments, I've observed that the evaluation process reuses the same workflow as the exploration/rollout phase. This practice may compromise the objectivity of the evaluation results, leading to discrepancies with third-party benchmarks and potentially misrepresenting the model's true capabilities.
### Experimental Setup & Initial Findings
My key experimental configuration is as follows:
```yaml
algorithm:
  algorithm_type: grpo
  repeat_times: 8
model:
  model_path: '/PATH/Qwen2.5-1.5B-Instruct'
buffer:
  total_epochs: 1
  batch_size: 96
  explorer_input:
    taskset:
      name: hendrycks/math
      rollout_args:
        temperature: 1.0
    eval_tasksets:
      - name: math500
    default_workflow_type: 'math_workflow'
explorer:
  eval_interval: 12
synchronizer:
  sync_interval: 3
```
Under this setup, the MATH500 accuracy reported by Trinity's internal evaluation is shown below:
### Discrepancy with Third-Party Benchmarks
To assess the model's general capabilities, I ran a comparison with a mainstream, objective third-party evaluation framework; the results are shown below. Notably, the raw model's MATH500 accuracy is 0.538 (consistent with official technical reports and recent research), and performance degrades during training. This differs significantly from Trinity's internal evaluation, which reports ~0.2 for the raw model.
| model | gsm8k_acc (%) | math500_acc (%) | minerva_math_acc (%) | olympiadbench_acc (%) | avg_acc (%) |
|---|---|---|---|---|---|
| global_step_0 | 73 | 53.8 | 18 | 19.9 | 41.175 |
| global_step_20 | 73 | 52.4 | 15.3 | 19.1 | 39.95 |
| global_step_40 | 74.5 | 50.1 | 15.1 | 18.4 | 39.525 |
| global_step_60 | 73.7 | 51.2 | 15.9 | 18.5 | 39.825 |
### Hypothesis & Verification
**Hypothesis:** The discrepancy likely arises because Trinity's evaluation reuses the `math_workflow` from the rollout process, so the evaluation prompts are identical to the training prompts. The accuracy gains observed in the internal evaluation therefore likely reflect the model overfitting to this specific prompt template rather than a genuine improvement in its mathematical reasoning abilities.
**Verification:** To test this hypothesis, I replaced the original `math_workflow` prompt template with the official evaluation prompt template for Qwen2.5 and re-ran the experiment. The results of this new experiment are shown below: the baseline model's score is now ~0.42, which is much closer to the third-party benchmark result.
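For context, here is a minimal sketch of how such an evaluation prompt can be built from the model's own chat template instead of the rollout workflow's template. The exact instruction text and its placement in the system turn are assumptions based on common Qwen2.5 math-eval setups, not a reproduction of Trinity's code or the official evaluation script:

```python
from transformers import AutoTokenizer

MODEL_PATH = "/PATH/Qwen2.5-1.5B-Instruct"  # same placeholder path as in the config above


def build_eval_prompt(question: str) -> str:
    """Render a benchmark question with the model's own chat template,
    rather than the training-time math_workflow template."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    messages = [
        # Commonly used instruction for Qwen2.5 math evaluation; assumed here.
        {
            "role": "system",
            "content": "Please reason step by step, and put your final answer within \\boxed{}.",
        },
        {"role": "user", "content": question},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


if __name__ == "__main__":
    print(build_eval_prompt("Compute 1 + 2 + ... + 10."))
```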
### Potential Suggestions & WIP
- Keep the current workflow-specific evaluation to track training adaptation.
- Add a new, independent evaluation workflow that uses a standard, objective prompt template for measuring true benchmark performance (a rough sketch follows after this list).
- I'm also exploring integrating the third-party evaluation framework into Trinity directly.
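As a rough illustration of the second suggestion, the sketch below shows an evaluation-only workflow that owns its own standard prompt template and a naive scorer. All names here (`IndependentMathEvalWorkflow`, `EvalTask`, `generate_fn`) are hypothetical and do not correspond to Trinity's actual classes or registration mechanism; the only point is that the evaluation prompt is decoupled from the rollout template:

```python
from dataclasses import dataclass
from typing import Callable, List

# Standard, benchmark-style instruction, intentionally independent of the
# training-time math_workflow template.
EVAL_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


@dataclass
class EvalTask:
    question: str
    answer: str


class IndependentMathEvalWorkflow:
    """Hypothetical evaluation-only workflow, decoupled from the rollout workflow."""

    def __init__(self, generate_fn: Callable[[str], str]):
        # generate_fn wraps the current policy checkpoint (e.g., a vLLM call).
        self.generate_fn = generate_fn

    def build_prompt(self, task: EvalTask) -> str:
        return f"{task.question}\n{EVAL_INSTRUCTION}"

    def run(self, tasks: List[EvalTask]) -> float:
        correct = 0
        for task in tasks:
            response = self.generate_fn(self.build_prompt(task))
            # Naive exact-match on the boxed answer; a real scorer would
            # normalize mathematical expressions before comparing.
            correct += int(f"\\boxed{{{task.answer}}}" in response)
        return correct / max(len(tasks), 1)


# Example usage with a stub generator:
wf = IndependentMathEvalWorkflow(lambda prompt: "... so the sum is \\boxed{55}")
print(wf.run([EvalTask("What is 1 + 2 + ... + 10?", "55")]))  # 1.0
```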
