### Summary
During my experiments, I've observed that the evaluation process reuses the same workflow as the exploration/rollout phase. This practice may compromise the objectivity of the evaluation results, leading to discrepancies with third-party benchmarks and potentially misrepresenting the model's true capabilities.
### Experimental Setup & Initial Findings
My key experimental configuration is as follows:
```yaml
algorithm:
  algorithm_type: grpo
  repeat_times: 8
model:
  model_path: '/PATH/Qwen2.5-1.5B-Instruct'
buffer:
  total_epochs: 1
  batch_size: 96
  explorer_input:
    taskset:
      name: hendrycks/math
      rollout_args:
        temperature: 1.0
    eval_tasksets:
      - name: math500
    default_workflow_type: 'math_workflow'
explorer:
  eval_interval: 12
synchronizer:
  sync_interval: 3
```
Under this setup, the MATH500 accuracy reported by Trinity's internal evaluation is shown below:
### Discrepancy with Third-Party Benchmarks
To assess the model's general capabilities, I ran a comparison with a mainstream, objective third-party evaluation framework; the results are shown below. Notably, the raw model's MATH500 accuracy is 0.538 (consistent with official technical reports and recent research), and performance degrades during training. This differs significantly from Trinity's internal evaluation, which reports ~0.2 for the raw model.
| model | gsm8k_acc (%) | math500_acc (%) | minerva_math_acc (%) | olympiadbench_acc (%) | avg_acc (%) |
|---|---|---|---|---|---|
| global_step_0 | 73 | 53.8 | 18 | 19.9 | 41.175 |
| global_step_20 | 73 | 52.4 | 15.3 | 19.1 | 39.95 |
| global_step_40 | 74.5 | 50.1 | 15.1 | 18.4 | 39.525 |
| global_step_60 | 73.7 | 51.2 | 15.9 | 18.5 | 39.825 |
### Hypothesis & Verification
**Hypothesis:** The discrepancy likely arises because Trinity's evaluation reuses the `math_workflow` from the rollout process, so the evaluation prompts are identical to the training prompts. The accuracy gains observed in the internal evaluation therefore likely reflect the model overfitting to this specific prompt template rather than a genuine improvement in its mathematical reasoning abilities.
**Verification:** To test this hypothesis, I replaced the original `math_workflow` prompt template with the official evaluation prompt template for Qwen2.5 and re-ran the experiment. The results of this new experiment are shown below: the baseline model's score is now ~0.42, which is much closer to the third-party benchmark result.
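For context, here is a minimal sketch of how such an evaluation prompt can be built from the model's own chat template instead of the rollout workflow's template. The exact instruction text and its placement in the system turn are assumptions based on common Qwen2.5 math-eval setups, not a reproduction of Trinity's code or the official evaluation script:

```python
from transformers import AutoTokenizer

MODEL_PATH = "/PATH/Qwen2.5-1.5B-Instruct"  # same placeholder path as in the config above


def build_eval_prompt(question: str) -> str:
    """Render a benchmark question with the model's own chat template,
    rather than the training-time math_workflow template."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    messages = [
        # Commonly used instruction for Qwen2.5 math evaluation; assumed here.
        {
            "role": "system",
            "content": "Please reason step by step, and put your final answer within \\boxed{}.",
        },
        {"role": "user", "content": question},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


if __name__ == "__main__":
    print(build_eval_prompt("Compute 1 + 2 + ... + 10."))
```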
### Potential Suggestions & WIP
- Keep the current workflow-specific evaluation to track training adaptation.
- Add a new, independent evaluation workflow that uses a standard, objective prompt template for measuring true benchmark performance (a rough sketch follows after this list).
- I'm also exploring integrating the third-party evaluation framework into Trinity directly.
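As a rough illustration of the second suggestion, the sketch below shows an evaluation-only workflow that owns its own standard prompt template and a naive scorer. All names here (`IndependentMathEvalWorkflow`, `EvalTask`, `generate_fn`) are hypothetical and do not correspond to Trinity's actual classes or registration mechanism; the only point is that the evaluation prompt is decoupled from the rollout template:

```python
from dataclasses import dataclass
from typing import Callable, List

# Standard, benchmark-style instruction, intentionally independent of the
# training-time math_workflow template.
EVAL_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


@dataclass
class EvalTask:
    question: str
    answer: str


class IndependentMathEvalWorkflow:
    """Hypothetical evaluation-only workflow, decoupled from the rollout workflow."""

    def __init__(self, generate_fn: Callable[[str], str]):
        # generate_fn wraps the current policy checkpoint (e.g., a vLLM call).
        self.generate_fn = generate_fn

    def build_prompt(self, task: EvalTask) -> str:
        return f"{task.question}\n{EVAL_INSTRUCTION}"

    def run(self, tasks: List[EvalTask]) -> float:
        correct = 0
        for task in tasks:
            response = self.generate_fn(self.build_prompt(task))
            # Naive exact-match on the boxed answer; a real scorer would
            # normalize mathematical expressions before comparing.
            correct += int(f"\\boxed{{{task.answer}}}" in response)
        return correct / max(len(tasks), 1)


# Example usage with a stub generator:
wf = IndependentMathEvalWorkflow(lambda prompt: "... so the sum is \\boxed{55}")
print(wf.run([EvalTask("What is 1 + 2 + ... + 10?", "55")]))  # 1.0
```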
