
Issues with the current evaluation process #119

@lingzhq


Summary

During my experiments, I've observed that the evaluation process reuses the same workflow as the exploration/rollout phase. This practice may compromise the objectivity of the evaluation results, leading to discrepancies with third-party benchmarks and potentially misrepresenting the model's true capabilities.

Experimental Setup & Initial Findings

My key experimental configuration is as follows:

algorithm:
  algorithm_type: grpo
  repeat_times: 8

model:
  model_path: '/PATH/Qwen2.5-1.5B-Instruct'

buffer:
  total_epochs: 1
  batch_size: 96
  explorer_input:
    taskset:
      name: hendrycks/math
      rollout_args:
        temperature: 1.0
    eval_tasksets:
      name: math500
    default_workflow_type: 'math_workflow'
  
explorer:
  eval_interval: 12
synchronizer:
  sync_interval: 3

Under this setup, the MATH500 accuracy reported by Trinity's internal evaluation is shown below:

[Image: MATH500 accuracy reported by Trinity's internal evaluation during training]

Discrepancy with Third-Party Benchmarks

To assess the model's general capabilities, I compared against a mainstream, objective third-party evaluation framework. The results are shown below. It is worth noting that the raw model's MATH500 accuracy was 0.538 (which aligns with official technical reports and recent research) and that performance actually degrades during training, which differs significantly from Trinity's internal evaluation (~0.2 for the raw model).

| model | gsm8k_acc | math500_acc | minerva_math_acc | olympiadbench_acc | avg_acc |
| --- | --- | --- | --- | --- | --- |
| global_step_0 | 73 | 53.8 | 18 | 19.9 | 41.175 |
| global_step_20 | 73 | 52.4 | 15.3 | 19.1 | 39.95 |
| global_step_40 | 74.5 | 50.1 | 15.1 | 18.4 | 39.525 |
| global_step_60 | 73.7 | 51.2 | 15.9 | 18.5 | 39.825 |
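
(avg_acc appears to be the unweighted mean of the four benchmark accuracies; for example, (73 + 53.8 + 18 + 19.9) / 4 = 41.175 for global_step_0.)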

Hypothesis & Verification

Hypothesis: The discrepancy likely arises because Trinity's evaluation reuses the math_workflow from the rollout process, so the evaluation prompts are identical to the training prompts. The accuracy gains observed in the internal evaluation are therefore likely due to the model overfitting to this specific prompt template, rather than to a genuine improvement in its mathematical reasoning ability.

Verification: To test this hypothesis, I replaced the original math_workflow prompt template with the official evaluation prompt template for Qwen2.5 and re-ran the experiment. The results of this new experiment are shown below; the baseline model's score is now ~0.42, which is much closer to the third-party benchmark result.

[Image: MATH500 accuracy from Trinity's internal evaluation after switching to the Qwen2.5 official evaluation prompt template]
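
For concreteness, here is a minimal sketch of the kind of template swap described above. The exact strings are assumptions for illustration: I don't know the literal template used by Trinity's math_workflow, and the Qwen2.5-style prompt below reflects the commonly used \boxed{} evaluation setup rather than a confirmed Trinity API.

```python
# Illustrative sketch only: these template strings are assumptions, not the
# literal templates used by Trinity's math_workflow or by any official harness.

# Hypothetical rollout/training-style prompt (what math_workflow might apply):
TRAIN_TEMPLATE = (
    "{question}\n"
    "Let's think step by step and put the final answer at the end."
)

# Qwen2.5-style evaluation prompt: instruction in the system message,
# final answer requested inside \boxed{}.
EVAL_SYSTEM_PROMPT = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


def build_train_messages(question: str) -> list:
    """Messages in the (hypothetical) training/rollout prompt style."""
    return [{"role": "user", "content": TRAIN_TEMPLATE.format(question=question)}]


def build_eval_messages(question: str) -> list:
    """Messages in the independent, benchmark-style evaluation prompt."""
    return [
        {"role": "system", "content": EVAL_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```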

Potential Suggestions & WIP

  • Keep the current workflow-specific evaluation to track training adaptation.
  • Add a new, independent evaluation workflow that uses a standard, objective prompt template for measuring true benchmark performance (a rough sketch follows this list).
  • I'm also exploring integrating the third-party evaluation framework into Trinity directly.
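
As a starting point for the second suggestion, below is a rough sketch of what a decoupled benchmark-evaluation workflow could look like. All names here (BenchmarkEvalWorkflow, generate, extract_answer) are hypothetical placeholders rather than Trinity's actual API; the point is only that the evaluation prompt template is fixed and independent of the rollout workflow.

```python
# Hypothetical sketch of an evaluation workflow that is decoupled from the
# rollout/training prompt template. Names are placeholders, not Trinity's API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalTask:
    question: str
    answer: str


class BenchmarkEvalWorkflow:
    """Evaluate with a fixed, standard prompt template so that reported
    accuracy reflects benchmark performance rather than adaptation to the
    training prompt."""

    SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."

    def __init__(
        self,
        generate: Callable[[List[Dict[str, str]]], str],  # model inference (assumed)
        extract_answer: Callable[[str], str],             # e.g. parse the last \boxed{...}
    ):
        self.generate = generate
        self.extract_answer = extract_answer

    def run(self, tasks: List[EvalTask]) -> float:
        correct = 0
        for task in tasks:
            messages = [
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": task.question},
            ]
            prediction = self.extract_answer(self.generate(messages))
            correct += int(prediction == task.answer)
        return correct / max(len(tasks), 1)  # accuracy on the eval taskset
```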
