Support additional evaluation frameworks #283

@bigximik

🎯 Goal (What & Why)

The goal is to support the evaluation suites that matter most for our experiments. Ideally, we would define a unified API (an improved version of the lm_eval_harness model wrapper from #282) and let contributors extend the integration as needed.

The main challenge is that evaluating during training, across different frameworks, requires passing the in-memory model to the target evaluation framework rather than saving it to disk and reloading it. Keeping the model in memory avoids changes in memory allocation and lets training resume seamlessly after an evaluation step. We have already implemented such a wrapper for lm_eval_harness in #282.
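To make the in-memory requirement concrete, here is a minimal sketch of the idea, not the wrapper from #282. All names (`TrainableModel`, `evaluation_mode`, `evaluate_in_memory`, `run_framework`) are hypothetical, and the model interface is an assumption for illustration only. The point is that the evaluation framework receives the same live object, and its mode is restored afterwards so training can continue.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Protocol


class TrainableModel(Protocol):
    """Hypothetical minimal model interface; NOT the Fast-LLM API."""

    training: bool

    def forward(self, tokens: list[int]) -> list[float]: ...


@contextmanager
def evaluation_mode(model: TrainableModel) -> Iterator[TrainableModel]:
    """Temporarily switch the live model to evaluation mode, then restore it."""
    was_training = model.training
    model.training = False
    try:
        yield model
    finally:
        # Restore the previous mode even if the framework raises,
        # so the training loop resumes in the state it left.
        model.training = was_training


def evaluate_in_memory(
    model: TrainableModel,
    run_framework: Callable[[TrainableModel], dict[str, float]],
) -> dict[str, float]:
    """Hand the same in-memory object to the framework (no save/reload)."""
    with evaluation_mode(model) as live_model:
        return run_framework(live_model)
```

Here `run_framework` stands in for whatever entry point a given evaluation framework exposes. A real integration would also need to handle distributed state, which this sketch ignores.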

Next steps:

  • Identify several key evaluation frameworks important to us.
  • Evaluate whether we can design a unified interface to integrate them as described above.
  • If a common interface isn't feasible, we may need to integrate each framework individually.
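As a starting point for that discussion, a unified interface could look roughly like the sketch below. This is an assumption about one possible shape, not a settled design: `EvaluationAdapter`, `register_adapter`, `run_evaluations`, and the toy adapter are all hypothetical names. Each framework would get its own adapter subclass, so frameworks that do not fit the common interface can still be integrated individually behind the same entry point.

```python
from abc import ABC, abstractmethod
from typing import Any


class EvaluationAdapter(ABC):
    """Hypothetical common interface; one subclass per evaluation framework."""

    name: str

    @abstractmethod
    def evaluate(self, model: Any, tasks: list[str]) -> dict[str, float]:
        """Run `tasks` against the in-memory `model` and return metric values."""


# Registry mapping a framework name to its adapter class.
_ADAPTERS: dict[str, type[EvaluationAdapter]] = {}


def register_adapter(cls: type[EvaluationAdapter]) -> type[EvaluationAdapter]:
    """Class decorator that makes an adapter discoverable by name."""
    _ADAPTERS[cls.name] = cls
    return cls


def run_evaluations(
    model: Any, requests: dict[str, list[str]]
) -> dict[str, dict[str, float]]:
    """Dispatch each {framework: tasks} request to its registered adapter."""
    results = {}
    for framework, tasks in requests.items():
        if framework not in _ADAPTERS:
            raise KeyError(f"No adapter registered for {framework!r}")
        results[framework] = _ADAPTERS[framework]().evaluate(model, tasks)
    return results


@register_adapter
class ConstantScoreAdapter(EvaluationAdapter):
    """Toy adapter used only to exercise the interface."""

    name = "toy"

    def evaluate(self, model: Any, tasks: list[str]) -> dict[str, float]:
        # A real adapter would call into its framework here.
        return {task: 1.0 for task in tasks}
```

With this shape, a training loop would call `run_evaluations(model, {"toy": ["task_a"]})` and receive metrics keyed by framework and task. Contributors would add support for a new framework by registering another adapter.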

🚀 Execution Plan

(This section may start as an incomplete draft but must be defined before implementation begins.)

Step 1: What is the smallest working version?

(Describe the simplest way to implement this feature with minimal effort.)

Step 2: What additional optimizations are possible (but optional)?

(List potential refinements that can be added in later PRs if needed.)

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
