
Run lm-eval-harness benchmarks during validation #199

@bigximik

Description


🎯 Goal (What & Why)

Enable Fast-LLM to run structured evaluations using lm-eval-harness.
This allows benchmarking Fast-LLM models across many standard tasks using the in-memory model during validation, leveraging the existing HuggingFace-compatible interface improved in #217.

Note that the current HuggingfaceGPTModelForCausalLM.from_pretrained(...) API always reloads the model from disk, which breaks the intended workflow of keeping the model sharded and in memory across all GPUs. We want to integrate with lm-eval-harness while reusing the model that is already in memory, avoiding redundant loading and eviction and keeping the integration simple.

🚀 Execution Plan

Step 1: Add from_existing_model() constructor

Add a new constructor method to HuggingfaceGPTModelForCausalLM that allows wrapping an existing GPTModel instance, e.g.

@classmethod
def from_existing_model(cls, model: GPTModel) -> "HuggingfaceGPTModelForCausalLM":
    # Wrap an already-constructed (and possibly sharded) GPTModel without
    # reading anything from disk.
    config = HuggingfaceGPTModelConfig(fast_llm_config=model.config)
    obj = cls(config)
    obj._fast_llm_model = model
    return obj

Notes:

  • HuggingfaceGPTModelConfig already holds a GPTModelConfig, so no need to explicitly construct it if we already have a GPTModel.
  • We need to assign fields like .runner and .schedule because they'll be used during generation.
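
For illustration, a minimal usage sketch (the fast_llm_model variable is hypothetical and stands in for the GPTModel instance that is already sharded and resident in GPU memory):

# Hypothetical usage: wrap the in-memory model without touching the checkpoint on disk.
hf_model = HuggingfaceGPTModelForCausalLM.from_existing_model(fast_llm_model)
assert hf_model._fast_llm_model is fast_llm_model  # same object: no reload, no copy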

Step 2: Implement a TemplateLM subclass for Fast-LLM

Create a subclass of lm_eval.api.model.TemplateLM that wraps an instance of HuggingfaceGPTModelForCausalLM and provides the required methods:

  • tok_encode()
  • loglikelihood(), loglikelihood_rolling()
  • generate_until()
  • eot_token_id

Use the HuggingFace tokenizer that pairs with the Fast-LLM model. Assume greedy decoding only. No need to support chat templates or SFT-specific tokenization quirks yet.
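
A minimal skeleton sketch of such a wrapper. The class name FastLLMWrapper and its constructor arguments are hypothetical, and the exact set of abstract methods varies between lm-eval versions; in recent versions, loglikelihood() is already provided by TemplateLM once _loglikelihood_tokens() is implemented.

from lm_eval.api.model import TemplateLM


class FastLLMWrapper(TemplateLM):
    def __init__(self, hf_model, tokenizer):
        super().__init__()
        self._model = hf_model        # HuggingfaceGPTModelForCausalLM wrapping the in-memory GPTModel
        self._tokenizer = tokenizer   # HF tokenizer paired with the Fast-LLM model

    @property
    def eot_token_id(self):
        return self._tokenizer.eos_token_id

    def tok_encode(self, string, **kwargs):
        return self._tokenizer.encode(string, add_special_tokens=False)

    def _loglikelihood_tokens(self, requests, **kwargs):
        # Batch (context, continuation) token pairs, run the wrapped model's
        # forward pass, and return (logprob, is_greedy) per request.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests, **kwargs):
        # Sliding-window loglikelihood over whole documents (perplexity-style tasks).
        raise NotImplementedError

    def generate_until(self, requests, **kwargs):
        # Greedy decoding until one of the task-supplied stop sequences is produced.
        raise NotImplementedError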

Step 3: Integration test

  • Load a small model like HuggingFaceTB/SmolLM2-135M-Instruct.
  • Wrap the in-memory Fast-LLM model using from_existing_model(...).
  • Use lm_eval.simple_evaluate(...) to run one or more tasks (e.g., hellaswag, arc_challenge, winogrande).
  • Validate that the reported metrics fall within the expected range for the model (a test sketch follows below).
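
A sketch of what such a test could look like. Helper names are hypothetical, FastLLMWrapper refers to the Step 2 sketch, and metric key names such as "acc,none" depend on the installed lm-eval version.

import lm_eval


def test_lm_eval_smollm(fast_llm_model, tokenizer):
    # Wrap the in-memory model and hand it to the harness as an LM instance.
    hf_model = HuggingfaceGPTModelForCausalLM.from_existing_model(fast_llm_model)
    lm = FastLLMWrapper(hf_model, tokenizer)
    results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag"], num_fewshot=0, limit=32)
    # Sanity check only: the metric exists and is a plausible accuracy value.
    acc = results["results"]["hellaswag"]["acc,none"]
    assert 0.0 <= acc <= 1.0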

Step 4: Extend Fast-LLM's validation config to support lm-eval-harness tasks

  • Extend the Fast-LLM config to accept a list of generative evaluation tasks using lm-eval-harness.
    • Fields to support:
      • tasks: list of task names (e.g. ["hellaswag", "arc_challenge"])
      • num_fewshot: number of few-shot examples to use per task.
  • Implement logic that:
    • Runs the lm-eval-harness only on global rank 0.
    • Constructs the TemplateLM wrapper for the in-memory Fast-LLM model.
    • Calls simple_evaluate(...) with the configured tasks.
    • Relies on Fast-LLM’s forward() for token-level inference, which is already distributed across GPUs and hosts.
  • Add support for logging results (e.g. to stdout and WandB), and disable lm-eval progress bars, since Fast-LLM typically runs in a headless environment (a sketch of the rank-0 hook follows below).
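
A sketch of the rank-0 evaluation hook implied by the bullets above. The config field names (tasks, num_fewshot), the FastLLMWrapper class, and the WandB handling are assumptions, not existing Fast-LLM APIs.

import json

import lm_eval


def run_lm_eval_validation(lm_eval_config, hf_model, tokenizer, global_rank, wandb_run=None):
    if global_rank != 0:
        # Non-zero ranks do not run the harness loop themselves; they only take part
        # in the distributed forward() calls driven from rank 0 via Fast-LLM's runner.
        return None
    lm = FastLLMWrapper(hf_model, tokenizer)  # TemplateLM subclass from Step 2
    results = lm_eval.simple_evaluate(
        model=lm,
        tasks=list(lm_eval_config.tasks),
        num_fewshot=lm_eval_config.num_fewshot,
    )
    print(json.dumps(results["results"], indent=2))  # plain-text summary to stdout
    if wandb_run is not None:
        # Log per-task metric dicts under an "lm_eval/" prefix.
        wandb_run.log({f"lm_eval/{task}": metrics for task, metrics in results["results"].items()})
    return results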

📌 Acceptance Criteria (Must-Haves for Completion)

  • Must be able to wrap an in-memory GPTModel in a HuggingfaceGPTModelForCausalLM via from_existing_model() without disk I/O.
  • Must implement a subclass of TemplateLM that:
    • Uses Fast-LLM's HuggingFace-compatible model (HuggingfaceGPTModelForCausalLM) for all inference.
    • Implements generate_until, loglikelihood, and loglikelihood_rolling.
    • Uses the correct tokenizer, PAD token ID, and EOS token ID.
  • Must support calling lm_eval.simple_evaluate(...) using the wrapped model and produce correct results.
  • Must extend Fast-LLM's validation/evaluation configuration to support:
    • Specifying lm-eval-harness tasks by name.
    • Setting num_fewshot.
  • Must ensure lm-eval-harness runs only on global rank 0, while model.forward() is transparently distributed using Fast-LLM’s runner logic.
  • Must include:
    • A working test that evaluates at least one lm-eval task on a small model (SmolLM2-135M-Instruct or similar).
    • Logging of evaluation results (stdout and WandB).
  • Implementation must be documented:
    • Example configs in the docs showing how to run lm-eval-harness generative benchmarks.

📎 Relevant Links

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
