🎯 Goal (What & Why)
Enable Fast-LLM to run structured evaluations using lm-eval-harness.
This allows benchmarking Fast-LLM models across many standard tasks using the in-memory model during validation, leveraging the existing HuggingFace-compatible interface improved in #217.
Note that the current HuggingfaceGPTModelForCausalLM.from_pretrained(...) API always reloads the model from disk. This breaks the intended workflow, where we keep the model sharded and in memory across all GPUs. We want to integrate with lm-eval-harness while reusing the model already in memory, avoiding redundant loading, avoiding eviction, and reducing complexity.
🚀 Execution Plan
Step 1: Add a from_existing_model() constructor
Add a new constructor method to HuggingfaceGPTModelForCausalLM that allows wrapping an existing GPTModel instance, e.g.:
```python
@classmethod
def from_existing_model(cls, model: GPTModel) -> "HuggingfaceGPTModelForCausalLM":
    # Wrap an already-instantiated GPTModel without reloading anything from disk.
    config = HuggingfaceGPTModelConfig(fast_llm_config=model.config)
    obj = cls(config)
    obj._fast_llm_model = model
    return obj
```
Notes:
- HuggingfaceGPTModelConfig already holds a GPTModelConfig, so there is no need to construct it explicitly if we already have a GPTModel.
- We need to assign fields like runner and schedule because they'll be used during generation.
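For illustration, a minimal sketch of how the constructor would be used at validation time; wrap_in_memory_model and the way the live GPTModel is obtained are assumptions, not existing Fast-LLM API:

```python
from fast_llm.models.gpt.huggingface import HuggingfaceGPTModelForCausalLM

# Today: reloads weights from a checkpoint on disk and rebuilds the model.
# hf_model = HuggingfaceGPTModelForCausalLM.from_pretrained(checkpoint_path)


def wrap_in_memory_model(gpt_model) -> HuggingfaceGPTModelForCausalLM:
    # Proposed: reuse the GPTModel that is already sharded across GPUs in memory.
    return HuggingfaceGPTModelForCausalLM.from_existing_model(gpt_model)
```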
Step 2: Implement a TemplateLM subclass for Fast-LLM
Create a subclass of lm_eval.api.model.TemplateLM that wraps an instance of HuggingfaceGPTModelForCausalLM and provides the required methods:
- tok_encode()
- loglikelihood()
- loglikelihood_rolling()
- generate_until()
- eot_token_id
Use the HuggingFace tokenizer that pairs with the Fast-LLM model. Assume greedy decoding only. No need to support chat templates or SFT-specific tokenization quirks yet.
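A minimal skeleton of such a wrapper, assuming a hypothetical class name FastLLMWrapper and a recent lm-eval version in which TemplateLM already implements loglikelihood() on top of an abstract _loglikelihood_tokens(); the exact abstract surface may differ across versions:

```python
from lm_eval.api.model import TemplateLM


class FastLLMWrapper(TemplateLM):
    """Thin lm-eval adapter around an in-memory HuggingfaceGPTModelForCausalLM."""

    def __init__(self, model, tokenizer):
        super().__init__()
        self._model = model          # HuggingfaceGPTModelForCausalLM, already in memory
        self._tokenizer = tokenizer  # HuggingFace tokenizer paired with the model

    @property
    def eot_token_id(self) -> int:
        # lm-eval uses this as the end-of-text (and default prefix) token.
        return self._tokenizer.eos_token_id

    def tok_encode(self, string: str, **kwargs) -> list[int]:
        return self._tokenizer.encode(string, add_special_tokens=False)

    def _loglikelihood_tokens(self, requests, **kwargs):
        # For each (context, continuation) token pair: run forward(), sum the
        # log-probabilities of the continuation tokens, and report greedy match.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Full-sequence log-likelihood, chunked to the model's maximum length.
        raise NotImplementedError

    def generate_until(self, requests):
        # Greedy decoding until a stop sequence or the max generation length.
        raise NotImplementedError
```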
Step 3: Integration test
- Load a small model like HuggingFaceTB/SmolLM2-135M-Instruct.
- Wrap the in-memory Fast-LLM model using from_existing_model(...).
- Use lm_eval.simple_evaluate(...) to run one or more tasks (e.g. hellaswag, arc_challenge, winogrande), as sketched below.
- Validate that results match expectations.
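A rough shape for the test, assuming the from_existing_model() constructor from Step 1 and the FastLLMWrapper sketched in Step 2; the gpt_model and tokenizer fixtures are placeholders for however the test loads SmolLM2-135M-Instruct into Fast-LLM:

```python
import lm_eval

from fast_llm.models.gpt.huggingface import HuggingfaceGPTModelForCausalLM


def test_lm_eval_on_in_memory_model(gpt_model, tokenizer):
    # Wrap the already-loaded GPTModel; no checkpoint is re-read from disk.
    hf_model = HuggingfaceGPTModelForCausalLM.from_existing_model(gpt_model)
    lm = FastLLMWrapper(hf_model, tokenizer)  # TemplateLM subclass from Step 2

    results = lm_eval.simple_evaluate(
        model=lm,
        tasks=["hellaswag", "arc_challenge", "winogrande"],
        num_fewshot=0,
        limit=32,  # small slice to keep the test fast; drop for a full run
    )

    # Every requested task should produce metrics.
    for task in ("hellaswag", "arc_challenge", "winogrande"):
        assert task in results["results"]
```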
Step 4: Extend Fast-LLM's validation config to support lm-eval-harness tasks
- Extend the Fast-LLM config to accept a list of lm-eval-harness evaluation tasks.
- Fields to support:
  - tasks: list of task names (e.g. ["hellaswag", "arc_challenge"]).
  - num_fewshot: number of few-shot examples to use per task.
- Implement logic that (see the sketch after this list):
  - Runs lm-eval-harness only on global rank 0.
  - Constructs the TemplateLM wrapper for the in-memory Fast-LLM model.
  - Calls simple_evaluate(...) with the configured tasks.
  - Relies on Fast-LLM's forward() for token-level inference, which is already distributed across GPUs and hosts.
- Add support for logging results (e.g. to stdout and WandB), and disable lm-eval progress bars because Fast-LLM typically runs headless.
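A minimal sketch of the config fields and the rank-0 gating described above, using a plain dataclass rather than Fast-LLM's actual config classes; LmEvalConfig, run_lm_eval_harness, and is_rank_zero are illustrative names only:

```python
from dataclasses import dataclass, field

import lm_eval


@dataclass
class LmEvalConfig:
    # lm-eval-harness task names, e.g. ["hellaswag", "arc_challenge"].
    tasks: list[str] = field(default_factory=list)
    # Number of few-shot examples per task.
    num_fewshot: int = 0


def run_lm_eval_harness(config: LmEvalConfig, lm, is_rank_zero: bool):
    # Only global rank 0 drives the harness; the other ranks keep serving the
    # distributed forward() passes through Fast-LLM's runner.
    if not config.tasks or not is_rank_zero:
        return None
    results = lm_eval.simple_evaluate(
        model=lm,  # the TemplateLM wrapper around the in-memory model
        tasks=config.tasks,
        num_fewshot=config.num_fewshot,
    )
    # Results would then be logged to stdout / WandB by the caller.
    return results
```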
📌 Acceptance Criteria (Must-Haves for Completion)
- Must be able to wrap an in-memory GPTModel in a HuggingfaceGPTModelForCausalLM via from_existing_model() without disk I/O.
- Must implement a subclass of TemplateLM that:
  - Uses Fast-LLM's HuggingFace-compatible model (HuggingfaceGPTModelForCausalLM) for all inference.
  - Implements generate_until, loglikelihood, and loglikelihood_rolling.
  - Uses the correct tokenizer, PAD token ID, and EOS token ID.
- Must support calling lm_eval.simple_evaluate(...) using the wrapped model and produce correct results.
- Must extend Fast-LLM's validation/evaluation configuration to support:
  - Specifying lm-eval-harness tasks by name.
  - Setting num_fewshot.
- Must ensure lm-eval-harness runs only on global rank 0, while model.forward() is transparently distributed using Fast-LLM's runner logic.
- Must include:
  - A working test that evaluates at least one lm-eval task on a small model (SmolLM2-135M-Instruct or similar).
  - Logging of evaluation results (stdout and WandB).
- Implementation must be documented:
  - Configs in docs that show how to run lm-eval's generative benchmarks.
📎 Relevant Links
- lm-eval-harness interface guide (TemplateLM interface): https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/model.py#L253
- Fast-LLM HF model entry point: https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/huggingface.py
🛠️ Project Management
- Assign the issue to the Fast-LLM project.
- Set the Estimate field (in days) in the GitHub project.
- Use the Size field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.