3323 adaptive evaluation #3397

Draft
wants to merge 4 commits into main

Conversation

sangttruong

This PR improves evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:

export SUITE_NAME=adaptive
export MODELS_TO_RUN=meta-llama/Llama-3.2-3B-Instruct
export HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_adaptive_air_bench.conf
export SCHEMA_PATH=schema_air_bench.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --enable-huggingface-models $HF_MODEL --dry-run

Please let us know if you have any comments or feedback.

@yifanmai
Collaborator

yifanmai commented Mar 4, 2025

Thanks! Overall this looks good and I think we can merge it with some changes.

Questions:

  • What is the positioning of this tool? Is it meant to only be used to reproduce the paper's results, or to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?
  • Currently, the model ability and instance difficulty parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?
  • For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?
  • Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

High level feedback:

  • Changes to core framework code should be minimized, to avoid increasing software complexity and maintenance costs, and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there, rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.
  • If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.
  • Unfortunately AIR-Bench is not a good application for this, because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.
  • Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not true generally for all datasets.

In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

@sangttruong
Author

Hi @yifanmai! Thank you so much for your comments. We addressed most of them in the latest commit and include our replies to your questions below.

Questions

What is the positioning of this tool? Is it meant to be used only to reproduce the paper's results, to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?

We aim to make the tool production-ready and support most benchmarks in HELM.

Currently, the model ability and instance difficulty parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?

We will upload the difficulty parameters to Hugging Face.

For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?

We plan to set a default initial value of model ability (e.g., 0). The users can also set it (e.g., when they have good prior knowledge about the model's ability).

Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

We will provide a separate Python package to run the calibration and determine the difficulties of the questions in new datasets.

High-level feedback

Changes to core framework code should be minimized to avoid increasing software complexity and maintenance costs and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.

We have subclassed Runner as ReevalRunner in a separate file, reeval_runner.py, and run the code using --runner-class-name helm.benchmark.reeval_runner.ReevalRunner. The original Runner code now stays unchanged.
We also added two parameters, reeval_mode: bool = False and reeval_max_samples: int = 50, to src/helm/benchmark/run_spec.py, and created supporting files for each dataset. Using AIR-Bench as an example, we created run_entries_adaptive_air_bench.conf and added the function get_adaptive_air_bench_2024_spec to air_bench_run_specs.py.
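
In outline, the subclass looks like this (a simplified sketch rather than the exact code; the run_one override point shown here is illustrative):

    # Illustrative sketch only; the exact Runner method to override is an assumption.
    from helm.benchmark.run_spec import RunSpec
    from helm.benchmark.runner import Runner


    class ReevalRunner(Runner):
        """Runner that adaptively selects the next instance instead of running every instance."""

        def run_one(self, run_spec: RunSpec) -> None:
            # The adaptive (reeval) selection loop replaces the default execution here;
            # run specs without reeval configuration can fall back to the standard behavior.
            super().run_one(run_spec)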

If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.

We have changed all the adaptive names in the code to reeval (reliable and efficient evaluation), which is an abbreviation of our method's name. We can be very flexible with the naming. We will note in the documentation which benchmarks are supported by the tool.

Unfortunately, AIR-Bench is not a good application for this because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top-level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.

Besides AIR-Bench, we also implement our method on MMLU, which might not have the above problem.
To run MMLU using ReevalRunner:

cd src
export SUITE_NAME=subclass_reeval_mmlu_simple_model1
export MODELS_TO_RUN=simple/model1
export RUN_ENTRIES_CONF_PATH=helm/benchmark/presentation/run_entries_reeval_mmlu.conf
export SCHEMA_PATH=schema_mmlu.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=4
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --runner-class-name helm.benchmark.reeval_runner.ReevalRunner

Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not generally true for all datasets.

We now use the run_spec to find the main metric name and search the list of metrics:

    # Look up the main metric name from the run spec, then read its mean value from the per-instance stats.
    scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
    scenario_metric_value = [s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name][0].mean

I would appreciate any suggestions you might have for further improvement.

In terms of the next steps, we should resolve the questions in the section above. Then you can let me know when this is ready for a full code review, and I'll send you the requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

Yes, I will find a time in your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.

@yifanmai
Collaborator

yifanmai left a comment

Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.


for _ in tqdm(range(run_spec.reeval_max_samples), desc="Reeval execution"):
    try:
        selected_item = request_state_queue.get(block=False)
Collaborator

Do you need a priority queue? It looks like on each loop, you remove the item with the highest Fisher information, and then you recompute the Fisher information for all the items and rebuild the entire queue. It seems like you could just do this without maintaining a priority queue, which reduces the complexity from O(N log N) to O(N) per iteration.
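
Concretely, something like this toy sketch would do (standalone, with made-up difficulties; the informativeness score stands in for the Fisher information computation):

    import math

    def informativeness(ability: float, difficulty: float) -> float:
        # Fisher information of a Rasch-style item: p * (1 - p).
        p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        return p * (1 - p)

    remaining = [-1.2, 0.3, 0.8, 2.5]  # placeholder item difficulties
    ability = 0.0                      # current ability estimate
    max_samples = 2                    # e.g. reeval_max_samples

    for _ in range(min(max_samples, len(remaining))):
        # Linear scan: the scores change for every item after each ability update,
        # so rebuilding a heap each round buys nothing over a plain max().
        best = max(range(len(remaining)), key=lambda i: informativeness(ability, remaining[i]))
        selected_difficulty = remaining.pop(best)
        # ... execute the selected instance and update `ability` here ...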

Author

Thank you for pointing this out. We now use a list instead of a priority queue.

Comment on lines 90 to 93
ability = torch.tensor([model_ability])
difficulty = torch.tensor([instance_difficulty])
p = torch.sigmoid(ability + difficulty).item()
return -p * (1 - p)
Collaborator

I think you just need to find argmin abs(ability - difficulty) here, because Fisher information decreases monotonically with abs(ability - difficulty). Also, should ability + difficulty be ability - difficulty to match the paper (in which case, with the code's current sign convention, this would be argmin abs(ability + difficulty))?

Author

Following your recommendation, we now use (ability - difficulty) in the HELM codebase and select the item that minimizes abs(ability - difficulty) instead of calculating the Fisher information.
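
As a sanity check of that simplification: p * (1 - p) with p = sigmoid(ability - difficulty) peaks at p = 0.5, so the most informative item is exactly the one with the smallest abs(ability - difficulty). A toy check (made-up difficulties):

    import math

    def fisher_information(ability: float, difficulty: float) -> float:
        p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        return p * (1 - p)

    ability = 0.0
    difficulties = [-2.0, -0.5, 0.1, 1.3]  # made-up values

    # Both criteria select the same item.
    by_information = max(range(len(difficulties)), key=lambda i: fisher_information(ability, difficulties[i]))
    by_distance = min(range(len(difficulties)), key=lambda i: abs(ability - difficulties[i]))
    assert by_information == by_distance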

request_state: Any = dataclasses.field(compare=False)


class ReevalRunner(Runner):
Collaborator

Optional: I'm not a fan of "Reeval" as a name because it sounds too much like "re-evaluate"... Consider using a different name. Also, if you are planning to maintain this feature long-term for most datasets, then I would be okay with the original name of "Adaptive".

Collaborator

Or maybe RelEffEvalRunner (Reliable and Efficient Evaluations)?

Author

Thank you for the comment. We plan to change the class name back to Adaptive.

Comment on lines +205 to +209
# Get the instances necessary for this run.
max_eval_instances = run_spec.adapter_spec.max_eval_instances
eval_splits = run_spec.adapter_spec.eval_splits or EVAL_SPLITS
if max_eval_instances is not None:
    instances = downsample_eval_instances(instances, max_eval_instances, eval_splits)
Collaborator

Do you still need to pre-downsample before adaptive sampling (i.e. two runs of downsampling), or can you just skip this downsampling and do the adaptive sampling procedure over the whole dataset? It seems like the latter would be more desirable.

reeval_mode: bool = False
"""Whether to run in reeval mode -- select the most informative samples to evaluate next"""

reeval_max_samples: int = 50
Collaborator

Move this parameter into ReevalParameters inside AdapterSpec instead - see my comment in AdapterSpec.
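
A possible shape, as a sketch (field names, defaults, and the exact nesting are illustrative, not prescriptive):

    # Sketch of the suggested structure; exact field names and defaults are illustrative.
    from dataclasses import dataclass
    from typing import Optional


    @dataclass(frozen=True)
    class ReevalParameters:
        max_samples: int = 50
        """Maximum number of instances to select adaptively."""

        model_ability: float = 0.0
        """Initial estimate of the model's ability."""


    @dataclass(frozen=True)
    class AdapterSpec:
        # ... existing AdapterSpec fields ...
        reeval_parameters: Optional[ReevalParameters] = None
        """If set, the runner executes in reeval (adaptive) mode with these parameters."""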

Author

Done!

max_tokens=512,
temperature=0.0,
stop_sequences=[],
)
Collaborator

Set the additional parameters for reeval in AdapterSpec.

Author

Done!

input=Input(text=question),
references=list(map(answer_to_reference, answers)),
split=split,
extra_data={"difficulty": np.random.randn()},
Collaborator

Could you implement the Hugging Face dataset loading logic rather than using randn()?
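
For example, roughly like this (the dataset repository id and column names are placeholders for whatever you end up uploading):

    from datasets import load_dataset

    # Hypothetical repository id and column names; replace with the uploaded difficulty dataset.
    difficulty_dataset = load_dataset("your-org/reeval-difficulties", split="train")
    difficulty_by_id = {row["question_id"]: row["difficulty"] for row in difficulty_dataset}

    # ... then, when constructing each Instance:
    # extra_data={"difficulty": difficulty_by_id[question_id]}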

hlog(f"Skipping run {run_spec.name} because run is completed and all output files exist.")
return
ensure_directory_exists(run_path)

Collaborator

At the start of this method, check that we are in reeval mode and raise an error if we are not. Then you can remove all the if run_spec.reeval_mode below.
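
Something like this at the top of the method (a sketch against the current reeval_mode flag; adjust if the flag moves into AdapterSpec):

    # Guard clause at the start of the reeval run method (sketch).
    if not run_spec.reeval_mode:
        raise ValueError(f"ReevalRunner requires reeval mode, but run spec {run_spec.name} does not enable it.")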


# Execute the requests in a reeval manner
model_ability = run_spec.adapter_spec.model_ability
scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
Collaborator

This still isn't a good assumption. My suggestion is to add a correctness_metric_name to ReevalParameters that specifies the metric name.
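
For example, the lookup could then become something like this (a sketch; the field name and attribute paths are illustrative):

    # Sketch: read the correctness metric name from the reeval parameters rather than
    # assuming the first metric spec is the main accuracy metric.
    correctness_metric_name = run_spec.adapter_spec.reeval_parameters.correctness_metric_name
    matching_stats = [s for s in per_instance_stat[0].stats if s.name.name == correctness_metric_name]
    if not matching_stats:
        raise ValueError(f"Metric {correctness_metric_name} not found in per-instance stats")
    scenario_metric_value = matching_stats[0].mean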

Author

Done!

self,
model_ability: float,
instance_difficulty: float,
):
Collaborator

Add return type annotations to all methods.
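
For example, for the method shown above (the name here is a placeholder; the point is the -> float annotation):

    def compute_selection_score(  # placeholder name for the method shown above
        self,
        model_ability: float,
        instance_difficulty: float,
    ) -> float:
        ...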

Author

Done!
