3323 adaptive evaluation #3397

Draft
wants to merge 4 commits into main

Conversation

sangttruong

This PR improves evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:

export SUITE_NAME=adaptive
export MODELS_TO_RUN=meta-llama/Llama-3.2-3B-Instruct
export HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_adaptive_air_bench.conf
export SCHEMA_PATH=schema_air_bench.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --enable-huggingface-models $HF_MODEL --dry-run

Please let us know if you have any comments or feedback.

@yifanmai
Collaborator

yifanmai commented Mar 4, 2025

Thanks! Overall this looks good and I think we can merge it with some changes.

Questions:

  • What is the positioning of this tool? Is it meant to only be used to reproduce the paper's results, or to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?
  • Currently, the model ability and instance difficulty parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?
  • For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?
  • Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

High level feedback:

  • Changes to core framework code should be minimized, to avoid increasing software complexity and maintenance costs, and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there, rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.
  • If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.
  • Unfortunately AIR-Bench is not a good application for this, because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.
  • Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not true generally for all datasets.

In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

@sangttruong
Author

Hi @yifanmai! Thank you so much for your comments. We addressed most of them in the latest commit and include our replies to your questions below.

Questions

What is the positioning of this tool? Is it meant to be used only to reproduce the paper's results, to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?

We aim to make the tool production-ready and support most benchmarks in HELM.

Currently, the model ability and instance difficulty parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?

We will upload the difficulty parameters to Hugging Face.

For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?

We plan to set a default initial value of model ability (e.g., 0). The users can also set it (e.g., when they have good prior knowledge about the model's ability).

Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

We will provide a separate Python package to run the calibration and determine the difficulties of the questions in new datasets.

High-level feedback

Changes to core framework code should be minimized to avoid increasing software complexity and maintenance costs and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.

We have subclassed Runner as ReevalRunner in a separate file, reeval_runner.py, and run the code using --runner-class-name helm.benchmark.reeval_runner.ReevalRunner. The original Runner code now stays unchanged.
We also added two parameters, reeval_mode: bool = False and reeval_max_samples: int = 50, to src/helm/benchmark/run_spec.py, and created supporting files for each dataset. Using AIR-Bench as an example, we created run_entries_adaptive_air_bench.conf and added the function get_adaptive_air_bench_2024_spec to air_bench_run_specs.py.
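
In outline, the subclass looks like this (a simplified sketch rather than the exact code; the run_one override point shown here is illustrative):

    # Illustrative sketch only; the exact Runner method to override is an assumption.
    from helm.benchmark.run_spec import RunSpec
    from helm.benchmark.runner import Runner


    class ReevalRunner(Runner):
        """Runner that adaptively selects the next instance instead of running every instance."""

        def run_one(self, run_spec: RunSpec) -> None:
            # The adaptive (reeval) selection loop replaces the default execution here;
            # run specs without reeval configuration can fall back to the standard behavior.
            super().run_one(run_spec)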

If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.

We have changed all the adaptive names in the code to reeval (reliable and efficient evaluation), which is an abbreviation of our method's name. We can be very flexible with the naming. We will note in the documentation which benchmarks are supported by the tool.

Unfortunately, AIR-Bench is not a good application for this because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top-level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.

Besides AIR-Bench, we also implement our method on MMLU, which might not have the above problem.
To run MMLU using ReevalRunner:

cd src
export SUITE_NAME=subclass_reeval_mmlu_simple_model1
export MODELS_TO_RUN=simple/model1
export RUN_ENTRIES_CONF_PATH=helm/benchmark/presentation/run_entries_reeval_mmlu.conf
export SCHEMA_PATH=schema_mmlu.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=4
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --runner-class-name helm.benchmark.reeval_runner.ReevalRunner

Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not generally true for all datasets.

We now use the run_spec to find the main metric name and search the list of metrics:

    # Look up the main metric name from the run spec, then read its mean value from the per-instance stats.
    scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
    scenario_metric_value = [s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name][0].mean

I would appreciate any suggestions you might have for further improvement.

In terms of the next steps, we should resolve the questions in the section above. Then you can let me know when this is ready for a full code review, and I'll send you the requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

Yes, I will find a time in your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.

@yifanmai
Collaborator

yifanmai left a comment

Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.


for _ in tqdm(range(run_spec.reeval_max_samples), desc="Reeval execution"):
    try:
        selected_item = request_state_queue.get(block=False)
Collaborator

Do you need a priority queue? It looks like on each loop, you remove the item with the highest Fisher information, and then you recompute the Fisher information for all the items and rebuild the entire queue. It seems like you could just do this without maintaining a priority queue, which reduces the complexity from O(N log N) to O(N) per iteration.
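
Concretely, something like this toy sketch would do (standalone, with made-up difficulties; the informativeness score stands in for the Fisher information computation):

    import math

    def informativeness(ability: float, difficulty: float) -> float:
        # Fisher information of a Rasch-style item: p * (1 - p).
        p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        return p * (1 - p)

    remaining = [-1.2, 0.3, 0.8, 2.5]  # placeholder item difficulties
    ability = 0.0                      # current ability estimate
    max_samples = 2                    # e.g. reeval_max_samples

    for _ in range(min(max_samples, len(remaining))):
        # Linear scan: the scores change for every item after each ability update,
        # so rebuilding a heap each round buys nothing over a plain max().
        best = max(range(len(remaining)), key=lambda i: informativeness(ability, remaining[i]))
        selected_difficulty = remaining.pop(best)
        # ... execute the selected instance and update `ability` here ...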

Author

Thank you for pointing this out. We now use a list instead of a priority queue.

Comment on lines 90 to 93
ability = torch.tensor([model_ability])
difficulty = torch.tensor([instance_difficulty])
p = torch.sigmoid(ability + difficulty).item()
return -p * (1 - p)
Collaborator

I think you just need to find argmin abs(ability - difficulty) here, because Fisher information decreases monotonically with abs(ability - difficulty). Also, should ability + difficulty be ability - difficulty to match the paper (in which case, with the code's current sign convention, this would be argmin abs(ability + difficulty))?

Author

Following your recommendation, we now use (ability - difficulty) in the HELM codebase and select the item that minimizes abs(ability - difficulty) instead of calculating the Fisher information.
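
As a sanity check of that simplification: p * (1 - p) with p = sigmoid(ability - difficulty) peaks at p = 0.5, so the most informative item is exactly the one with the smallest abs(ability - difficulty). A toy check (made-up difficulties):

    import math

    def fisher_information(ability: float, difficulty: float) -> float:
        p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        return p * (1 - p)

    ability = 0.0
    difficulties = [-2.0, -0.5, 0.1, 1.3]  # made-up values

    # Both criteria select the same item.
    by_information = max(range(len(difficulties)), key=lambda i: fisher_information(ability, difficulties[i]))
    by_distance = min(range(len(difficulties)), key=lambda i: abs(ability - difficulties[i]))
    assert by_information == by_distance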

request_state: Any = dataclasses.field(compare=False)


class ReevalRunner(Runner):
Collaborator

Optional: I'm not a fan of "Reeval" as a name because it sounds too much like "re-evaluate"... Consider using a different name. Also, if you are planning to maintain this feature long-term for most datasets, then I would be okay with the original name of "Adaptive".

Collaborator

Or maybe RelEffEvalRunner (Reliable and Efficient Evaluations)?

Author

Thank you for the comment. We plan to change the class name back to Adaptive.

Comment on lines +205 to +209
# Get the instances necessary for this run.
max_eval_instances = run_spec.adapter_spec.max_eval_instances
eval_splits = run_spec.adapter_spec.eval_splits or EVAL_SPLITS
if max_eval_instances is not None:
    instances = downsample_eval_instances(instances, max_eval_instances, eval_splits)
Collaborator

Do you still need to pre-downsample before adaptive sampling (i.e. two runs of downsampling), or can you just skip this downsampling and do the adaptive sampling procedure over the whole dataset? It seems like the latter would be more desirable.

reeval_mode: bool = False
"""Whether to run in reeval mode -- select the most informative samples to evaluate next"""

reeval_max_samples: int = 50
Collaborator

Move this parameter into ReevalParameters inside AdapterSpec instead - see my comment in AdapterSpec.
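
A possible shape, as a sketch (field names, defaults, and the exact nesting are illustrative, not prescriptive):

    # Sketch of the suggested structure; exact field names and defaults are illustrative.
    from dataclasses import dataclass
    from typing import Optional


    @dataclass(frozen=True)
    class ReevalParameters:
        max_samples: int = 50
        """Maximum number of instances to select adaptively."""

        model_ability: float = 0.0
        """Initial estimate of the model's ability."""


    @dataclass(frozen=True)
    class AdapterSpec:
        # ... existing AdapterSpec fields ...
        reeval_parameters: Optional[ReevalParameters] = None
        """If set, the runner executes in reeval (adaptive) mode with these parameters."""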

Author

Done!

max_tokens=512,
temperature=0.0,
stop_sequences=[],
)
Collaborator

Set the additional parameters for reeval in AdapterSpec.

Author

Done!

input=Input(text=question),
references=list(map(answer_to_reference, answers)),
split=split,
extra_data={"difficulty": np.random.randn()},
Collaborator

Could you implement the Hugging Face dataset loading logic rather than using randn()?
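
For example, roughly like this (the dataset repository id and column names are placeholders for whatever you end up uploading):

    from datasets import load_dataset

    # Hypothetical repository id and column names; replace with the uploaded difficulty dataset.
    difficulty_dataset = load_dataset("your-org/reeval-difficulties", split="train")
    difficulty_by_id = {row["question_id"]: row["difficulty"] for row in difficulty_dataset}

    # ... then, when constructing each Instance:
    # extra_data={"difficulty": difficulty_by_id[question_id]}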

hlog(f"Skipping run {run_spec.name} because run is completed and all output files exist.")
return
ensure_directory_exists(run_path)

Collaborator

At the start of this method, check that we are in reeval mode and raise an error if we are not. Then you can remove all the if run_spec.reeval_mode below.
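
Something like this at the top of the method (a sketch against the current reeval_mode flag; adjust if the flag moves into AdapterSpec):

    # Guard clause at the start of the reeval run method (sketch).
    if not run_spec.reeval_mode:
        raise ValueError(f"ReevalRunner requires reeval mode, but run spec {run_spec.name} does not enable it.")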


# Execute the requests in a reeval manner
model_ability = run_spec.adapter_spec.model_ability
scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
Collaborator

This still isn't a good assumption. My suggestion is to add a correctness_metric_name to ReevalParameters that specifies the metric name.
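
For example, the lookup could then become something like this (a sketch; the field name and attribute paths are illustrative):

    # Sketch: read the correctness metric name from the reeval parameters rather than
    # assuming the first metric spec is the main accuracy metric.
    correctness_metric_name = run_spec.adapter_spec.reeval_parameters.correctness_metric_name
    matching_stats = [s for s in per_instance_stat[0].stats if s.name.name == correctness_metric_name]
    if not matching_stats:
        raise ValueError(f"Metric {correctness_metric_name} not found in per-instance stats")
    scenario_metric_value = matching_stats[0].mean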

Author

Done!

self,
model_ability: float,
instance_difficulty: float,
):
Collaborator

Add return type annotations to all methods.
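
For example, for the method shown above (the name here is a placeholder; the point is the -> float annotation):

    def compute_selection_score(  # placeholder name for the method shown above
        self,
        model_ability: float,
        instance_difficulty: float,
    ) -> float:
        ...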

Author

Done!
