3323 adaptive evaluation #3397
Conversation
Thanks! Overall this looks good and I think we can merge it with some changes.

Questions:

High-level feedback:

In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.
Hi @yifanmai! Thank you so much for your comments. We addressed most of them in the latest commit and have included replies to your questions below.

Questions:

- We aim to make the tool production-ready and to support most benchmarks in HELM.
- We will upload the difficulties to Hugging Face.
- We plan to set a default initial value for the model ability (e.g., 0). Users can also set it themselves (e.g., when they have good prior knowledge about the model's ability).
- We will provide a separate Python package to compute the calibration and determine the difficulties of the questions in new datasets (a minimal illustrative sketch follows this reply).

High-level feedback:

- We have subclassed a
- We have changed all the
- Besides AIR-Bench, we implement our method on MMLU, which might not have the above problems.
- We now use the run_spec to find the main metric name and the list of metrics to search.
- I would appreciate any suggestions you might have for further improvement.

Yes, I will find a time on your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.
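As a purely illustrative aside on the calibration package mentioned above: below is a minimal sketch of fitting per-question difficulties with a 1PL (Rasch-style) IRT model in PyTorch. The function name, the response-matrix layout, and the optimizer settings are all assumptions for illustration, not the planned package's API.

```python
import torch


def fit_difficulties(correct: torch.Tensor, num_steps: int = 2000, lr: float = 0.05):
    """Fit a 1PL (Rasch) IRT model to a binary response matrix.

    correct: float tensor of shape (num_models, num_questions) with 0/1 outcomes.
    Returns (abilities, difficulties) as 1-D tensors.
    """
    num_models, num_questions = correct.shape
    ability = torch.zeros(num_models, requires_grad=True)
    difficulty = torch.zeros(num_questions, requires_grad=True)
    optimizer = torch.optim.Adam([ability, difficulty], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        # p(correct) = sigmoid(ability - difficulty), broadcast over the model x question grid
        logits = ability[:, None] - difficulty[None, :]
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, correct)
        loss.backward()
        optimizer.step()
    return ability.detach(), difficulty.detach()
```

The fitted per-question difficulties are what would then be published (e.g., on Hugging Face) and consumed at evaluation time.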
Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.
src/helm/benchmark/reeval_runner.py (outdated):

```python
for _ in tqdm(range(run_spec.reeval_max_samples), desc="Reeval execution"):
    try:
        selected_item = request_state_queue.get(block=False)
```
Do you need a priority queue? It looks like on each loop, you remove the item with the highest Fisher information, and then you recompute the Fisher information for all the items and rebuild the entire queue. It seems like you could just do this without maintaining a priority queue, which reduces the complexity from O(N log N) to O(N) per iteration.
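For concreteness, here is a minimal sketch of the linear-scan alternative; `remaining` and the `.difficulty` attribute are placeholders, not the PR's actual data structures.

```python
import math


def item_fisher_information(ability: float, difficulty: float) -> float:
    # Fisher information of a 1PL item: p * (1 - p) with p = sigmoid(ability - difficulty).
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


def select_next_item(remaining: list, ability: float):
    # One O(N) pass per iteration: take the item with the highest Fisher information,
    # instead of recomputing priorities and rebuilding a heap.
    best_index = max(
        range(len(remaining)),
        key=lambda i: item_fisher_information(ability, remaining[i].difficulty),
    )
    return remaining.pop(best_index)
```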
Thank you for pointing this out. We now use a list instead of a priority queue.
src/helm/benchmark/reeval_runner.py (outdated):

```python
ability = torch.tensor([model_ability])
difficulty = torch.tensor([instance_difficulty])
p = torch.sigmoid(ability + difficulty).item()
return -p * (1 - p)
```
I think you just need to find the argmin of `abs(ability - difficulty)` here, because the Fisher information decreases monotonically with `abs(ability - difficulty)`. Also, should `ability + difficulty` be `ability - difficulty` to match the paper? (As written, the code is effectively selecting on `abs(ability + difficulty)`.)
Following your recommendation, we now use `(ability - difficulty)` in the HELM codebase and select the item that minimizes `abs(ability - difficulty)` instead of computing the Fisher information.
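A small numerical check of the equivalence used here (illustrative only): with p = sigmoid(ability - difficulty), the Fisher information p * (1 - p) is largest exactly where abs(ability - difficulty) is smallest, so the argmin selection picks the same item.

```python
import math


def fisher_information(ability: float, difficulty: float) -> float:
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


ability = 0.3
difficulties = [-2.0, -0.5, 0.4, 1.7]

by_fisher = max(range(len(difficulties)), key=lambda i: fisher_information(ability, difficulties[i]))
by_distance = min(range(len(difficulties)), key=lambda i: abs(ability - difficulties[i]))
assert by_fisher == by_distance == 2  # both pick the item closest to the current ability estimate
```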
src/helm/benchmark/reeval_runner.py (outdated):

```python
request_state: Any = dataclasses.field(compare=False)


class ReevalRunner(Runner):
```
Optional: I'm not a fan of "Reeval" as a name because it sounds too much like "re-evaluate"... Consider using a different name. Also, if you are planning to maintain this feature long-term for most datasets, then I would be okay with the original name of "Adaptive".
Or maybe `RelEffEvalRunner` (Reliable and Efficient Evaluations)?
Thank you for the comment. We plan to change the class name back to `Adaptive`.
```python
# Get the instances necessary for this run.
max_eval_instances = run_spec.adapter_spec.max_eval_instances
eval_splits = run_spec.adapter_spec.eval_splits or EVAL_SPLITS
if max_eval_instances is not None:
    instances = downsample_eval_instances(instances, max_eval_instances, eval_splits)
```
Do you still need to pre-downsample before adaptive sampling (i.e. two runs of downsampling), or can you just skip this downsampling and do the adaptive sampling procedure over the whole dataset? It seems like the latter would be more desirable.
src/helm/benchmark/run_spec.py (outdated):

```python
reeval_mode: bool = False
"""Whether to run in reeval mode -- select the most informative samples to evaluate next"""

reeval_max_samples: int = 50
```
Move this parameter into `ReevalParameters` inside `AdapterSpec` instead - see my comment on `AdapterSpec`.
Done!
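For reference, a rough sketch of how such a nested parameters object might look; field names other than those discussed in this thread are assumptions, and the real `AdapterSpec` has many more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReevalParameters:
    """Parameters that control adaptive ("reeval") execution."""

    max_samples: int = 50
    """Maximum number of instances to select adaptively."""

    model_ability: float = 0.0
    """Initial estimate of the model's ability."""

    correctness_metric_name: Optional[str] = None
    """Name of the metric used as the binary correctness signal."""


@dataclass(frozen=True)
class AdapterSpec:
    # ...existing fields elided...
    reeval_parameters: Optional[ReevalParameters] = None
    """If set, the runner executes in adaptive (reeval) mode."""
```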
```python
max_tokens=512,
temperature=0.0,
stop_sequences=[],
)
```
Set the additional parameters for reeval in `AdapterSpec`.
Done!
```python
input=Input(text=question),
references=list(map(answer_to_reference, answers)),
split=split,
extra_data={"difficulty": np.random.randn()},
```
Could you implement the Hugging Face dataset loading logic rather than using `randn()`?
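A hedged sketch of what that loading could look like with the `datasets` library; the repository name and column names below are placeholders, since the actual difficulty dataset isn't named in this thread.

```python
from datasets import load_dataset

# Placeholder repo and columns: substitute the real difficulty dataset once it is published.
difficulty_rows = load_dataset("your-org/air-bench-difficulties", split="test")
difficulty_by_question = {row["question_id"]: float(row["difficulty"]) for row in difficulty_rows}

# Then, when building instances, look up the precomputed value instead of sampling randn():
#   extra_data={"difficulty": difficulty_by_question[question_id]}
```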
hlog(f"Skipping run {run_spec.name} because run is completed and all output files exist.") | ||
return | ||
ensure_directory_exists(run_path) | ||
|
At the start of this method, check that we are in reeval mode and raise an error if we are not. Then you can remove all the `if run_spec.reeval_mode` checks below.
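For example, a small guard along these lines (assuming the mode flag keeps its current name; `run_spec` here is whatever spec object the method receives):

```python
def check_reeval_mode(run_spec) -> None:
    """Fail fast if ReevalRunner is used outside of reeval mode."""
    if not getattr(run_spec, "reeval_mode", False):
        raise ValueError(f"ReevalRunner requires reeval mode, but run spec '{run_spec.name}' is not in reeval mode")
```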
src/helm/benchmark/reeval_runner.py (outdated):

```python
# Execute the requests in a reeval manner
model_ability = run_spec.adapter_spec.model_ability
scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
```
This still isn't a good assumption. My suggestion is to add a `correctness_metric_name` to `ReevalParameters` that specifies the metric name.
Done!
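Once the correctness value for the selected instance is read via `correctness_metric_name`, the ability estimate can be re-fit. Below is a minimal sketch assuming a 1PL model and a simple gradient-based MLE, which is not necessarily the PR's actual update rule.

```python
import torch


def update_ability(ability: float, difficulties: list, outcomes: list,
                   num_steps: int = 100, lr: float = 0.1) -> float:
    """Re-estimate model ability by MLE over the (difficulty, correctness) pairs observed so far."""
    theta = torch.tensor(float(ability), requires_grad=True)
    b = torch.tensor([float(d) for d in difficulties])
    y = torch.tensor([float(o) for o in outcomes])
    optimizer = torch.optim.Adam([theta], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        # p(correct) = sigmoid(theta - b) under the 1PL model
        loss = torch.nn.functional.binary_cross_entropy_with_logits(theta - b, y)
        loss.backward()
        optimizer.step()
    return theta.item()
```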
src/helm/benchmark/reeval_runner.py (outdated):

```python
self,
model_ability: float,
instance_difficulty: float,
):
```
Add return type annotations to all methods.
Done!
This PR improves evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:

Please let us know if you have any comments or feedback.