
Adding the IFEval scenario #3122

Open · liamjxu wants to merge 8 commits into main

Conversation

@liamjxu (Contributor) commented Oct 31, 2024

Adding the scenario, run specs, and metric for IFEval. No new adapters were added; the existing GenerationAdapter is reused instead.

@liamjxu self-assigned this Oct 31, 2024

@yifanmai (Collaborator) left a comment

Awesome, thanks! Left some comments.

Could you open a separate pull request to add IFEval to schema_lite_v2.yaml? Basically, follow what has already been done with gpqa. You'll also need to add ifeval_strict_accuracy to the metrics (follow exact_match).

super().__init__()

def get_instances(self, output_path: str) -> List[Instance]:
# Get GPQA from HuggingFace
Collaborator:

Delete this comment.

Contributor Author:

Addressed in the latest change.


name = "ifeval"
description = "Instruction-Following Evaluation for Large Language Models"
tags = ["question answering"]
Collaborator:

"instruction following"

Contributor Author:

Addressed in the latest change.

input=input,
references=[],
split=TEST_SPLIT,
extra_data={"instruction_id_list": row["instruction_id_list"], "question_kwargs": row["kwargs"]},
Collaborator:

nit:

"instruction_id_list" -> "instruction_ids"
"question_kwargs" -> "instruction_kwargs"

Update the key names in the metrics as well.
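
For concreteness, the instance construction from the excerpt above would become roughly the following (a sketch only; just the extra_data keys change, the surrounding variables are taken from the excerpt):

instance = Instance(
    input=input,
    references=[],
    split=TEST_SPLIT,
    extra_data={"instruction_ids": row["instruction_id_list"], "instruction_kwargs": row["kwargs"]},
)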

Contributor Author:

Addressed in the latest change.

Comment on lines 170 to 171
def get_ifeval_metric_specs() -> List[MetricSpec]:
return [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")]
Collaborator:

This is scenario-specific so don't put it in common_metric_specs; just inline adapter_specs = [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")] in the run spec function.

Contributor Author:

Got it, thanks!

I think you meant metric_specs = ..., not adapter?

With that assumption, addressed in the latest change.
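
For illustration, the run spec function with the metric spec inlined might look roughly like the sketch below. Helper names such as run_spec_function and get_generation_adapter_spec, and the adapter parameter values, are assumptions here rather than quotes from this diff:

@run_spec_function("ifeval")
def get_ifeval_spec() -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.ifeval_scenario.IFEvalScenario", args={}
    )
    # Reuse the generic generation adapter; the PR adds no new adapter.
    adapter_spec = get_generation_adapter_spec(max_tokens=1024, temperature=0.0)
    # Scenario-specific metric, inlined here instead of being added to common_metric_specs.
    metric_specs = [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")]
    return RunSpec(
        name="ifeval",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["ifeval"],
    )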

Collaborator:

  1. Move these files to an ifeval subpackage within metrics (except ifeval_metrics.py, which can stay under metrics).
  2. Don't run the linter on this file; just reproduce the raw contents exactly (except for import statements). This makes it easier for someone to audit that the code is unchanged using diff.
  3. Add the following lines to the start of the file to skip the linter:
# flake8: noqa
# type: ignore
# The following code has been reproduced with minor modifications to `import` statements from the following URL:
# https://github.com/google-research/google-research/blob/c7f60c013623e613732a096e2a0c2872491ec912/instruction_following_eval/instructions.py

Tip: you can get the permalink version of the GitHub URL with the githash by going to the latest version and pressing 'y' on your keyboard.

Likewise for the other ifeval_instructions* files.

Contributor Author:

Addressed in the latest change.

P.S. Thanks for the tip, it's convenient! The shortcut for me is Shift+Ctrl+, for some unknown reason, but it works too.

from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat

import src.helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry
Collaborator:

Remove src. from this import.

Contributor Author:

Addressed in the latest change.

Comment on lines 27 to 29
import src.helm.benchmark.metrics.ifeval_instructions_util as instructions_util

from src.helm.benchmark.metrics.ifeval_instructions_util import LANGUAGE_CODES
Collaborator:

Delete src. from these imports.

Likewise for the other files.
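
In other words, the corrected lines would simply drop the leading src. (module paths shown as in this diff; a separate comment above also asks to move these modules into an ifeval subpackage, which would change the paths again):

import helm.benchmark.metrics.ifeval_instructions_util as instructions_util

from helm.benchmark.metrics.ifeval_instructions_util import LANGUAGE_CODES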

Contributor Author:

Addressed in the latest change.

response = request_state.result.completions[0].text.strip()

is_following_list = []
for index, instruction_id in enumerate(instruction_id_list):
Collaborator:

Contributor Author:

Addressed in the latest change.

else:
is_following_list.append(0)

return [Stat(MetricName("strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]
Collaborator:

nit: "strict_accuracy" -> "ifeval_strict_accuracy" - the name is generic enough that we should probably namespace it.

Contributor Author:

Addressed in the latest change.
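
Assembling the pieces from this thread, a rough sketch of the full per-instance check with the renamed stat. The evaluate_generation signature follows HELM's Metric interface, and the build_description / check_following calls and the kwargs filtering follow the upstream IFEval instruction API, so details may differ from the final diff:

def evaluate_generation(
    self,
    adapter_spec: AdapterSpec,
    request_state: RequestState,
    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.instance.extra_data
    instruction_ids = request_state.instance.extra_data["instruction_ids"]
    instruction_kwargs = request_state.instance.extra_data["instruction_kwargs"]

    assert request_state.result
    response = request_state.result.completions[0].text.strip()

    is_following_list = []
    for index, instruction_id in enumerate(instruction_ids):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)
        # Each instruction is parameterized by the kwargs carried in the instance's extra_data;
        # unset (None) values are dropped before being passed along.
        kwargs = {k: v for k, v in instruction_kwargs[index].items() if v is not None}
        instruction.build_description(**kwargs)
        if response.strip() and instruction.check_following(response):
            is_following_list.append(1)
        else:
            is_following_list.append(0)

    return [Stat(MetricName("ifeval_strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]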

@yifanmai (Collaborator) commented:

Also I have no idea why the tests are failing... I'll look into it.

@liamjxu (Contributor Author) commented Oct 31, 2024

> Also I have no idea why the tests are failing... I'll look into it.

I think this is because the IFEval code has two dependencies that are not in helm yet: langdetect and immutabledict.

I installed them locally to make it run.

@liamjxu (Contributor Author) commented Oct 31, 2024

> Awesome, thanks! Left some comments.
>
> Could you open a separate pull request to add IFEval to schema_lite_v2.yaml? Basically, follow what has already been done with gpqa. You'll also need to add ifeval_strict_accuracy to the metrics (follow exact_match).

Sure, I will do this after addressing all the comments.

@liamjxu (Contributor Author) commented Oct 31, 2024

Found 12 errors in 2 files (checked 650 source files)
src/helm/benchmark/scenarios/test_ifeval_scenario.py:31: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:32: error: Invalid index type "str | Any" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:33: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:34: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:35: error: Invalid index type "str | Any" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:22: error: Value of type "dict[str, str] | None" is not indexable  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:23: error: Value of type "dict[str, str] | None" is not indexable  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:34: error: Module has no attribute "INSTRUCTION_DICT"  [attr-defined]
src/helm/benchmark/metrics/ifeval_metrics.py:37: error: Item "str" of "str | Any" has no attribute "items"  [union-attr]
Error: Process completed with exit code 1.

I looked into the test failures and realized that the type checker was failing.

In the current implementation, the extra_data field in the Instance class is annotated with type Optional[Dict[str, str]], yet in IFEval, instruction_ids maps to a list of strings and instruction_kwargs maps to a list of dictionaries.

@yifanmai Should we linearize IFEval's extra data, or should we update the type annotation of the extra_data field?

@liamjxu requested a review from yifanmai on October 31, 2024 20:32
@yifanmai (Collaborator) commented:

Let's change extra_data to type Dict[str, Any]. Does this make the type checker pass? My opinion is that the value can consist of any JSON-serializable object, including nested dicts and lists.
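
As a minimal sketch (assuming Instance stays a frozen dataclass and the field remains optional), the suggested change just loosens the annotation; the other fields are elided here:

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass(frozen=True)
class Instance:
    # ... other fields elided ...
    extra_data: Optional[Dict[str, Any]] = None
    """Arbitrary JSON-serializable data attached to the instance, e.g. IFEval's
    instruction_ids (a list of strings) and instruction_kwargs (a list of dicts)."""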
