GPQA scenario and tests added #3068

liamjxu · 2024-10-17T05:44:39Z

No description provided.

liamjxu · 2024-10-17T05:48:36Z

I added both the scenario (GPQA, three subsets, loaded from Hugging Face datasets) and its tests. All tests pass.

@yifanmai Regarding the run spec function, should we create one in lite_run_specs.py or create a new separate file specifically for GPQA?

liamjxu · 2024-10-17T07:34:38Z

I'm currently seeing failures in the automatic checks; I will look into them.

yifanmai · 2024-10-17T16:43:51Z

FYI

The code formatting check failed. To fix the formatting, run:

	black src scripts
	git commit -a

yifanmai · 2024-10-17T16:57:16Z

src/helm/benchmark/scenarios/test_gpqa_scenario.py

+        instances = scenario.get_instances(tmpdir)
+        assert len(instances) == 448
+        assert instances[0].input == Input(
+            text="A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?"


Avoid posting plain text from the dataset, as per the instructions from the dataset card for GPQA. Just check the string length instead.
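
A minimal sketch of a length-based check, so the test pins an instance without quoting any dataset text. The `fingerprint` helper and the placeholder string are hypothetical; the optional SHA-256 digest is an extra idea that is not part of the review comment. A real test would pass `instances[0].input.text` instead of the placeholder.

```python
import hashlib
from typing import Tuple


def fingerprint(text: str) -> Tuple[int, str]:
    """Return (length, SHA-256 hex digest) of a string.

    Hypothetical helper: lets a test identify a string without embedding it.
    """
    return len(text), hashlib.sha256(text.encode("utf-8")).hexdigest()


# Placeholder standing in for a dataset question (assumption, not real GPQA text).
sample_text = "placeholder question text"
length, digest = fingerprint(sample_text)

# The review asks only for a length check; the digest is an optional stricter check.
assert length == 25
assert len(digest) == 64
```

The length check alone is what the comment above requests; a digest catches same-length changes but makes the test harder to update by hand.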

yifanmai · 2024-10-17T17:40:12Z

src/helm/benchmark/scenarios/gpqa_scenario.py

+            instance = Instance(
+                input=input,
+                references=references,
+                split=TRAIN_SPLIT


In the paper, the authors defined five fixed train split examples in this file. All other examples are test split.

You should also filter out the train examples from the test set. Unfortunately, the examples in the train split are also in the Hugging Face dataset, so you may have to do this filtering manually. You could do something like hardcoding the indexes of the train examples, but the indexes might be different depending on the subset.

Also see the discussion here:

idavidrein/gpqa#28
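
A rough sketch of the hardcoded-index approach suggested above, assuming each subset keeps its own set of row indexes to exclude. The subset names and index values are placeholders, not the real positions of the train examples, and `filter_rows` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Placeholder values only: the real indexes would have to be looked up per subset.
EXCLUDED_TRAIN_EXAMPLES: Dict[str, Set[int]] = {
    "subset_a": {0, 2},
    "subset_b": {1},
}


def filter_rows(rows: List[dict], subset: str) -> List[dict]:
    """Drop rows whose position is listed as a train example for this subset."""
    excluded = EXCLUDED_TRAIN_EXAMPLES.get(subset, set())
    return [row for idx, row in enumerate(rows) if idx not in excluded]


rows = [{"id": i} for i in range(4)]
kept = filter_rows(rows, "subset_a")
assert [row["id"] for row in kept] == [1, 3]
```

Keeping a separate index set per subset addresses the concern that the same example may sit at a different position in each subset.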

yifanmai · 2024-10-17T17:43:25Z

src/helm/benchmark/scenarios/gpqa_scenario.py

+                Reference(Output(text=row["Correct Answer"].strip()), tags=[CORRECT_TAG]),
+                Reference(Output(text=row["Incorrect Answer 1"].strip()), tags=[]),
+                Reference(Output(text=row["Incorrect Answer 2"].strip()), tags=[]),
+                Reference(Output(text=row["Incorrect Answer 3"].strip()), tags=[])


This always sets the correct answer to A, which is not great. The authors shuffle the choices deterministically, and you should do something similar.
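
One way to shuffle deterministically, sketched with a seeded `random.Random` instance per row. This is an illustration, not the GPQA authors' procedure: the `shuffled_choices` helper, the seeding scheme, and the example row are assumptions; only the column names come from the diff above.

```python
import random
from typing import Dict, List, Tuple


def shuffled_choices(row: Dict[str, str], row_index: int, seed: int = 0) -> List[Tuple[str, bool]]:
    """Return (answer text, is_correct) pairs in a reproducible shuffled order."""
    choices = [
        (row["Correct Answer"].strip(), True),
        (row["Incorrect Answer 1"].strip(), False),
        (row["Incorrect Answer 2"].strip(), False),
        (row["Incorrect Answer 3"].strip(), False),
    ]
    # A private Random instance keeps the order independent of global random state.
    rng = random.Random(seed + row_index)
    rng.shuffle(choices)
    return choices


# Made-up row for illustration; not taken from the dataset.
example_row = {
    "Correct Answer": " right ",
    "Incorrect Answer 1": "wrong 1",
    "Incorrect Answer 2": "wrong 2",
    "Incorrect Answer 3": "wrong 3",
}
first = shuffled_choices(example_row, row_index=7)
second = shuffled_choices(example_row, row_index=7)
assert first == second  # same seed and index give the same order
assert sum(is_correct for _, is_correct in first) == 1
```

Each pair could then be turned into a `Reference`, attaching `CORRECT_TAG` only where the flag is true, so the correct answer no longer always lands in position A.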


yifanmai · 2024-10-21T17:37:13Z

src/helm/benchmark/scenarios/gpqa_scenario.py

+            if idx in EXCLUDED_TRAIN_EXAMPLES[self.subset]:
+                continue


This has no train split examples currently; let's address this in a different PR.

GPQA scenario and tests added

85743d7

liamjxu requested a review from yifanmai October 17, 2024 05:45

liamjxu linked an issue Oct 17, 2024 that may be closed by this pull request

Add GPQA scenario #3017

Closed

formatting

0116afb

yifanmai requested changes Oct 17, 2024


liamjxu force-pushed the jialiang/add_gpqa branch from 4c91a01 to eab0e56 Compare October 21, 2024 06:52

Changed ways of checking texts, randomized references, excluded train examples, reformatted.

e445043

liamjxu force-pushed the jialiang/add_gpqa branch from eab0e56 to e445043 Compare October 21, 2024 07:41

yifanmai self-requested a review October 21, 2024 17:34

yifanmai approved these changes Oct 21, 2024


yifanmai merged commit f5ca627 into main Oct 21, 2024
12 checks passed

yifanmai deleted the jialiang/add_gpqa branch October 21, 2024 17:49
