I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM evaluation harness.
These are my results:
| Model     | SciQ      | PIQA       |
|-----------|-----------|------------|
| forge-bio | 0.788     |            |
| forge-che | 0.821     |            |
| forge-eng | 0.793     |            |
| forge-mat | 0.777     |            |
| forge-phy | 0.761     |            |
| forge-soc | 0.82      |            |
| forge-s1  | 0.787     |            |
| forge-s2  | 0.783     |            |
| forge-s3  | 0.805     |            |
| forge-s4  | 0.86      |            |
| forge-m1  | 0.82      |            |
| forge-m2  | **0.574** | **0.5577** |
| forge-l   | **0.242** |            |
The bolded scores are much lower than the others, and than the values reported in Table 8 of the paper. A quick check of the evaluation logs (`data/eval/forge-m2`) suggests that these are roughly the scores of the m2 checkpoint at iteration 1000, and probably of some very early checkpoint of forge-l.
I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
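One quick way to check for a mix-up would be to hash the downloaded weight files and compare the digests across models (or across fresh re-downloads): identical hashes for supposedly different checkpoints would confirm the suspicion. A minimal sketch; the local directory layout and the `*.bin` filename pattern are assumptions, adjust to wherever the Dropbox archives were unpacked:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local paths for the downloaded checkpoints;
# the "*.bin" glob assumes HF-style weight shards.
for name in ["forge-m2", "forge-l"]:
    for weight_file in sorted(Path(name).glob("*.bin")):
        print(name, weight_file.name, sha256_of_file(weight_file))
```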
Command line

```
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
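For completeness, the same evaluation can also be driven from Python via the harness's `simple_evaluate` entry point (available in recent v0.4.x releases of lm-evaluation-harness); a sketch assuming `forge-bio` is a local HF checkpoint directory or hub ID:

```python
import lm_eval

# Programmatic equivalent of the CLI invocation above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=forge-bio,parallelize=True",
    tasks=["sciq"],
    device="cuda",
)
print(results["results"]["sciq"])
```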