I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM evaluation harness.
These are my results:
| Model     | SciQ      | PIQA       |
|-----------|-----------|------------|
| forge-bio | 0.788     |            |
| forge-che | 0.821     |            |
| forge-eng | 0.793     |            |
| forge-mat | 0.777     |            |
| forge-phy | 0.761     |            |
| forge-soc | 0.82      |            |
| forge-s1  | 0.787     |            |
| forge-s2  | 0.783     |            |
| forge-s3  | 0.805     |            |
| forge-s4  | 0.86      |            |
| forge-m1  | 0.82      |            |
| forge-m2  | **0.574** | **0.5577** |
| forge-l   | **0.242** |            |
The bolded scores are much lower than the others, and than the values reported in Table 8 of the paper. A quick check of the evaluation logs (`data/eval/forge-m2`) suggests that these are roughly the scores of the m2 checkpoint at iteration 1000, and probably of some very early checkpoint of forge-l.
I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
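One quick way to check for a mix-up would be to hash the downloaded weight files and compare the digests across models (or across fresh re-downloads): identical hashes for supposedly different checkpoints would confirm the suspicion. A minimal sketch; the local directory layout and the `*.bin` filename pattern are assumptions, adjust to wherever the Dropbox archives were unpacked:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local paths for the downloaded checkpoints;
# the "*.bin" glob assumes HF-style weight shards.
for name in ["forge-m2", "forge-l"]:
    for weight_file in sorted(Path(name).glob("*.bin")):
        print(name, weight_file.name, sha256_of_file(weight_file))
```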
Command line

```
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
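For completeness, the same evaluation can also be driven from Python via the harness's `simple_evaluate` entry point (available in recent v0.4.x releases of lm-evaluation-harness); a sketch assuming `forge-bio` is a local HF checkpoint directory or hub ID:

```python
import lm_eval

# Programmatic equivalent of the CLI invocation above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=forge-bio,parallelize=True",
    tasks=["sciq"],
    device="cuda",
)
print(results["results"]["sciq"])
```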