Question about accuracy #12
Hi -- Thanks for the interest in our work, and for trying out the experiments!
Firstly, I would like to point out that Table 5 does not present the base model performance of
You can see that this new run still does not fully reproduce the results reported in the paper. Since the results I obtained now and those reported in the paper were produced on different hardware and environments that I no longer have access to, I am not sure what the cause could be. Also, it's worth noting that subtle variations in batch size can affect evaluation results, as discussed in this known issue. Despite this, there remains a discrepancy between my latest run and your results. I have uploaded the code I used to obtain these results here. I hope this will help you in reproducing my results.
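As a rough, self-contained illustration (using `gpt2` as a small stand-in model, not the actual evaluation harness from the paper), here is one way padding and batching can shift the logits that evaluation scores depend on:

```python
# Illustrative sketch only: gpt2 stands in for llama2-7b, and this is not the
# evaluation code used in the paper. It shows that a batched, padded forward
# pass can produce slightly different logits than per-example forward passes,
# which is one reason scores can move with the evaluation batch size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["The capital of France is", "2 + 2 ="]

with torch.no_grad():
    # Batched forward pass: the shorter prompt is left-padded.
    batch = tok(prompts, return_tensors="pt", padding=True)
    batched_logits = model(**batch).logits[:, -1, :]

    # Per-example forward passes: no padding at all.
    single_logits = torch.cat(
        [model(**tok(p, return_tensors="pt")).logits[:, -1, :] for p in prompts]
    )

# Differences come from padding/position handling and batched numerics; they
# are usually small, but can flip near-tie predictions in downstream evals.
print("max |diff| in next-token logits:",
      (batched_logits - single_logits).abs().max().item())
```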
The main issue seems to be with TydiQA. In retrospect, we realized that for TydiQA we selected the third checkpoint instead of the last, because we observed a significant performance degradation after the fourth epoch. This decision is consistent with our approach for random selection, hence we reported the results at epoch 3 in the paper. The full results are presented below. Ideally, a validation set should be used to determine the optimal checkpoint. To make this process more rigorous, one might consider forming a validation set from the training data of this dataset. However, we found that using just 9 examples as a validation set was too noisy for TydiQA.
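As a rough sketch of what that selection step could look like in code (the checkpoint paths, scores, and scoring function below are hypothetical placeholders, not our actual pipeline or numbers):

```python
# Hypothetical sketch: pick the epoch whose checkpoint scores best on a tiny
# held-out split. Paths and the scoring function are placeholders.
import random

checkpoints = {
    1: "out/tydiqa/epoch_1",
    2: "out/tydiqa/epoch_2",
    3: "out/tydiqa/epoch_3",
    4: "out/tydiqa/epoch_4",
}

def validation_score(ckpt_path: str, val_examples: list) -> float:
    """Placeholder for evaluating a checkpoint on the validation examples.
    With only ~9 examples, the score moves in steps of roughly 11 points,
    which is why such a tiny split gives a noisy selection signal."""
    random.seed(ckpt_path)  # stand-in for a real evaluation
    correct = sum(random.random() < 0.6 for _ in val_examples)
    return 100.0 * correct / len(val_examples)

val_examples = [f"example_{i}" for i in range(9)]  # tiny split, as in the thread
scores = {ep: validation_score(path, val_examples) for ep, path in checkpoints.items()}
best_epoch = max(scores, key=scores.get)
print(scores, "-> selected epoch", best_epoch)
```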
For TydiQA, the issue might be addressed by selecting the appropriate checkpoint. As for MMLU, the reason it scores 1 point lower is unclear to me; it could be due to a large variance across runs 😕 Hope this is helpful, and let me know if you have further questions!
Hello, I have some questions about the accuracy of llama2-7b.
In Table 5, the accuracy of llama2-7b-base on MMLU/TYDIQA/BBH is 46.7/52.1/39.8, but when we evaluate llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?
Also, we trained using the provided selected data and the results on MMLU/TYDIQA/BBH are 50.0/54.8/41.1, while the results of our own reproduction are 49.3/54.0/42.3, lower than the 50.2/56.2/41.5 reported in the table.
Can you kindly explain it to me? Thanks!