Question about accuracy #12

Closed
fyf3 opened this issue Apr 23, 2024 · 1 comment

fyf3 commented Apr 23, 2024

Hello, I have some questions about the accuracy of llama2-7b.

In Table 5, the accuracy of llama2-7b-base on MMLU/TYDIQA/BBH is 46.7/52.1/39.8, but when we test llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?

Also, we trained using the provided selected data and the results on MMLU/TYDIQA/BBH are 50.0/54.8/41.1, and the results of our reproduction are 49.3/54.0/42.3, lower than the 50.2/56.2/41.5 in the table.

Can you kindly explain it to me? Thanks!

xiamengzhou (Collaborator) commented Apr 24, 2024

Hi, thanks for your interest in our work and for trying out the experiments!

Inconsistency in Table 5 and your evaluation results

First, note that Table 5 does not report the base model performance of llama-2-7b-hf. It reports the performance obtained when the top 5% of the data is selected using gradients from the llama-2-7b-hf model. The actual performance of the llama-2-7b-hf model is in the first column of Table 10. To investigate the reported results further, I reran the experiments on an H100 GPU and obtained the following results:

| Task | Run on H100 | Reported in paper (Table 10) | Your result |
|---|---|---|---|
| BBH | 38.5 | 38.3 | 40.4 |
| TydiQA | 47.5 | 46.4 | 42.5 |
| MMLU | 45.7 | 45.6 | 46.0 |

As you can see, this new run still does not fully reproduce the results reported in the paper. The paper's results were obtained on different hardware and in a different environment that I no longer have access to, so I am not sure what causes the difference. It is also worth noting that subtle variations in batch size can affect evaluation results, as discussed in this known issue.

Despite this, there remains a discrepancy between my latest run and your results. I have uploaded the code I used to obtain these results here. I hope this will help you in reproducing my results.
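
To make the size of each gap concrete, here is a small sketch that tabulates the differences between the three columns of the table above. The numbers are copied from the table; the dictionary keys and the `gaps` helper are names chosen for this illustration only and are not part of the repository.

```python
# Scores copied from the comparison table above.
# The keys "h100", "paper" and "yours" are illustrative labels.
scores = {
    "BBH":    {"h100": 38.5, "paper": 38.3, "yours": 40.4},
    "TydiQA": {"h100": 47.5, "paper": 46.4, "yours": 42.5},
    "MMLU":   {"h100": 45.7, "paper": 45.6, "yours": 46.0},
}


def gaps(table, a, b):
    """Return {task: table[task][a] - table[task][b]}, rounded to one decimal."""
    return {task: round(row[a] - row[b], 1) for task, row in table.items()}


rerun_vs_paper = gaps(scores, "h100", "paper")
yours_vs_rerun = gaps(scores, "yours", "h100")

for task in scores:
    print(f"{task:7s} rerun-paper={rerun_vs_paper[task]:+.1f}  "
          f"yours-rerun={yours_vs_rerun[task]:+.1f}")

# Task with the largest absolute gap between the two recent runs.
largest = max(yours_vs_rerun, key=lambda t: abs(yours_vs_rerun[t]))
print("largest gap:", largest)
```

On these numbers the rerun stays within about a point of the paper on BBH and MMLU, while the gap between the two recent runs is concentrated in TydiQA.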

provided selected data and the results on the MMLU/TYDIQA/BBH are 50.0/54.8/41.1, and you get 50.2/56.2/41.5 in the table

The main issue seems to be TydiQA. In retrospect, we realized that for TydiQA we selected the third checkpoint instead of the last, because we observed a significant performance degradation at the fourth epoch. This decision is consistent with our approach for random selection, so the paper reports the results at epoch=3. The full results are presented below.

Ideally, a validation set should be used to determine the optimal checkpoints. To make this process more rigorous, one might consider forming a validation set from the training data of this dataset. However, we found that using just 9 examples as a validation set was too noisy for TydiQA.
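
A rough way to see why 9 examples is too few: the sketch below estimates the standard error of a score averaged over n examples. It assumes each example is an independent pass/fail outcome with a success rate of about 55%. Both are simplifying assumptions made for illustration, since the actual metric may be graded rather than binary, and `binomial_std_error` is a hypothetical helper name.

```python
import math


def binomial_std_error(p, n):
    """Standard error of the mean of n independent pass/fail outcomes with success rate p."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumed success rate, roughly in the range of the TydiQA scores in this thread.
p = 0.55

se_9 = binomial_std_error(p, 9)      # validation set of 9 examples
se_900 = binomial_std_error(p, 900)  # a hypothetical set 100x larger

print(f"n=9:   +/- {100 * se_9:.1f} points")
print(f"n=900: +/- {100 * se_900:.1f} points")
```

Under these assumptions the uncertainty with 9 examples is far larger than a gap of a point or two between checkpoints, so such a validation set could not reliably rank them.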

| | Epoch=1 | Epoch=2 | Epoch=3 | Epoch=4 |
|---|---|---|---|---|
| seed=3 | 56.4 | 56.8 | 56.4 | 53.9 |
| seed=6 | 54.5 | 54.0 | 55.5 | 54.6 |
| seed=9 | 54.7 | 54.6 | 56.8 | 54.9 |
| Average | 55.2 | 55.2 | 56.2 | 54.5 |
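
The averaging in the table above can be sketched as follows. The per-seed scores are copied from the table; the variable names are illustrative. Because the inputs are already rounded to one decimal, a recomputed mean can differ from the reported average by about 0.1 (for example at epoch 2). This only reproduces the arithmetic: picking the epoch from test scores is exactly the practice a proper validation set would replace.

```python
# TydiQA scores per seed for epochs 1..4, copied from the table above.
tydiqa = {
    3: [56.4, 56.8, 56.4, 53.9],
    6: [54.5, 54.0, 55.5, 54.6],
    9: [54.7, 54.6, 56.8, 54.9],
}

num_epochs = 4

# Mean over seeds for each epoch (epochs are 1-indexed, lists are 0-indexed).
epoch_means = {
    epoch: sum(runs[epoch - 1] for runs in tydiqa.values()) / len(tydiqa)
    for epoch in range(1, num_epochs + 1)
}

# Epoch with the highest mean score across seeds.
best_epoch = max(epoch_means, key=epoch_means.get)

for epoch, mean in epoch_means.items():
    print(f"epoch={epoch}: mean={mean:.2f}")
print("best epoch:", best_epoch)
```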

the results of our reproduction are 49.3/54.0/42.3, lower than the reported 50.2/56.2/41.5

For TydiQA, the issue might be addressed by selecting the appropriate checkpoint. As for MMLU, it is unclear to me why it scores about 1 point lower. It could be due to large variance across runs 😕

Hope this is helpful, and let me know if you have further questions!
