Question about accuracy #12
Hi -- Thanks for the interest in our work, and for trying out the experiments!
Firstly, I would like to point out that Table 5 does not present the base model performance of
You can see that this new run still does not fully reproduce the results reported in the paper. Since the results I obtained now and those reported in the paper were produced on different hardware and environments that I no longer have access to, I am not sure what the cause could be. Also, it's worth noting that subtle variations in batch size can affect evaluation results, as discussed in this known issue. Despite this, there remains a discrepancy between my latest run and your results. I have uploaded the code I used to obtain these results here. I hope this will help you in reproducing my results.
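As a rough, self-contained illustration (using `gpt2` as a small stand-in model, not the actual evaluation harness from the paper), here is one way padding and batching can shift the logits that evaluation scores depend on:

```python
# Illustrative sketch only: gpt2 stands in for llama2-7b, and this is not the
# evaluation code used in the paper. It shows that a batched, padded forward
# pass can produce slightly different logits than per-example forward passes,
# which is one reason scores can move with the evaluation batch size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["The capital of France is", "2 + 2 ="]

with torch.no_grad():
    # Batched forward pass: the shorter prompt is left-padded.
    batch = tok(prompts, return_tensors="pt", padding=True)
    batched_logits = model(**batch).logits[:, -1, :]

    # Per-example forward passes: no padding at all.
    single_logits = torch.cat(
        [model(**tok(p, return_tensors="pt")).logits[:, -1, :] for p in prompts]
    )

# Differences come from padding/position handling and batched numerics; they
# are usually small, but can flip near-tie predictions in downstream evals.
print("max |diff| in next-token logits:",
      (batched_logits - single_logits).abs().max().item())
```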
The main issue seems to be with TydiQA. In retrospect, we realized that for TydiQA we selected the third checkpoint instead of the last, because we observed a significant performance degradation after the fourth epoch. This decision is consistent with our approach for random selection, hence we reported the results at epoch 3 in the paper. The full results are presented below. Ideally, a validation set should be used to determine the optimal checkpoint. To make this process more rigorous, one might consider forming a validation set from the training data of this dataset. However, we found that using just 9 examples as a validation set was too noisy for TydiQA.
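As a rough sketch of what that selection step could look like in code (the checkpoint paths, scores, and scoring function below are hypothetical placeholders, not our actual pipeline or numbers):

```python
# Hypothetical sketch: pick the epoch whose checkpoint scores best on a tiny
# held-out split. Paths and the scoring function are placeholders.
import random

checkpoints = {
    1: "out/tydiqa/epoch_1",
    2: "out/tydiqa/epoch_2",
    3: "out/tydiqa/epoch_3",
    4: "out/tydiqa/epoch_4",
}

def validation_score(ckpt_path: str, val_examples: list) -> float:
    """Placeholder for evaluating a checkpoint on the validation examples.
    With only ~9 examples, the score moves in steps of roughly 11 points,
    which is why such a tiny split gives a noisy selection signal."""
    random.seed(ckpt_path)  # stand-in for a real evaluation
    correct = sum(random.random() < 0.6 for _ in val_examples)
    return 100.0 * correct / len(val_examples)

val_examples = [f"example_{i}" for i in range(9)]  # tiny split, as in the thread
scores = {ep: validation_score(path, val_examples) for ep, path in checkpoints.items()}
best_epoch = max(scores, key=scores.get)
print(scores, "-> selected epoch", best_epoch)
```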
For TydiQA, the issue might be addressed by selecting the appropriate checkpoint. As for MMLU, the reason it scores 1 point lower is unclear to me; it could be due to a large variance across runs 😕 Hope this is helpful, and let me know if you have further questions!
Hello, I have some questions about the accuracy of llama2-7b.
In Table 5, the accuracy of llama2-7b-base on MMLU/TYDIQA/BBH is 46.7/52.1/39.8, but when we evaluate llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?
Also, we trained using the provided selected data and the results on MMLU/TYDIQA/BBH are 50.0/54.8/41.1, while the results of our own reproduction are 49.3/54.0/42.3, lower than the 50.2/56.2/41.5 reported in the table.
Can you kindly explain it to me? Thanks!