Human baseline #3
Hi! Is there a known human baseline for this dataset (open and closed book)? Or maybe the required score to pass the exam?

Nope, but you can use 60 out of 100 as a reference score, which is the passing score for the medical exam.

Thank you for the info! So, just to make sure we are aligned here: does that mean 60% answering accuracy for the US, TW, and MC datasets?

Yep, 60% accuracy can be considered the human passing score. For the US dataset, this score is quite hard to reach.

Can you elaborate on your reasoning here, please? It seems a somewhat flawed heuristic to assume that human performance on this particular QA dataset [which, to my understanding, is generated?] would be approximately equivalent to the outcome of humans being tested in actual exams.