Full dataset to compare two LLMs #10
As described in the README (https://github.com/WeOpenML/PandaLM#test-data): "The test data is available in ./data/testset-v1.json. We also release the test results of gpt-3.5-turbo and PandaLM-7B in ./data/gpt-3.5-turbo-testset-v1.json and ./data/pandalm-7b-testset-v1.json."
I noticed that the evaluation samples listed in 'data/testset-inference-v1.json' are repeated multiple times in 'data/testset-v1.json'. Could you explain the reason for this? If we want to report PandaLM results when comparing two LLMs, which test data file is the correct one to use?
Thank you for your interest in PandaLM. 'data/testset-inference-v1.json' is the correct choice in your case, as this file contains the unique (instruction, input) pairs for running inference on your tuned models.
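For reference, here is a minimal sketch of how the inference file could be consumed, assuming each record exposes "instruction" and "input" keys as suggested above; the exact JSON schema and the model-calling helpers are hypothetical, not part of the PandaLM API:

```python
import json

# Load the unique (instruction, input) pairs released for inference.
with open("data/testset-inference-v1.json", "r", encoding="utf-8") as f:
    test_set = json.load(f)

for example in test_set:
    instruction = example["instruction"]
    context = example.get("input", "")  # "input" may be empty for some samples
    prompt = instruction if not context else f"{instruction}\n{context}"
    # response_a = model_a.generate(prompt)  # model_a / model_b are your two tuned LLMs (hypothetical helpers)
    # response_b = model_b.generate(prompt)
    # Collect (instruction, input, response_a, response_b) tuples, then pass the
    # paired responses to PandaLM-7B for pairwise judging.
```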
I also notice that there are only 10 examples listed in 'data/pipeline-sanity-check.json', so this does not appear to be the full evaluation dataset. When will the complete one be available in this repository? Thank you!