
Full dataset to compare two LLMs #10

Open
xiuzbl opened this issue May 13, 2023 · 3 comments

Comments

xiuzbl commented May 13, 2023

I notice that there are only 10 examples listed in 'data/pipeline-sanity-check.json'; this does not appear to be the full evaluation dataset. When will you provide the complete one in this repository? Thank you!

zhuohaoyu (Member) commented

As described in the Test data section of the README (https://github.com/WeOpenML/PandaLM#test-data), the test data is available in ./data/testset-v1.json. We also release the test results of gpt-3.5-turbo and PandaLM-7B in ./data/gpt-3.5-turbo-testset-v1.json and ./data/pandalm-7b-testset-v1.json.
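
As a quick sanity check of what each released file contains, a short snippet like the one below can load the files named above and print their sizes and the fields of one entry. This is only an illustration and assumes each file is a JSON list of objects; nothing about the schema is assumed beyond what is printed at runtime.

```python
import json

# Paths taken from the comment above; run from the repository root.
paths = [
    "data/testset-v1.json",
    "data/gpt-3.5-turbo-testset-v1.json",
    "data/pandalm-7b-testset-v1.json",
]

for path in paths:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    # Report how many records each file holds and the fields of the first one.
    print(f"{path}: {len(records)} records, fields: {sorted(records[0].keys())}")
```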


xiuzbl commented Jun 2, 2023

I noticed that the evaluation samples listed in 'data/testset-inference-v1.json' are repeated multiple times in 'data/testset-v1.json'. Could you explain the reason for this? If we want to report PandaLM results when comparing two LLMs, which test data file should we use?

zhuohaoyu (Member) commented

> I noticed that the evaluation samples listed in 'data/testset-inference-v1.json' are repeated multiple times in 'data/testset-v1.json'. Could you explain the reason for this? If we want to report PandaLM results when comparing two LLMs, which test data file should we use?

Thank you for your interest in PandaLM. 'data/testset-inference-v1.json' is the correct choice in your case, as this file contains unique (instruction, input) pairs for running inference with your tuned models. 'data/testset-v1.json' contains human-annotated (instruction, input, response1, response2, label) tuples and is intended for validating the evaluation ability of PandaLM. The responses are generated by different models given the same context, so the (instruction, input) pairs in those tuples may repeat.
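
To make the distinction concrete, here is a minimal sketch (not from the repository) of how 'data/testset-inference-v1.json' could feed a pairwise comparison of two tuned models. The 'instruction'/'input' field names and the generate_response stub are assumptions for illustration; check the actual schema and the PandaLM inference interface against the repository code.

```python
import json

def generate_response(model_name, instruction, context):
    """Placeholder: replace with your own tuned model's generation call."""
    return f"[{model_name} answer for: {instruction[:40]}]"

# Unique (instruction, input) pairs meant for inference with your own models.
with open("data/testset-inference-v1.json", "r", encoding="utf-8") as f:
    inference_set = json.load(f)

comparisons = []
for example in inference_set:
    instruction = example["instruction"]   # assumed field name
    context = example.get("input", "")     # assumed field name
    comparisons.append({
        "instruction": instruction,
        "input": context,
        # One response per model for the same context, mirroring the
        # (instruction, input, response1, response2) layout of
        # testset-v1.json, minus the human label.
        "response1": generate_response("model_a", instruction, context),
        "response2": generate_response("model_b", instruction, context),
    })

# 'comparisons' can then be judged pairwise by PandaLM-7B.
```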

zhuohaoyu reopened this Jun 2, 2023