Full dataset to compare two LLMs #10
As described in the README (https://github.com/WeOpenML/PandaLM#test-data): "The test data is available in ./data/testset-v1.json. We also release the test results of gpt-3.5-turbo and PandaLM-7B in ./data/gpt-3.5-turbo-testset-v1.json and ./data/pandalm-7b-testset-v1.json."
I noticed that the evaluation samples listed in 'data/testset-inference-v1.json' are repeated multiple times in 'data/testset-v1.json'. Could you explain the reason for this? If we want to report PandaLM results when comparing two LLMs, which test data file is the correct one to use?
Thank you for your interest in PandaLM. 'data/testset-inference-v1.json' is the correct choice in your case, as this file contains the unique (instruction, input) pairs for running inference on your tuned models.
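For reference, here is a minimal sketch of how the inference file could be consumed, assuming each record exposes "instruction" and "input" keys as suggested above; the exact JSON schema and the model-calling helpers are hypothetical, not part of the PandaLM API:

```python
import json

# Load the unique (instruction, input) pairs released for inference.
with open("data/testset-inference-v1.json", "r", encoding="utf-8") as f:
    test_set = json.load(f)

for example in test_set:
    instruction = example["instruction"]
    context = example.get("input", "")  # "input" may be empty for some samples
    prompt = instruction if not context else f"{instruction}\n{context}"
    # response_a = model_a.generate(prompt)  # model_a / model_b are your two tuned LLMs (hypothetical helpers)
    # response_b = model_b.generate(prompt)
    # Collect (instruction, input, response_a, response_b) tuples, then pass the
    # paired responses to PandaLM-7B for pairwise judging.
```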
I also notice that there are only 10 examples listed in 'data/pipeline-sanity-check.json', so this does not appear to be the full evaluation dataset. When will the complete one be available in this repository? Thank you!