Logprobs eval #62

chanind · 2024-01-12T18:45:40Z

This PR takes some of the ideas from #56 and formalizes them in our architecture, with tests, to make this sort of eval easy moving forward. Specifically, this adds the following:

Adds the awesome make_dataset() changes from @dtch1997 and adds test coverage around the splitting behavior
Adds a Pipeline.calculate_output_logprobs() method based on the code in the Jupyter notebook in Truthful QA benchmark #56, and adds test coverage.
Adds a MultipleChoiceAccuracyEvaluator which implements the accuracy calculation from the Jupyter notebook using logprobs within our Benchmark framework, including test coverage.
Adds the hardcoded TQA data from the notebook.
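
For readers who have not seen the notebook, here is a minimal sketch of one common way to score multiple-choice questions from logprobs: a question counts as correct when the correct answer's summed logprob is higher than that of every incorrect answer. This is an illustration under that assumption, not the code in this PR; `ChoiceLogprobs` and `multiple_choice_accuracy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ChoiceLogprobs:
    """Hypothetical container: summed token logprobs per candidate answer."""

    correct: float
    incorrect: Sequence[float]


def multiple_choice_accuracy(examples: Sequence[ChoiceLogprobs]) -> float:
    """Fraction of examples where the correct answer outscores every incorrect one."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1 for ex in examples if all(ex.correct > lp for lp in ex.incorrect)
    )
    return hits / len(examples)


examples = [
    ChoiceLogprobs(correct=-1.2, incorrect=[-3.4, -2.0]),  # correct answer wins
    ChoiceLogprobs(correct=-5.0, incorrect=[-4.1]),  # an incorrect answer wins
]
accuracy = multiple_choice_accuracy(examples)
```

Comparing summed logprobs favors shorter answers, so a length-normalized score may be preferable for some datasets; which variant the notebook uses is not stated here.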

This PR also changes our EvalPrediction and Evaluator types to support logprobs. Each Evaluator must now specify whether it requires_generation and/or requires_probs, which tells the benchmark what needs to be run. The benchmark will run generation and/or calculate probabilities as required by the evaluators.
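
To illustrate the idea, here is a rough sketch of how a benchmark could decide what to run from those two flags. Only the attribute names `requires_generation` and `requires_probs` come from this PR; the classes and the `plan_work` helper are hypothetical and are not the actual implementation.

```python
from typing import Iterable, Protocol, Tuple


class EvaluatorLike(Protocol):
    """Anything exposing the two flags described above."""

    requires_generation: bool
    requires_probs: bool


class ExactMatchSketch:
    # Hypothetical evaluator that scores generated text.
    requires_generation = True
    requires_probs = False


class MultipleChoiceSketch:
    # Hypothetical evaluator that scores logprobs only.
    requires_generation = False
    requires_probs = True


def plan_work(evaluators: Iterable[EvaluatorLike]) -> Tuple[bool, bool]:
    """Return (run_generation, run_probs): each step runs if any evaluator needs it."""
    evaluators = list(evaluators)
    run_generation = any(e.requires_generation for e in evaluators)
    run_probs = any(e.requires_probs for e in evaluators)
    return run_generation, run_probs
```

With only a logprob-based evaluator, `plan_work` returns `(False, True)`, so the generation step can be skipped entirely.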

I also moved the make_dataset() stuff from data/__init__.py into data/make_dataset.py to make it easier to test.


dtch1997 · 2024-01-15T11:55:53Z

Overall LGTM! Great work implementing the MCQ-style logprob evaluation and expanding the test coverage.

porting dataset handling code from tqa branch

0afeff6

chanind requested a review from dtch1997 January 12, 2024 18:45

adding logprob calculation and adding an evaluator for multiple choice questions

8a87b90

chanind force-pushed the logprobs-eval branch from b83086d to 8a87b90 Compare January 13, 2024 22:09

dtch1997 merged commit 8d07688 into main Jan 15, 2024
2 checks passed

chanind deleted the logprobs-eval branch January 15, 2024 17:30
