Enable analyzing evaluators/annotators on data without multiple generator models #293
Currently, `alpaca_eval.main.analyze_evaluators` can only analyze evaluators/annotators on data (like the original AlpacaEval dataset) that contains outputs from more than one generator model. If a dataset contains only a single generator model, computing the (Spearman/Pearson) correlation between the models' win rates under different annotators fails and throws an error, because there are not enough values to correlate.

This PR makes the correlation computation optional: if the win-rate correlation computation fails, `np.nan` values are returned instead and a warning is logged. The remaining metrics are still computed and returned, and no error is thrown. This allows analyzing evaluators on new kinds of data without multiple generator models. Other metrics, such as human agreement, can still be computed correctly in this case (correct me if I am wrong about this).
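For illustration, here is a minimal sketch of the intended fallback behavior. The helper name `safe_winrate_correlations` and its signature are hypothetical, not the actual alpaca_eval internals:

```python
import logging

import numpy as np
from scipy import stats


def safe_winrate_correlations(winrates_a, winrates_b):
    """Correlate two annotators' per-model win rates, falling back to NaN
    (with a warning) when the correlation cannot be computed, e.g. because
    the data contains only a single generator model."""
    try:
        if len(winrates_a) < 2:
            raise ValueError("need win rates for at least two generator models")
        spearman, _ = stats.spearmanr(winrates_a, winrates_b)
        pearson, _ = stats.pearsonr(winrates_a, winrates_b)
    except ValueError as err:
        # Instead of propagating the error, warn and return NaN so the
        # remaining metrics can still be computed and returned.
        logging.warning("Skipping win-rate correlation: %s", err)
        return {"spearman": np.nan, "pearson": np.nan}
    return {"spearman": spearman, "pearson": pearson}


# A dataset with a single generator model no longer raises:
print(safe_winrate_correlations([0.5], [0.6]))
# -> {'spearman': nan, 'pearson': nan} (plus a logged warning)
```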