Add PW-Turn evaluation metrics and model comparisons to documentation #4362

Merged: 3 commits, Feb 22, 2022
34 changes: 33 additions & 1 deletion parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -36,4 +36,36 @@ To change the worker selection criteria for onboarding, see `handleOnboardingSub

## Analysis
Run `analysis/compile_results.py` to compile and save statistics about collected human+model chats. Set `--results-folders` to the value of `mephisto.blueprint.chat_data_folder` used when running HITs. Specifically, the analysis file:
- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
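
For example, assuming your chats were saved to `/path/to/chat_data_folder` (the value used for `mephisto.blueprint.chat_data_folder`), the analysis could be run from this task's directory roughly as follows:
```
# Sketch: compile and save statistics for the collected human+model chats.
# Replace the path below with your mephisto.blueprint.chat_data_folder value.
python analysis/compile_results.py \
--results-folders /path/to/chat_data_folder
```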

## Reproducing Paper Results
This section contains instructions for reproducing the results in our paper, in which we run 3 sets of model comparisons.

### Model Comparisons
To run a pairwise model comparison annotation task, create a `.yaml` config using the template provided by `hydra_configs/conf/example_model_comparison.yaml`. Set the models and number of conversations to collect ratings for in the `mephisto.blueprint.conversations_needed_string` field, following the format `${model_A}:${model_B}:${num_conversations}`. For example, `"blender_90M:blender_3B:10"` compares the `blender_90M` model with the `blender_3B` model and collects 10 conversations.

Here are the model comparisons run in the paper and the corresponding values for `conversations_needed_string` (a setup sketch follows the list):
- Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
- Generation Length (BlenderBot3B vs. BlenderBot3B-M0): `"blender_3B:blender_3B_beam_min_length_0:60"`
- Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
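One way to set up a comparison is to copy the provided template and edit the field in the copy before launching; a rough sketch for the Size comparison (the new file name is just an illustration) is:
```
# Hypothetical workflow: copy the example config, then edit
# mephisto.blueprint.conversations_needed_string in the copy to "blender_90M:blender_3B:60",
# and launch with CONF=size_comparison as shown below.
cp hydra_configs/conf/example_model_comparison.yaml hydra_configs/conf/size_comparison.yaml
```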

To run a crowdsourcing task, run the following with your modified parameters:
```
CONF=example_model_comparison && # Replace with your conf
REQUESTER_NAME=mturk_sandbox && # Replace with your Mephisto requester
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME}
```

As described above, you can also set config fields directly on the command line.
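
For instance, the `conversations_needed_string` blueprint field can be overridden at launch time using the same Hydra-style syntax (the value here is just the small example from above):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.conversations_needed_string="blender_90M:blender_3B:10"
```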

### Evaluation Metric
To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.task_question` field to reflect the evaluation metric you want to use. These are the metrics we used in our paper and the corresponding task questions (an example override follows the list):
- Engagingness: “Which next response from your partner would you prefer in a long conversation?”
- Humanness: “Which next response from your partner sounds more human?”
- Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
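
For example, to use the Humanness wording, the task question can be set either in your `.yaml` config or as a command-line override (a sketch using the same syntax as above):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.task_question="Which next response from your partner sounds more human?"
```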

You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. We provide examples for all 3 eval metrics described above in the `task_config/` folder. The example provided in `task_config/onboard_task_data__engaging.json` requires users to select the most engaging response. To change the question asked during onboarding, set `mephisto.blueprint.annotation_question` (see the combined example at the end of this section).

We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared and the evaluation metric.
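
Putting these suggestions together, a launch command that points onboarding at the provided engagingness example and sets a descriptive task name might look roughly like this (the onboarding question wording, the path placeholder, and the task name are illustrative, not values used in the paper):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.onboard_task_data_path=/path/to/task_config/onboard_task_data__engaging.json \
mephisto.blueprint.annotation_question="Which next response from your partner is more engaging?" \
mephisto.task.task_name=per_turn_eval__blender_3B_vs_blender_90M__engagingness
```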