From c7071622770d20106cbf2ec26a37fca31d3db34b Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Tue, 15 Feb 2022 15:59:08 -0800
Subject: [PATCH 1/3] Add PW-Turn evaluation metrics and model comparisons to documentation

---
 .../tasks/pairwise_per_turn_eval/README.md | 34 ++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index a247d686eb8..a28aa6a75b5 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -36,4 +36,36 @@ To change the worker selection criteria for onboarding, see `handleOnboardingSub
 
 ## Analysis
 Run `analysis/compile_results.py` to compile and save statistics about collected human+model chats. Set `--results-folders` to the value of `mephisto.blueprint.chat_data_folder` used when running HITs. Specifically, the analysis file:
-- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
\ No newline at end of file
+- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
+
+## Reproducing Paper Results
+This section contains instructions for reproducing the results in our paper, in which we run 3 sets of model comparisons:
+
+### Model Comparisons
+To run a pairwise model comparison annotation task, create a `.yaml` config using the template provided by `hydra_configs/conf/example_model_comparison.yaml`. Set the models and number of conversations to collect ratings for in the `mephisto.blueprint.conversations_needed_string` field, following the format `${model_A}:${model_B}:${num_conversations}`. For example, `"blender_90M:blender_3B:10"` compares the `blender_90M` model with the `blender_3B` model and collects 10 conversations.
+
+Here are the model comparisons run in the paper, and the corresponding values for `conversations_needed_string`:
+- Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
+- Generation Length (BlenderBot3B vs. BlenderBot3B-0): `"blender_3B:blender_3B_beam_min_length_0:60"`
+- Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
+
+To run a crowdsourcing task, run the following with your modified parameters:
+```
+CONF=example_model_comparison && # Replace with your conf
+REQUESTER_NAME=mturk_sandbox && # Replace with your Mephisto requester
+python run.py \
+conf=${CONF} \
+mephisto.provider.requester_name=${REQUESTER_NAME}
+```
+
+As described above, you can also set config fields directly in the command line.
+
+### Evaluation Metric
+To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.annotation_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding annotation questions:
+- Engagingness: “Which next response from your partner would you prefer in a long conversation?”
+- Humanness: “Which next response from your partner sounds more human?”
+- Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
+
+You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. The example provided in `onboard_task_data__engaging.json` requires users to select the most engaging response.
+
+We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared, and the evaluation metric.
\ No newline at end of file

From 116088d56b61c11df439d2f21f61a1d67f041402 Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Thu, 17 Feb 2022 22:15:27 -0800
Subject: [PATCH 2/3] Update PR with feedback

---
 parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index a28aa6a75b5..b70449a4f6c 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -61,11 +61,11 @@ mephisto.provider.requester_name=${REQUESTER_NAME}
 As described above, you can also set config fields directly in the command line.
 
 ### Evaluation Metric
-To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.annotation_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding annotation questions:
+To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.task_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding task questions:
 - Engagingness: “Which next response from your partner would you prefer in a long conversation?”
 - Humanness: “Which next response from your partner sounds more human?”
 - Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
 
-You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. The example provided in `onboard_task_data__engaging.json` requires users to select the most engaging response.
+You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. We provide examples for all 3 eval metrics described above in the `task_config/` folder. The example provided in `task_config/onboard_task_data__engaging.json` requires users to select the most engaging response. To change the question asked during onboarding, set `mephisto.blueprint.annotation_question`.
 
 We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared, and the evaluation metric.
\ No newline at end of file

From e4e228ed142ba92d1566d0e2b04a6ee267c80971 Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Thu, 17 Feb 2022 22:16:29 -0800
Subject: [PATCH 3/3] Correct name for BB-0

---
 parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index b70449a4f6c..4d5ee371d36 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -46,7 +46,7 @@ To run a pairwise model comparison annotation task, create a `.yaml` config usin
 
 Here are the model comparisons run in the paper, and the corresponding values for `conversations_needed_string`:
 - Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
-- Generation Length (BlenderBot3B vs. BlenderBot3B-0): `"blender_3B:blender_3B_beam_min_length_0:60"`
+- Generation Length (BlenderBot3B vs. BlenderBot3B-M0): `"blender_3B:blender_3B_beam_min_length_0:60"`
 - Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
 
 To run a crowdsourcing task, run the following with your modified parameters:
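Putting the patched README together, a comparison config that sets the fields discussed above could look roughly like the sketch below. The field names are taken from the README text; the nesting and the concrete values are illustrative assumptions, and the shipped template `hydra_configs/conf/example_model_comparison.yaml`, which contains additional required settings not shown here, remains the file to copy and edit.

```
# Illustrative sketch only: start from hydra_configs/conf/example_model_comparison.yaml
# and edit these fields; other required settings from the template are omitted here.
mephisto:
  blueprint:
    # Format: ${model_A}:${model_B}:${num_conversations}
    conversations_needed_string: "blender_90M:blender_3B:60"
    # Question annotators answer when choosing the better response (the evaluation metric)
    task_question: "Which next response from your partner would you prefer in a long conversation?"
    # Onboarding data (placeholder path) and the question asked during onboarding
    onboard_task_data_path: /path/to/task_config/onboard_task_data__engaging.json
    annotation_question: "Which next response from your partner would you prefer in a long conversation?"
  task:
    # Encode the models compared and the evaluation metric in the task name
    task_name: pw_turn_blender90M_vs_blender3B_engagingness
  provider:
    requester_name: mturk_sandbox
```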
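Once HITs have been collected, the analysis step described at the top of the README comes down to a single command along these lines; the folder below is a placeholder for whatever `mephisto.blueprint.chat_data_folder` was set to in your run, and any additional flags the script supports are omitted.

```
# Compile and save statistics about the collected human+model chats.
# Replace the path with the chat_data_folder used when running HITs.
python analysis/compile_results.py \
--results-folders /path/to/chat_data_folder
```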