From c7071622770d20106cbf2ec26a37fca31d3db34b Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Tue, 15 Feb 2022 15:59:08 -0800
Subject: [PATCH 1/3] Add PW-Turn evaluation metrics and model comparisons to documentation

---
 .../tasks/pairwise_per_turn_eval/README.md | 34 ++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index a247d686eb8..a28aa6a75b5 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -36,4 +36,36 @@ To change the worker selection criteria for onboarding, see `handleOnboardingSub
 
 ## Analysis
 Run `analysis/compile_results.py` to compile and save statistics about collected human+model chats. Set `--results-folders` to the value of `mephisto.blueprint.chat_data_folder` used when running HITs. Specifically, the analysis file:
-- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
\ No newline at end of file
+- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
+
+## Reproducing Paper Results
+This section contains instructions for reproducing the results in our paper, in which we run 3 sets of model comparisons:
+
+### Model Comparisons
+To run a pairwise model comparison annotation task, create a `.yaml` config using the template provided by `hydra_configs/conf/example_model_comparison.yaml`. Set the models and number of conversations to collect ratings for in the `mephisto.blueprint.conversations_needed_string` field, following the format `${model_A}:${model_B}:${num_conversations}`. For example, `"blender_90M:blender_3B:10"` compares the `blender_90M` model with the `blender_3B` model and collects 10 conversations.
+
+Here are the model comparisons run in the paper, and the corresponding values for `conversations_needed_string`:
+- Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
+- Generation Length (BlenderBot3B vs. BlenderBot3B-0): `"blender_3B:blender_3B_beam_min_length_0:60"`
+- Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
+
+To run a crowdsourcing task, run the following with your modified parameters:
+```
+CONF=example_model_comparison && # Replace with your conf
+REQUESTER_NAME=mturk_sandbox && # Replace with your Mephisto requester
+python run.py \
+conf=${CONF} \
+mephisto.provider.requester_name=${REQUESTER_NAME}
+```
+
+As described above, you can also set config fields directly in the command line.
+
+### Evaluation Metric
+To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.annotation_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding annotation questions:
+- Engagingness: “Which next response from your partner would you prefer in a long conversation?”
+- Humanness: “Which next response from your partner sounds more human?”
+- Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
+
+You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. The example provided in `onboard_task_data__engaging.json` requires users to select the most engaging response.
+
+We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared, and the evaluation metric.
\ No newline at end of file

From 116088d56b61c11df439d2f21f61a1d67f041402 Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Thu, 17 Feb 2022 22:15:27 -0800
Subject: [PATCH 2/3] Update PR with feedback

---
 parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index a28aa6a75b5..b70449a4f6c 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -61,11 +61,11 @@ mephisto.provider.requester_name=${REQUESTER_NAME}
 As described above, you can also set config fields directly in the command line.
 
 ### Evaluation Metric
-To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.annotation_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding annotation questions:
+To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.task_question` field to best reflect the evaluation metric you want to use. These are the metrics we used in our paper, and the corresponding task questions:
 - Engagingness: “Which next response from your partner would you prefer in a long conversation?”
 - Humanness: “Which next response from your partner sounds more human?”
 - Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
 
-You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. The example provided in `onboard_task_data__engaging.json` requires users to select the most engaging response.
+You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. We provide examples for all 3 eval metrics described above in the `task_config/` folder. The example provided in `task_config/onboard_task_data__engaging.json` requires users to select the most engaging response. To change the question asked during onboarding, set `mephisto.blueprint.annotation_question`.
 
 We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared, and the evaluation metric.
\ No newline at end of file

From e4e228ed142ba92d1566d0e2b04a6ee267c80971 Mon Sep 17 00:00:00 2001
From: Rebecca-Qian
Date: Thu, 17 Feb 2022 22:16:29 -0800
Subject: [PATCH 3/3] Correct name for BB-0

---
 parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
index b70449a4f6c..4d5ee371d36 100644
--- a/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
+++ b/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -46,7 +46,7 @@ To run a pairwise model comparison annotation task, create a `.yaml` config usin
 
 Here are the model comparisons run in the paper, and the corresponding values for `conversations_needed_string`:
 - Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
-- Generation Length (BlenderBot3B vs. BlenderBot3B-0): `"blender_3B:blender_3B_beam_min_length_0:60"`
+- Generation Length (BlenderBot3B vs. BlenderBot3B-M0): `"blender_3B:blender_3B_beam_min_length_0:60"`
 - Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
 
 To run a crowdsourcing task, run the following with your modified parameters:
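Putting the patched README together, a comparison config that sets the fields discussed above could look roughly like the sketch below. The field names are taken from the README text; the nesting and the concrete values are illustrative assumptions, and the shipped template `hydra_configs/conf/example_model_comparison.yaml`, which contains additional required settings not shown here, remains the file to copy and edit.

```
# Illustrative sketch only: start from hydra_configs/conf/example_model_comparison.yaml
# and edit these fields; other required settings from the template are omitted here.
mephisto:
  blueprint:
    # Format: ${model_A}:${model_B}:${num_conversations}
    conversations_needed_string: "blender_90M:blender_3B:60"
    # Question annotators answer when choosing the better response (the evaluation metric)
    task_question: "Which next response from your partner would you prefer in a long conversation?"
    # Onboarding data (placeholder path) and the question asked during onboarding
    onboard_task_data_path: /path/to/task_config/onboard_task_data__engaging.json
    annotation_question: "Which next response from your partner would you prefer in a long conversation?"
  task:
    # Encode the models compared and the evaluation metric in the task name
    task_name: pw_turn_blender90M_vs_blender3B_engagingness
  provider:
    requester_name: mturk_sandbox
```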
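Once HITs have been collected, the analysis step described at the top of the README comes down to a single command along these lines; the folder below is a placeholder for whatever `mephisto.blueprint.chat_data_folder` was set to in your run, and any additional flags the script supports are omitted.

```
# Compile and save statistics about the collected human+model chats.
# Replace the path with the chat_data_folder used when running HITs.
python analysis/compile_results.py \
--results-folders /path/to/chat_data_folder
```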