Add PW-Turn evaluation metrics and model comparisons to documentation #4362

Merged: 3 commits, Feb 22, 2022
34 changes: 33 additions & 1 deletion parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md
@@ -36,4 +36,36 @@ To change the worker selection criteria for onboarding, see `handleOnboardingSub

## Analysis
Run `analysis/compile_results.py` to compile and save statistics about collected human+model chats. Set `--results-folders` to the value of `mephisto.blueprint.chat_data_folder` used when running HITs. Specifically, the analysis file:
- Has most of the features from `parlai/crowdsourcing/tasks/model_chat`'s analysis script (doesn't include analysis of annotation buckets, since it isn't used here)
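
For example, assuming your chats were saved to `/path/to/chat_data_folder` (the value used for `mephisto.blueprint.chat_data_folder`), the analysis could be run from this task's directory roughly as follows:
```
# Sketch: compile and save statistics for the collected human+model chats.
# Replace the path below with your mephisto.blueprint.chat_data_folder value.
python analysis/compile_results.py \
--results-folders /path/to/chat_data_folder
```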

## Reproducing Paper Results
This section contains instructions for reproducing the results in our paper, in which we run 3 sets of model comparisons.

### Model Comparisons
To run a pairwise model comparison annotation task, create a `.yaml` config using the template provided by `hydra_configs/conf/example_model_comparison.yaml`. Set the models and number of conversations to collect ratings for in the `mephisto.blueprint.conversations_needed_string` field, following the format `${model_A}:${model_B}:${num_conversations}`. For example, `"blender_90M:blender_3B:10"` compares the `blender_90M` model with the `blender_3B` model and collects 10 conversations.

Here are the model comparisons run in the paper and the corresponding values for `conversations_needed_string` (a setup sketch follows the list):
- Size (BlenderBot3B vs. BlenderBot90M): `"blender_90M:blender_3B:60"`
- Generation Length (BlenderBot3B vs. BlenderBot3B-M0): `"blender_3B:blender_3B_beam_min_length_0:60"`
- Fine-tuning (BlenderBot3B vs. Reddit3B): `"blender_3B:reddit_3B:60"`
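One way to set up a comparison is to copy the provided template and edit the field in the copy before launching; a rough sketch for the Size comparison (the new file name is just an illustration) is:
```
# Hypothetical workflow: copy the example config, then edit
# mephisto.blueprint.conversations_needed_string in the copy to "blender_90M:blender_3B:60",
# and launch with CONF=size_comparison as shown below.
cp hydra_configs/conf/example_model_comparison.yaml hydra_configs/conf/size_comparison.yaml
```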

To run a crowdsourcing task, run the following with your modified parameters:
```
CONF=example_model_comparison && # Replace with your conf
REQUESTER_NAME=mturk_sandbox && # Replace with your Mephisto requester
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME}
```

As described above, you can also set config fields directly on the command line.
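
For instance, the `conversations_needed_string` blueprint field can be overridden at launch time using the same Hydra-style syntax (the value here is just the small example from above):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.conversations_needed_string="blender_90M:blender_3B:10"
```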

### Evaluation Metric
To change the metric that annotators use to select the better conversational response, change the `mephisto.blueprint.task_question` field to reflect the evaluation metric you want to use. These are the metrics we used in our paper and the corresponding task questions (an example override follows the list):
- Engagingness: “Which next response from your partner would you prefer in a long conversation?”
- Humanness: “Which next response from your partner sounds more human?”
- Interestingness: “If you had to say one of these responses is interesting and one is boring, which would you say is more interesting?”
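
For example, to use the Humanness wording, the task question can be set either in your `.yaml` config or as a command-line override (a sketch using the same syntax as above):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.task_question="Which next response from your partner sounds more human?"
```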

You can also change the onboarding task to better reflect your evaluation metric. To do this, create a `.json` file with onboarding task questions and correct responses, and set `mephisto.blueprint.onboard_task_data_path` in the config to that filepath. We provide examples for all 3 eval metrics described above in the `task_config/` folder. The example provided in `task_config/onboard_task_data__engaging.json` requires users to select the most engaging response. To change the question asked during onboarding, set `mephisto.blueprint.annotation_question` (see the combined example at the end of this section).

We recommend modifying `mephisto.task.task_name` to describe the run parameters, such as the models being compared and the evaluation metric.
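
Putting these suggestions together, a launch command that points onboarding at the provided engagingness example and sets a descriptive task name might look roughly like this (the onboarding question wording, the path placeholder, and the task name are illustrative, not values used in the paper):
```
python run.py \
conf=${CONF} \
mephisto.provider.requester_name=${REQUESTER_NAME} \
mephisto.blueprint.onboard_task_data_path=/path/to/task_config/onboard_task_data__engaging.json \
mephisto.blueprint.annotation_question="Which next response from your partner is more engaging?" \
mephisto.task.task_name=per_turn_eval__blender_3B_vs_blender_90M__engagingness
```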