Add documentation for running/analyzing various human evals #4352

Merged 1 commit on Feb 16, 2022

projects/humaneval/README.md (40 additions, 0 deletions)

At the heart of improving conversational AI is the open problem of how to evaluate conversations.

[Link](https://arxiv.org/abs/2201.04723)

## Performing evaluations

### Pairwise per-turn evaluations (PW-Turn)

See the [PW-Turn README](https://github.com/facebookresearch/ParlAI/blob/main/parlai/crowdsourcing/tasks/pairwise_per_turn_eval/README.md) for running and analyzing pairwise per-turn evaluations.
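
For orientation, ParlAI crowdsourcing tasks are generally launched through Mephisto with Hydra-style overrides. The sketch below assumes this task exposes the usual `run.py` entry point; the actual configuration options are documented in the linked README.
```
# Minimal sketch, assuming the standard Mephisto/Hydra run.py entry point
# shared by ParlAI crowdsourcing tasks; see the PW-Turn README for the
# actual configuration options.
python parlai/crowdsourcing/tasks/pairwise_per_turn_eval/run.py
```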

### Pairwise per-dialogue evaluations (PW-Dialog)

See the "Fast-ACUTE" section of the [Acute-Eval README](https://github.com/facebookresearch/ParlAI/blob/main/parlai/crowdsourcing/tasks/acute_eval/README.md) for running and analyzing pairwise per-dialogue evaluations, referred to as PW-Dialog evaluations in this paper. Specify the evaluation metric with `mephisto.blueprint.acute_eval_type={engaging,human,interesting}`.

To specify the models to compare, pass in the following arguments:
```
mephisto.blueprint.config_path=${CONFIG_PATH} \
mephisto.blueprint.model_pairs=\'${MODEL_NAME_1}:${MODEL_NAME_2}\'
```
where `${CONFIG_PATH}` points to a JSON file that defines the configurations of the two models `${MODEL_NAME_1}` and `${MODEL_NAME_2}`. (See the [Acute-Eval `task_config/` folder](https://github.com/facebookresearch/ParlAI/tree/main/parlai/crowdsourcing/tasks/acute_eval/task_config) for examples of such JSON files for self-chats and for existing human+model chat logs.) For PW-Dialog self-chats, the following settings were used in the JSON config file (a complete launch command is sketched after the model list below):
```
{
    "model_file": MODEL_FILE_STRING,
    "model": "transformer/generator",
    "beam_min_length": 20,
    "batchsize": 1,
    "skip_generation": false,
    "interactive_mode": false,
    "beam_size": 10,
    "inference": "beam",
    "beam_block_ngram": 3,
    "beam_context_block_ngram": 3,
    "beam_block_full_context": false
}
```
(For BlenderBot3B-M0, `beam_min_length` was set to `0` rather than `20`.)
`MODEL_FILE_STRING` was set to the following value for each model evaluated:
- **BlenderBot3B** and **BlenderBot3B-M0**: `"zoo:blender/blender_3B/model"`
- **BlenderBot90M**: `"zoo:blender/blender_90M/model"`
- **Reddit3B**: `"zoo:blender/reddit_3B/model"`
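
Putting these pieces together, a full PW-Dialog launch might look like the sketch below. The `fast_eval.py` entry point and the concrete model-pair names are illustrative assumptions based on the Fast-ACUTE conventions in the Acute-Eval README; substitute the model names defined in your own JSON config file.
```
# Illustrative sketch of a full PW-Dialog (Fast-ACUTE) launch; the entry
# point and the model names are assumptions, not copied from the paper's runs.
python parlai/crowdsourcing/tasks/acute_eval/fast_eval.py \
mephisto.blueprint.acute_eval_type=engaging \
mephisto.blueprint.config_path=${CONFIG_PATH} \
mephisto.blueprint.model_pairs=\'BlenderBot3B:Reddit3B\'
```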

### Single-model evaluations (SM-Turn and SM-Dialog)

See the [SM-Turn/SM-Dialog README](https://github.com/facebookresearch/ParlAI/blob/main/parlai/crowdsourcing/projects/humaneval/single_model_eval/README.md) for running and analyzing single-model evaluations.
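
As with the other tasks, the sketch below assumes a standard Mephisto `run.py` entry point in the `single_model_eval` folder; the real launch and analysis commands live in the linked README.
```
# Minimal sketch, assuming a standard Mephisto run.py entry point; consult
# the SM-Turn/SM-Dialog README for the actual launch and analysis commands.
python parlai/crowdsourcing/projects/humaneval/single_model_eval/run.py
```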

## Citation

If you use the dataset or models in your own work, please cite with the following BibTeX entry: