This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Move code for Mephisto ACUTE-Evals into ParlAI #3002
Merged

Changes from 9 of 11 commits:

- 99c532d Add ACUTE-Eval code (EricMichaelSmith)
- 341a908 Work on README (EricMichaelSmith)
- b9e58e5 README (EricMichaelSmith)
- 799c094 Remove unused bits (EricMichaelSmith)
- f5ee04a Autoformat (EricMichaelSmith)
- 619c472 Linting (EricMichaelSmith)
- 29b3cc8 Lint (EricMichaelSmith)
- 1c5e3fa Fix test (EricMichaelSmith)
- 0981cb8 Fix check (EricMichaelSmith)
- 7fbd33c Jack's comments (EricMichaelSmith)
- 0f62374 Lint (EricMichaelSmith)
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
New file (+101 lines):
# ACUTE-Eval

## Paper information

Margaret Li, Jason Weston, Stephen Roller.
_[ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons](https://arxiv.org/abs/1909.03087)_.

## Citation

If you use this evaluation method in your own work, please cite with the
following BibTeX entry:

```
@misc{li2019acuteeval,
  title={ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons},
  author={Margaret Li and Jason Weston and Stephen Roller},
  year={2019},
  journal={Advances in Neural Information Processing Systems, Conversational AI Workshop},
  url={https://arxiv.org/abs/1909.03087}
}
```

# Code Instructions

Once you have installed [ParlAI](https://github.com/facebookresearch/ParlAI/#installing-parlai) and [Mephisto](https://github.com/facebookresearch/mephisto/blob/master/docs/quickstart.md), follow the instructions below.

The `run.py` script is designed to allow you to run this entire task from the command line with an invocation like:

```
python parlai/crowdsourcing/tasks/acute_eval/example_script.py \
--pairings-filepath parlai/crowdsourcing/tasks/acute_eval/pairings.jsonl
```
## Formatting conversation data

This task code assumes that you've parsed and saved your collected conversations in a simple .jsonl format. The path to this file should be passed in as `--pairings-filepath`.

This is a template of the expected format, with the minimal expected fields:

```
{
  "is_onboarding": false,
  "speakers_to_eval": ["first_modelname", "second_modelname"],
  "dialogue_ids": [dialogue_1_id, dialogue_2_id],
  "dialogue_dicts": [
    {
      "speakers": ["first_modelname", "other_speaker"],
      "dialogue": [
        {"id": "model1", "text": "Hi"},
        {"id": "other_speaker", "text": "Hi back"},
        ...
      ]
    },
    {
      "speakers": ["other_speaker", "second_modelname"],
      "dialogue": [
        {"id": "model1", "text": "Hi"},
        {"id": "other_speaker", "text": "Hi back"},
        ...
      ]
    }
  ]
}
```

You can add an `"image_src"` key to an entry of `"dialogue"` to append an image to a chat message. The value of the key should be a serialized image, starting with a string such as `data:image/jpeg;base64,`.
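As a worked illustration of the format above, the sketch below writes one pairing to a `pairings.jsonl` file and reads it back. The model names and utterances are hypothetical placeholders, not real task data:

```python
import json

# Hypothetical pairing dict following the template above.
pairing = {
    "is_onboarding": False,
    "speakers_to_eval": ["model_a", "model_b"],
    "dialogue_ids": [0, 1],
    "dialogue_dicts": [
        {
            "speakers": ["model_a", "human_evaluator"],
            "dialogue": [
                {"id": "model_a", "text": "Hi"},
                {"id": "human_evaluator", "text": "Hi back"},
            ],
        },
        {
            "speakers": ["human_evaluator", "model_b"],
            "dialogue": [
                {"id": "model_b", "text": "Hello there"},
                {"id": "human_evaluator", "text": "Hello"},
            ],
        },
    ],
}

# .jsonl means one JSON object per line.
with open("pairings.jsonl", "w") as f:
    f.write(json.dumps(pairing) + "\n")

# Round-trip check: each line parses back into a dict.
with open("pairings.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["speakers_to_eval"])  # ['model_a', 'model_b']
```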
For onboarding tasks (tasks used to filter workers; see below for more details), you must additionally set a `correct_answer` field:

```
{
  "is_onboarding": true,
  "speakers_to_eval": ["first_modelname", "second_modelname"],
  "correct_answer": "correct_modelname",
  "dialogue_dicts": [
    # as above
  ]
}
```

Note that we assume that `"dialogue"` consists of strictly alternating turns (e.g. speakers a, b, a, b, a...). Additionally, `speakers_to_eval` must be in the same order as the `dialogue_dicts`. See `pairings.jsonl` for examples of the required format.
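The strict-alternation assumption can be checked programmatically before launching a task. This is an illustrative sketch, not part of the codebase; the helper name and dialogue contents are invented:

```python
def is_strictly_alternating(dialogue):
    # No speaker may take two turns in a row (a, b, a, b, ...).
    ids = [turn["id"] for turn in dialogue]
    return all(ids[i] != ids[i + 1] for i in range(len(ids) - 1))


good = [{"id": "a", "text": "hi"}, {"id": "b", "text": "hey"}, {"id": "a", "text": "yo"}]
bad = [{"id": "a", "text": "hi"}, {"id": "a", "text": "again"}]
print(is_strictly_alternating(good))  # True
print(is_strictly_alternating(bad))   # False
```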
## Question phrasing

In our paper, we address the problem of wording the questions and binary choices in order to elicit the highest-signal responses. The default question and choices correspond to our highest-signal 'engagingness' phrasing, but it's easy to customize them by changing `extra_args['eval_question']`, `extra_args['s1_choice']`, and `extra_args['s2_choice']` in `example_script.py`. The special strings `<Speaker 1>` and `<Speaker 2>` are replaced when showing these questions to the user, and each Speaker's utterances in each conversation will be colored to identify that Speaker.
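To make the placeholder behavior concrete, here is a minimal sketch of the kind of substitution described. The actual replacement is performed by the task frontend; `render_choice` is a hypothetical helper, not part of the codebase:

```python
# Default question and choices, as described above.
eval_question = "Who would you prefer to talk to for a long conversation?"
s1_choice = "I would prefer to talk to <Speaker 1>"
s2_choice = "I would prefer to talk to <Speaker 2>"

speakers = ("Speaker 1", "Speaker 2")  # display names shown to the worker


def render_choice(template: str, speaker_names) -> str:
    """Replace the special <Speaker N> placeholders with display names."""
    out = template
    for idx, name in enumerate(speaker_names, start=1):
        out = out.replace(f"<Speaker {idx}>", name)
    return out


print(render_choice(s1_choice, speakers))  # I would prefer to talk to Speaker 1
```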
## Onboarding tasks

As discussed in the paper, we found that we had better annotation quality if we screened Turkers with an 'onboarding' comparison, consisting of a weak baseline conversation and a human-human conversation. Our code is set up so that this is optional.

By default, `extra_args['block_on_onboarding_fail']` is set to `True`, which means that workers who fail onboarding will be soft-blocked: they won't be able to see or complete any more HITs from you, but they won't receive any notification that they've been blocked. The Mechanical Turk qualification name used to soft-block must be set with `extra_args['block_qualification']`.

By setting `extra_args['onboarding_threshold']`, you can also adjust the minimum proportion of onboarding tasks (if you have multiple) that must be answered correctly in order to pass onboarding.
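The pass/fail rule implied by the threshold can be sketched as follows. This mirrors the described behavior under stated assumptions (the default threshold of 0.75, and a simple correct/total ratio); the exact implementation lives in the ACUTE-Eval runner code:

```python
def passes_onboarding(num_correct: int, num_tasks: int, threshold: float = 0.75) -> bool:
    # A worker passes if the fraction of onboarding comparisons answered
    # correctly meets or exceeds the onboarding threshold.
    if num_tasks == 0:
        return True  # no onboarding tasks configured
    return (num_correct / num_tasks) >= threshold


print(passes_onboarding(3, 4))  # True: 0.75 meets the default threshold
print(passes_onboarding(2, 4))  # False: 0.5 is below 0.75
```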
## Other settings

### Task configuration on MTurk

The title, description, and keywords of the task as shown on MTurk default to the values in `ARG_STRING` in `example_script.py`. These values are used as follows:

- `--task-title`: A short and descriptive title for the kind of task that the HIT contains. On the Amazon Mechanical Turk website, the HIT title appears in search results and everywhere that the HIT is mentioned.
- `--task-description`: Detailed information about the kind of task that the HIT contains. On the Amazon Mechanical Turk website, the HIT description appears in the expanded view of search results and in the HIT and assignment screens.
- `--task-tags`: One or more words or phrases that describe the HIT, separated by commas. On the MTurk website, these words are used in searches to find HITs.
- `--additional-task-description`: Additional text to show in the left-hand pane of the chat window.
### CLI arguments

A comprehensive list of settings specific to ACUTE-Eval can be found in `add_args_to_group()` in `acute_eval_blueprint.py`. For the arguments most likely to be useful when running ACUTE-Eval, see `example_script.py`.
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
parlai/crowdsourcing/tasks/acute_eval/acute_eval_agent_state.py (new file, +51 lines):
```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from typing import List, Dict, Any, TYPE_CHECKING
from mephisto.server.blueprints.abstract.static_task.static_agent_state import (
    StaticAgentState,
)
import time

if TYPE_CHECKING:
    from mephisto.data_model.packet import Packet


DATA_FILE = "agent_data.json"


class AcuteEvalAgentState(StaticAgentState):
    """
    Agent state for acute eval tasks.

    Equivalent to StaticAgentState but doesn't have file IO.
    """

    def get_parsed_data(self) -> List[Dict[str, Any]]:
        data = self.get_data()
        assert data is not None, "Should only check parsed data for completed tasks"
        response_list = []
        inputs: List[Dict[str, Any]] = data["inputs"]
        outputs = data["outputs"]
        assert inputs is not None
        assert outputs is not None
        for idx in range(len(inputs)):
            entry: Dict[str, Any] = {}
            entry.update(inputs[idx])
            entry.update(outputs["final_data"][idx])
            response_list.append(entry)
        return response_list

    def update_data(self, packet: "Packet") -> None:
        """
        Process the incoming data packet, and handle updating the state.
        """
        assert (
            packet.data.get("MEPHISTO_is_submit") is True
        ), "Static tasks should only have final act"
        self.state["times"]["task_end"] = time.time()
        self.state["outputs"] = packet.data["task_data"]
        self.save_data()
```
parlai/crowdsourcing/tasks/acute_eval/acute_eval_blueprint.py (new file, +199 lines):
```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from mephisto.data_model.blueprint import Blueprint
from mephisto.data_model.assignment import InitializationData
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_agent_state import (
    AcuteEvalAgentState,
)
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_runner import AcuteEvalRunner
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_builder import AcuteEvalBuilder
from mephisto.core.registry import register_mephisto_abstraction

import os
import math

from typing import ClassVar, List, Type, Any, Dict, Iterable, TYPE_CHECKING

if TYPE_CHECKING:
    from mephisto.data_model.blueprint import AgentState, TaskRunner, TaskBuilder
    from argparse import _ArgumentGroup as ArgumentGroup

BLUEPRINT_TYPE = "acute_eval"


# WISH AcuteEval's blueprint can probably be extended to compare more than just convos
@register_mephisto_abstraction()
class AcuteEvalBlueprint(Blueprint):
    """
    Blueprint for a task that asks humans to compare conversational outputs.
    """

    AgentStateClass: ClassVar[Type["AgentState"]] = AcuteEvalAgentState
    TaskBuilderClass: ClassVar[Type["TaskBuilder"]] = AcuteEvalBuilder
    TaskRunnerClass: ClassVar[Type["TaskRunner"]] = AcuteEvalRunner
    supported_architects: ClassVar[List[str]] = ["mock"]  # TODO update
    BLUEPRINT_TYPE = BLUEPRINT_TYPE

    @classmethod
    def assert_task_args(cls, opts: Any) -> None:
        """
        Ensure that the data can be properly loaded.
        """
        if opts.get("pairings_filepath") is not None:
            pairings_filepath = os.path.expanduser(opts["pairings_filepath"])
            assert os.path.exists(
                pairings_filepath
            ), f"Provided file {pairings_filepath} doesn't exist"
        elif opts.get("pairings_task_data") is not None:
            assert (
                len(opts.get("pairings_task_data")) > 0
            ), "Length of data dict provided was 0"
        else:
            raise AssertionError(
                "Must provide one of a data csv, json, or a list of tasks"
            )

        if opts.get("block_on_onboarding_fail") is True:
            if opts.get("block_qualification") is None:
                raise AssertionError(
                    "Must provide `block_qualification` to use `block_on_onboarding_fail`"
                )

    @classmethod
    def add_args_to_group(cls, group: "ArgumentGroup") -> None:
        """
        Adds required options for AcuteEvalBlueprints.

        task_source points to the file intending to be deployed for this task
        pairings_filepath has the data to be deployed for this task.
        """
        super(AcuteEvalBlueprint, cls).add_args_to_group(group)

        group.description = """
            AcuteEvalBlueprint: Tasks launched from acute eval blueprints
            require sets of pairings for workers to be able to compare to.

            These pairings can be provided as a csv or by passing a
            pairings_task_data dict into extra_args.
        """
        group.add_argument(
            "--annotations-per-pair",
            dest="annotations_per_pair",
            type=int,
            default=1,
            help="Number of annotations per conversation comparison pair",
        )
        group.add_argument(
            "--pairings-filepath",
            dest="pairings_filepath",
            type=str,
            default=None,
            help="path to the file containing the task dictionaries",
        )
        # group.add_argument(
        #     '--task-config',
        #     type=dict,
        #     default=DEFAULT_TASK_CONFIG,
        #     help='dict with keys "hit_title", "hit_description", "hit_keywords", '
        #     'determining how task is displayed on MTurk site',
        # )
        group.add_argument(
            "--s1-choice",
            dest="s1_choice",
            type=str,
            default="I would prefer to talk to <Speaker 1>",
            help="text next to speaker 1 radio button",
        )
        group.add_argument(
            "--s2-choice",
            dest="s2_choice",
            type=str,
            default="I would prefer to talk to <Speaker 2>",
            help="text next to speaker 2 radio button",
        )
        group.add_argument(
            "--eval-question",
            dest="eval_question",
            type=str,
            default="Who would you prefer to talk to for a long conversation?",
            help='question to present to turker for comparison (e.g. "Which speaker is better?")',
        )
        group.add_argument(
            "--block-on-onboarding-fail",
            dest="block_on_onboarding_fail",
            type=bool,
            default=True,
            help="whether to block on onboarding failure",
        )
        group.add_argument(
            "--subtasks-per-unit",
            dest="subtasks_per_unit",
            type=int,
            default=5,
            help="number of subtasks/comparisons to do per unit",
        )
        group.add_argument(
            "--onboarding-threshold",
            dest="onboarding_threshold",
            type=float,
            default=0.75,
            help="minimum accuracy on onboarding tasks, as a float 0-1.0",
        )
        group.add_argument(
            "--random-seed",
            dest="random_seed",
            type=int,
            default=42,
            help="seed for random",
        )
        # group.add_argument(
        #     '--softblock-list-path',
        #     dest="softblock_list_path",
        #     type=str,
        #     default=None,
        #     help='Path to list of workers to softblock, separated by line breaks',
        # )
        group.add_argument(
            "--additional-task-description",
            dest="additional_task_description",
            type=str,
            default='',
            help="Additional text to show on the left pane",
        )
        return

    def get_frontend_args(self) -> Dict[str, Any]:
        """
        Specifies what options within a task_config should be forwarded to the client
        for use by the task's frontend.
        """
        return {
            "task_description": "Placeholder Task Description - Javascript failed to load",
            "frame_height": 650,
            "num_subtasks": self.opts["subtasks_per_unit"],
            "question": self.opts["eval_question"],
            "block_mobile": True,
            "get_task_feedback": False,  # TODO(#95) make option
            "additional_task_description": self.opts['additional_task_description'],
        }

    def get_initialization_data(self) -> Iterable["InitializationData"]:
        """
        Return the InitializationData retrieved from the specified stream.
        """
        # TODO(#99) once we can release HITs over time, configure this to
        # release as many as needed thusfar and top off when
        # onboardings fail
        print(self.opts)
        num_conversations = math.ceil(
            self.opts.get("num_matchup_pairs", 8)
            / max((self.opts["subtasks_per_unit"] - 1), 1)
        )  # release enough hits to finish all annotations requested
        return [
            InitializationData(shared={}, unit_data=[{}])
            for d in range(num_conversations)
        ]
```

Review comment (on the commented-out `--task-config` argument): Nit: can remove this arg

Review comment (on the commented-out `--softblock-list-path` argument): Nit: this one as well
Review comment: It may make sense to rename `example_script.py`, or update this comment.

Reply: Whoops - yes, changing