This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Move code for Mephisto ACUTE-Evals into ParlAI #3002

Merged Aug 25, 2020 · 11 commits · Changes from 9 commits
5 changes: 5 additions & 0 deletions parlai/crowdsourcing/__init__.py
@@ -0,0 +1,5 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
5 changes: 5 additions & 0 deletions parlai/crowdsourcing/tasks/__init__.py
@@ -0,0 +1,5 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
101 changes: 101 additions & 0 deletions parlai/crowdsourcing/tasks/acute_eval/README.md
@@ -0,0 +1,101 @@
# ACUTE-Eval

## Paper information

Margaret Li, Jason Weston, Stephen Roller.
_[ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons](https://arxiv.org/abs/1909.03087)_.

## Citation

If you use this evaluation method in your own work, please cite it with the
following BibTeX entry:

    @misc{li2019acuteeval,
      title={ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons},
      author={Margaret Li and Jason Weston and Stephen Roller},
      year={2019},
      journal={Advances in Neural Information Processing Systems, Conversational AI Workshop},
      url={https://arxiv.org/abs/1909.03087}
    }

# Code Instructions
Once you have installed [ParlAI](https://github.com/facebookresearch/ParlAI/#installing-parlai) and [Mephisto](https://github.com/facebookresearch/mephisto/blob/master/docs/quickstart.md), follow the instructions below.

The `example_script.py` script is designed to allow you to run this entire task from the command line with an invocation like:
Contributor: It may make sense to rename example_script.py, or update this comment.

Contributor (Author): Whoops - yes, changing
    python parlai/crowdsourcing/tasks/acute_eval/example_script.py \
    --pairings-filepath parlai/crowdsourcing/tasks/acute_eval/pairings.jsonl

## Formatting conversation data

This task code assumes that you've parsed and saved your collected conversations in a simple .jsonl format. The path to this file should be passed in as `--pairings-filepath`.

This is a template of the expected format, showing the minimal required fields:

    {
        "is_onboarding": false,
        "speakers_to_eval": ["first_modelname", "second_modelname"],
        "dialogue_ids": [dialogue_1_id, dialogue_2_id],
        "dialogue_dicts": [
            {
                "speakers": ["first_modelname", "other_speaker"],
                "dialogue": [
                    {"id": "model1", "text": "Hi"},
                    {"id": "other_speaker", "text": "Hi back"},
                    ...
                ]
            },
            {
                "speakers": ["other_speaker", "second_modelname"],
                "dialogue": [
                    {"id": "model1", "text": "Hi"},
                    {"id": "other_speaker", "text": "Hi back"},
                    ...
                ]
            }
        ]
    }

You can add an `"image_src"` key to an entry of `"dialogue"` to append an image to a chat message. The value of the key should be a serialized image, starting with a string such as `data:image/jpeg;base64,`.
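
For example, a message entry with an attached image might look like the following sketch (the base64 payload is truncated, and the text value is purely illustrative):

    {
        "id": "model1",
        "text": "Here is the photo I took.",
        "image_src": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
    }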

For onboarding tasks (tasks used to filter workers; see below for more details), you must additionally set a `correct_answer` field:

    {
        "is_onboarding": true,
        "speakers_to_eval": ["first_modelname", "second_modelname"],
        "correct_answer": "correct_modelname",
        "dialogue_dicts": [
            # as above
        ]
    }

Note that we assume that `"dialogue"` consists of strictly alternating turns (e.g. speakers a, b, a, b, a, ...). Additionally, `speakers_to_eval` must be in the same order as the entries of `dialogue_dicts`. See `pairings.jsonl` for examples of the required format.
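
Because the file is `.jsonl`, each pairing dict above sits on its own line. As a minimal sketch (assuming `pairings` is a list of dicts in the format above), such a file could be written like so:

    import json

    # Serialize each pairing as one JSON object per line, as .jsonl requires.
    with open('pairings.jsonl', 'w') as f:
        for pairing in pairings:
            f.write(json.dumps(pairing) + '\n')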

## Question phrasing

In our paper, we address the problem of wording the questions and binary choices in order to elicit the highest-signal responses. The default question and choices correspond to our highest-signal 'engagingness' phrasing, but it is easy to customize these by changing `extra_args['eval_question']`, `extra_args['s1_choice']`, and `extra_args['s2_choice']` in `example_script.py`. The special strings `<Speaker 1>` and `<Speaker 2>` are replaced when showing these questions to the user, and each Speaker's utterances in the conversation are colored to identify that Speaker.
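
For instance, a minimal sketch of switching to the paper's 'humanness' phrasing in `example_script.py` (assuming `extra_args` is the dict of options used above; the exact wording here is illustrative):

    # Illustrative alternative phrasing; <Speaker 1> and <Speaker 2> are
    # substituted with the actual speaker labels when shown to the worker.
    extra_args['eval_question'] = 'Which speaker sounds more human?'
    extra_args['s1_choice'] = '<Speaker 1> sounds more human'
    extra_args['s2_choice'] = '<Speaker 2> sounds more human'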


## Onboarding tasks

As discussed in the paper, we found that we had better annotation quality if we screened Turkers with an 'onboarding' comparison, consisting of a weak baseline conversation and a human-human conversation. Our code is set up so that this is optional.

By default, `extra_args['block_on_onboarding_fail']` is set to `True`, which means that workers who fail onboarding will be soft-blocked: they won't be able to see or complete any more of your HITs, but they won't receive any notification that they've been blocked. The Mechanical Turk qualification name used to soft-block must be set with `extra_args['block_qualification']`.

By setting `extra_args['onboarding_threshold']`, you can also adjust the minimum proportion of onboarding tasks (if you have multiple) that must be answered correctly to pass onboarding.
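
Putting the onboarding options together, a minimal sketch (the qualification name here is an arbitrary, hypothetical label you choose):

    extra_args['block_on_onboarding_fail'] = True  # the default
    # Hypothetical qualification name used to soft-block failing workers:
    extra_args['block_qualification'] = 'acute_eval_failed_onboarding'
    # Require at least 75% of onboarding comparisons answered correctly (the default):
    extra_args['onboarding_threshold'] = 0.75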


## Other settings

### Task configuration on MTurk

The title, description, and keywords of the task, as shown on MTurk, default to the values in `ARG_STRING` in `example_script.py`. These values are used as follows:
- `--task-title`: A short and descriptive title about the kind of task that the HIT contains. On the Amazon Mechanical Turk web site, the HIT title appears in search results and everywhere that the HIT is mentioned.
- `--task-description`: Includes detailed information about the kind of task that the HIT contains. On the Amazon Mechanical Turk web site, the HIT description appears in the expanded view of search results, and in the HIT and assignment screens.
- `--task-tags`: One or more words or phrases that describe the HIT, separated by commas. On the MTurk website, these words are used in searches to find HITs.
- `--additional-task-description`: Additional text to show in the left-hand pane of the chat window.


### CLI arguments

A comprehensive list of settings specific to ACUTE-Eval can be found in `add_args_to_group()` in `acute_eval_blueprint.py`. For the arguments most likely to be useful for running ACUTE-Eval, see `example_script.py`.
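
For instance, several of these flags can be set explicitly in a single invocation. In this sketch, the numeric values are simply the blueprint defaults shown explicitly, and the task title is illustrative:

    python parlai/crowdsourcing/tasks/acute_eval/example_script.py \
    --pairings-filepath parlai/crowdsourcing/tasks/acute_eval/pairings.jsonl \
    --annotations-per-pair 1 \
    --subtasks-per-unit 5 \
    --onboarding-threshold 0.75 \
    --task-title "Chat Comparison Task"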
5 changes: 5 additions & 0 deletions parlai/crowdsourcing/tasks/acute_eval/__init__.py
@@ -0,0 +1,5 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
51 changes: 51 additions & 0 deletions parlai/crowdsourcing/tasks/acute_eval/acute_eval_agent_state.py
@@ -0,0 +1,51 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from typing import List, Dict, Any, TYPE_CHECKING
from mephisto.server.blueprints.abstract.static_task.static_agent_state import (
    StaticAgentState,
)
import time

if TYPE_CHECKING:
    from mephisto.data_model.packet import Packet


DATA_FILE = "agent_data.json"


class AcuteEvalAgentState(StaticAgentState):
    """
    Agent state for acute eval tasks.

    Equivalent to StaticAgentState but doesn't have file IO.
    """

    def get_parsed_data(self) -> List[Dict[str, Any]]:
        data = self.get_data()
        assert data is not None, "Should only check parsed data for completed tasks"
        response_list = []
        inputs: List[Dict[str, Any]] = data["inputs"]
        outputs = data["outputs"]
        assert inputs is not None
        assert outputs is not None
        for idx in range(len(inputs)):
            entry: Dict[str, Any] = {}
            entry.update(inputs[idx])
            entry.update(outputs["final_data"][idx])
            response_list.append(entry)
        return response_list

    def update_data(self, packet: "Packet") -> None:
        """
        Process the incoming data packet, and handle updating the state.
        """
        assert (
            packet.data.get("MEPHISTO_is_submit") is True
        ), "Static tasks should only have final act"
        self.state["times"]["task_end"] = time.time()
        self.state["outputs"] = packet.data["task_data"]
        self.save_data()
199 changes: 199 additions & 0 deletions parlai/crowdsourcing/tasks/acute_eval/acute_eval_blueprint.py
@@ -0,0 +1,199 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from mephisto.data_model.blueprint import Blueprint
from mephisto.data_model.assignment import InitializationData
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_agent_state import (
    AcuteEvalAgentState,
)
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_runner import AcuteEvalRunner
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_builder import AcuteEvalBuilder
from mephisto.core.registry import register_mephisto_abstraction

import os
import math

from typing import ClassVar, List, Type, Any, Dict, Iterable, TYPE_CHECKING

if TYPE_CHECKING:
    from mephisto.data_model.blueprint import AgentState, TaskRunner, TaskBuilder
    from argparse import _ArgumentGroup as ArgumentGroup

BLUEPRINT_TYPE = "acute_eval"


# WISH AcuteEval's blueprint can probably be extended to compare more than just convos
@register_mephisto_abstraction()
class AcuteEvalBlueprint(Blueprint):
    """
    Blueprint for a task that asks humans to compare conversational outputs.
    """

    AgentStateClass: ClassVar[Type["AgentState"]] = AcuteEvalAgentState
    TaskBuilderClass: ClassVar[Type["TaskBuilder"]] = AcuteEvalBuilder
    TaskRunnerClass: ClassVar[Type["TaskRunner"]] = AcuteEvalRunner
    supported_architects: ClassVar[List[str]] = ["mock"]  # TODO update
    BLUEPRINT_TYPE = BLUEPRINT_TYPE

    @classmethod
    def assert_task_args(cls, opts: Any) -> None:
        """
        Ensure that the data can be properly loaded.
        """
        if opts.get("pairings_filepath") is not None:
            pairings_filepath = os.path.expanduser(opts["pairings_filepath"])
            assert os.path.exists(
                pairings_filepath
            ), f"Provided file {pairings_filepath} doesn't exist"
        elif opts.get("pairings_task_data") is not None:
            assert (
                len(opts.get("pairings_task_data")) > 0
            ), "Length of data dict provided was 0"
        else:
            raise AssertionError(
                "Must provide one of a data csv, json, or a list of tasks"
            )

        if opts.get("block_on_onboarding_fail") is True:
            if opts.get("block_qualification") is None:
                raise AssertionError(
                    "Must provide `block_qualification` to use `block_on_onboarding_fail`"
                )

    @classmethod
    def add_args_to_group(cls, group: "ArgumentGroup") -> None:
        """
        Adds required options for AcuteEvalBlueprints.

        task_source points to the file intended to be deployed for this task;
        pairings_filepath has the data to be deployed for this task.
        """
        super(AcuteEvalBlueprint, cls).add_args_to_group(group)

        group.description = """
            AcuteEvalBlueprint: Tasks launched from acute eval blueprints
            require sets of pairings for workers to be able to compare to.

            These pairings can be provided as a csv or by passing a
            pairings_task_data dict into extra_args.
        """
        group.add_argument(
            "--annotations-per-pair",
            dest="annotations_per_pair",
            type=int,
            default=1,
            help="Number of annotations per conversation comparison pair",
        )
        group.add_argument(
            "--pairings-filepath",
            dest="pairings_filepath",
            type=str,
            default=None,
            help="path to the file containing the task dictionaries",
        )
        # group.add_argument(
        #     '--task-config',
        #     type=dict,
        #     default=DEFAULT_TASK_CONFIG,
        #     help='dict with keys "hit_title", "hit_description", "hit_keywords", '
        #     'determining how task is displayed on MTurk site',
        # )
Contributor: Nit: can remove this arg
        group.add_argument(
            "--s1-choice",
            dest="s1_choice",
            type=str,
            default="I would prefer to talk to <Speaker 1>",
            help="text next to speaker 1 radio button",
        )
        group.add_argument(
            "--s2-choice",
            dest="s2_choice",
            type=str,
            default="I would prefer to talk to <Speaker 2>",
            help="text next to speaker 2 radio button",
        )
        group.add_argument(
            "--eval-question",
            dest="eval_question",
            type=str,
            default="Who would you prefer to talk to for a long conversation?",
            help='question to present to turker for comparison (e.g. "Which speaker is better?")',
        )
        group.add_argument(
            "--block-on-onboarding-fail",
            dest="block_on_onboarding_fail",
            type=bool,
            default=True,
            help="whether to block on onboarding failure",
        )
        group.add_argument(
            "--subtasks-per-unit",
            dest="subtasks_per_unit",
            type=int,
            default=5,
            help="number of subtasks/comparisons to do per unit",
        )
        group.add_argument(
            "--onboarding-threshold",
            dest="onboarding_threshold",
            type=float,
            default=0.75,
            help="minimum accuracy on onboarding tasks, as a float 0-1.0",
        )
        group.add_argument(
            "--random-seed",
            dest="random_seed",
            type=int,
            default=42,
            help="seed for random",
        )
        # group.add_argument(
        #     '--softblock-list-path',
        #     dest="softblock_list_path",
        #     type=str,
        #     default=None,
        #     help='Path to list of workers to softblock, separated by line breaks',
        # )
Contributor: Nit: this one as well
        group.add_argument(
            "--additional-task-description",
            dest="additional_task_description",
            type=str,
            default='',
            help="Additional text to show on the left pane",
        )
        return

    def get_frontend_args(self) -> Dict[str, Any]:
        """
        Specifies what options within a task_config should be forwarded to the client
        for use by the task's frontend.
        """
        return {
            "task_description": "Placeholder Task Description - Javascript failed to load",
            "frame_height": 650,
            "num_subtasks": self.opts["subtasks_per_unit"],
            "question": self.opts["eval_question"],
            "block_mobile": True,
            "get_task_feedback": False,  # TODO(#95) make option
            "additional_task_description": self.opts['additional_task_description'],
        }

    def get_initialization_data(self) -> Iterable["InitializationData"]:
        """
        Return the InitializationData retrieved from the specified stream.
        """
        # TODO(#99) once we can release HITs over time, configure this to
        # release as many as needed thus far and top off when
        # onboardings fail
        print(self.opts)
        num_conversations = math.ceil(
            self.opts.get("num_matchup_pairs", 8)
            / max((self.opts["subtasks_per_unit"] - 1), 1)
        )  # release enough HITs to finish all annotations requested
        return [
            InitializationData(shared={}, unit_data=[{}])
            for d in range(num_conversations)
        ]