This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Move code for Mephisto ACUTE-Evals into ParlAI #3002
Merged

Changes from 9 of 11 commits:

- 99c532d Add ACUTE-Eval code (EricMichaelSmith)
- 341a908 Work on README (EricMichaelSmith)
- b9e58e5 README (EricMichaelSmith)
- 799c094 Remove unused bits (EricMichaelSmith)
- f5ee04a Autoformat (EricMichaelSmith)
- 619c472 Linting (EricMichaelSmith)
- 29b3cc8 Lint (EricMichaelSmith)
- 1c5e3fa Fix test (EricMichaelSmith)
- 0981cb8 Fix check (EricMichaelSmith)
- 7fbd33c Jack's comments (EricMichaelSmith)
- 0f62374 Lint (EricMichaelSmith)
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
New file (+101 lines):
# ACUTE-Eval

## Paper information

Margaret Li, Jason Weston, Stephen Roller.
_[ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons](https://arxiv.org/abs/1909.03087)_.

## Citation

If you use this evaluation method in your own work, please cite with the
following BibTeX entry:

```
@misc{li2019acuteeval,
  title={ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons},
  author={Margaret Li and Jason Weston and Stephen Roller},
  year={2019},
  journal={Advances in Neural Information Processing Systems, Conversational AI Workshop},
  url={https://arxiv.org/abs/1909.03087}
}
```

# Code Instructions

Once you have installed [ParlAI](https://github.com/facebookresearch/ParlAI/#installing-parlai) and [Mephisto](https://github.com/facebookresearch/mephisto/blob/master/docs/quickstart.md), follow the instructions below.

The `run.py` script is designed to allow you to run this entire task from the command line with an invocation like:

```
python parlai/crowdsourcing/tasks/acute_eval/example_script.py \
--pairings-filepath parlai/crowdsourcing/tasks/acute_eval/pairings.jsonl
```
## Formatting conversation data

This task code assumes that you've parsed and saved your collected conversations in a simple .jsonl format. The path to this file should be passed in as `--pairings-filepath`.

This is a template of the expected format, with the minimal expected fields:

```
{
  "is_onboarding": false,
  "speakers_to_eval": ["first_modelname", "second_modelname"],
  "dialogue_ids": [dialogue_1_id, dialogue_2_id],
  "dialogue_dicts": [
    {
      "speakers": ["first_modelname", "other_speaker"],
      "dialogue": [
        {"id": "model1", "text": "Hi"},
        {"id": "other_speaker", "text": "Hi back"},
        ...
      ]
    },
    {
      "speakers": ["other_speaker", "second_modelname"],
      "dialogue": [
        {"id": "model1", "text": "Hi"},
        {"id": "other_speaker", "text": "Hi back"},
        ...
      ]
    }
  ]
}
```

You can add an `"image_src"` key to an entry of `"dialogue"` to append an image to a chat message. The value of the key should be a serialized image, starting with a string such as `data:image/jpeg;base64,`.
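As a worked illustration of the format above, the sketch below writes one pairing to a `pairings.jsonl` file and reads it back. The model names and utterances are hypothetical placeholders, not real task data:

```python
import json

# Hypothetical pairing dict following the template above.
pairing = {
    "is_onboarding": False,
    "speakers_to_eval": ["model_a", "model_b"],
    "dialogue_ids": [0, 1],
    "dialogue_dicts": [
        {
            "speakers": ["model_a", "human_evaluator"],
            "dialogue": [
                {"id": "model_a", "text": "Hi"},
                {"id": "human_evaluator", "text": "Hi back"},
            ],
        },
        {
            "speakers": ["human_evaluator", "model_b"],
            "dialogue": [
                {"id": "model_b", "text": "Hello there"},
                {"id": "human_evaluator", "text": "Hello"},
            ],
        },
    ],
}

# .jsonl means one JSON object per line.
with open("pairings.jsonl", "w") as f:
    f.write(json.dumps(pairing) + "\n")

# Round-trip check: each line parses back into a dict.
with open("pairings.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["speakers_to_eval"])  # ['model_a', 'model_b']
```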
For onboarding tasks (tasks used to filter workers; see below for more details), you must additionally set a `correct_answer` field:

```
{
  "is_onboarding": true,
  "speakers_to_eval": ["first_modelname", "second_modelname"],
  "correct_answer": "correct_modelname",
  "dialogue_dicts": [
    # as above
  ]
}
```

Note that we assume that `"dialogue"` consists of strictly alternating turns (e.g. speakers a, b, a, b, a...). Additionally, `speakers_to_eval` must be in the same order as the `dialogue_dicts`. See `pairings.jsonl` for examples of the required format.
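The strict-alternation assumption can be checked programmatically before launching a task. This is an illustrative sketch, not part of the codebase; the helper name and dialogue contents are invented:

```python
def is_strictly_alternating(dialogue):
    # No speaker may take two turns in a row (a, b, a, b, ...).
    ids = [turn["id"] for turn in dialogue]
    return all(ids[i] != ids[i + 1] for i in range(len(ids) - 1))


good = [{"id": "a", "text": "hi"}, {"id": "b", "text": "hey"}, {"id": "a", "text": "yo"}]
bad = [{"id": "a", "text": "hi"}, {"id": "a", "text": "again"}]
print(is_strictly_alternating(good))  # True
print(is_strictly_alternating(bad))   # False
```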
## Question phrasing

In our paper, we address the problem of wording the questions and binary choices in order to elicit the highest-signal responses. The default question and choices correspond to our highest-signal 'engagingness' phrasing, but it's easy to customize them by changing `extra_args['eval_question']`, `extra_args['s1_choice']`, and `extra_args['s2_choice']` in `example_script.py`. The special strings `<Speaker 1>` and `<Speaker 2>` are replaced when showing these questions to the user, and each Speaker's utterances in each conversation will be colored to identify that Speaker.
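To make the placeholder behavior concrete, here is a minimal sketch of the kind of substitution described. The actual replacement is performed by the task frontend; `render_choice` is a hypothetical helper, not part of the codebase:

```python
# Default question and choices, as described above.
eval_question = "Who would you prefer to talk to for a long conversation?"
s1_choice = "I would prefer to talk to <Speaker 1>"
s2_choice = "I would prefer to talk to <Speaker 2>"

speakers = ("Speaker 1", "Speaker 2")  # display names shown to the worker


def render_choice(template: str, speaker_names) -> str:
    """Replace the special <Speaker N> placeholders with display names."""
    out = template
    for idx, name in enumerate(speaker_names, start=1):
        out = out.replace(f"<Speaker {idx}>", name)
    return out


print(render_choice(s1_choice, speakers))  # I would prefer to talk to Speaker 1
```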
## Onboarding tasks

As discussed in the paper, we found that we had better annotation quality if we screened Turkers with an 'onboarding' comparison, consisting of a weak baseline conversation and a human-human conversation. Our code is set up so that this is optional.

By default, `extra_args['block_on_onboarding_fail']` is set to `True`, which means that workers who fail onboarding will be soft-blocked: they won't be able to see or complete any more HITs from you, but they won't receive any notification that they've been blocked. The Mechanical Turk qualification name used to soft-block must be set with `extra_args['block_qualification']`.

By setting `extra_args['onboarding_threshold']`, you can also adjust the minimum proportion of onboarding tasks (if you have multiple) that must be answered correctly in order to pass onboarding.
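The pass/fail rule implied by the threshold can be sketched as follows. This mirrors the described behavior under stated assumptions (the default threshold of 0.75, and a simple correct/total ratio); the exact implementation lives in the ACUTE-Eval runner code:

```python
def passes_onboarding(num_correct: int, num_tasks: int, threshold: float = 0.75) -> bool:
    # A worker passes if the fraction of onboarding comparisons answered
    # correctly meets or exceeds the onboarding threshold.
    if num_tasks == 0:
        return True  # no onboarding tasks configured
    return (num_correct / num_tasks) >= threshold


print(passes_onboarding(3, 4))  # True: 0.75 meets the default threshold
print(passes_onboarding(2, 4))  # False: 0.5 is below 0.75
```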
## Other settings

### Task configuration on MTurk

The title, description, and keywords of the task as shown on MTurk default to the values in `ARG_STRING` in `example_script.py`. These values are used as follows:

- `--task-title`: A short and descriptive title for the kind of task that the HIT contains. On the Amazon Mechanical Turk website, the HIT title appears in search results and everywhere that the HIT is mentioned.
- `--task-description`: Detailed information about the kind of task that the HIT contains. On the Amazon Mechanical Turk website, the HIT description appears in the expanded view of search results and in the HIT and assignment screens.
- `--task-tags`: One or more words or phrases that describe the HIT, separated by commas. On the MTurk website, these words are used in searches to find HITs.
- `--additional-task-description`: Additional text to show in the left-hand pane of the chat window.
### CLI arguments

A comprehensive list of settings specific to ACUTE-Eval can be found in `add_args_to_group()` in `acute_eval_blueprint.py`. For the arguments most likely to be useful when running ACUTE-Eval, see `example_script.py`.
New file (+5 lines):

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```
parlai/crowdsourcing/tasks/acute_eval/acute_eval_agent_state.py (new file, +51 lines):
```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from typing import List, Dict, Any, TYPE_CHECKING
from mephisto.server.blueprints.abstract.static_task.static_agent_state import (
    StaticAgentState,
)
import time

if TYPE_CHECKING:
    from mephisto.data_model.packet import Packet


DATA_FILE = "agent_data.json"


class AcuteEvalAgentState(StaticAgentState):
    """
    Agent state for acute eval tasks.

    Equivalent to StaticAgentState but doesn't have file IO.
    """

    def get_parsed_data(self) -> List[Dict[str, Any]]:
        data = self.get_data()
        assert data is not None, "Should only check parsed data for completed tasks"
        response_list = []
        inputs: List[Dict[str, Any]] = data["inputs"]
        outputs = data["outputs"]
        assert inputs is not None
        assert outputs is not None
        for idx in range(len(inputs)):
            entry: Dict[str, Any] = {}
            entry.update(inputs[idx])
            entry.update(outputs["final_data"][idx])
            response_list.append(entry)
        return response_list

    def update_data(self, packet: "Packet") -> None:
        """
        Process the incoming data packet, and handle updating the state.
        """
        assert (
            packet.data.get("MEPHISTO_is_submit") is True
        ), "Static tasks should only have final act"
        self.state["times"]["task_end"] = time.time()
        self.state["outputs"] = packet.data["task_data"]
        self.save_data()
```
parlai/crowdsourcing/tasks/acute_eval/acute_eval_blueprint.py (new file, +199 lines):
```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from mephisto.data_model.blueprint import Blueprint
from mephisto.data_model.assignment import InitializationData
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_agent_state import (
    AcuteEvalAgentState,
)
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_runner import AcuteEvalRunner
from parlai.crowdsourcing.tasks.acute_eval.acute_eval_builder import AcuteEvalBuilder
from mephisto.core.registry import register_mephisto_abstraction

import os
import math

from typing import ClassVar, List, Type, Any, Dict, Iterable, TYPE_CHECKING

if TYPE_CHECKING:
    from mephisto.data_model.blueprint import AgentState, TaskRunner, TaskBuilder
    from argparse import _ArgumentGroup as ArgumentGroup

BLUEPRINT_TYPE = "acute_eval"


# WISH AcuteEval's blueprint can probably be extended to compare more than just convos
@register_mephisto_abstraction()
class AcuteEvalBlueprint(Blueprint):
    """
    Blueprint for a task that asks humans to compare conversational outputs.
    """

    AgentStateClass: ClassVar[Type["AgentState"]] = AcuteEvalAgentState
    TaskBuilderClass: ClassVar[Type["TaskBuilder"]] = AcuteEvalBuilder
    TaskRunnerClass: ClassVar[Type["TaskRunner"]] = AcuteEvalRunner
    supported_architects: ClassVar[List[str]] = ["mock"]  # TODO update
    BLUEPRINT_TYPE = BLUEPRINT_TYPE

    @classmethod
    def assert_task_args(cls, opts: Any) -> None:
        """
        Ensure that the data can be properly loaded.
        """
        if opts.get("pairings_filepath") is not None:
            pairings_filepath = os.path.expanduser(opts["pairings_filepath"])
            assert os.path.exists(
                pairings_filepath
            ), f"Provided file {pairings_filepath} doesn't exist"
        elif opts.get("pairings_task_data") is not None:
            assert (
                len(opts.get("pairings_task_data")) > 0
            ), "Length of data dict provided was 0"
        else:
            raise AssertionError(
                "Must provide one of a data csv, json, or a list of tasks"
            )

        if opts.get("block_on_onboarding_fail") is True:
            if opts.get("block_qualification") is None:
                raise AssertionError(
                    "Must provide `block_qualification` to use `block_on_onboarding_fail`"
                )

    @classmethod
    def add_args_to_group(cls, group: "ArgumentGroup") -> None:
        """
        Adds required options for AcuteEvalBlueprints.

        task_source points to the file intending to be deployed for this task
        pairings_filepath has the data to be deployed for this task.
        """
        super(AcuteEvalBlueprint, cls).add_args_to_group(group)

        group.description = """
            AcuteEvalBlueprint: Tasks launched from acute eval blueprints
            require sets of pairings for workers to be able to compare to.

            These pairings can be provided as a csv or by passing a
            pairings_task_data dict into extra_args.
        """
        group.add_argument(
            "--annotations-per-pair",
            dest="annotations_per_pair",
            type=int,
            default=1,
            help="Number of annotations per conversation comparison pair",
        )
        group.add_argument(
            "--pairings-filepath",
            dest="pairings_filepath",
            type=str,
            default=None,
            help="path to the file containing the task dictionaries",
        )
        # group.add_argument(
        #     '--task-config',
        #     type=dict,
        #     default=DEFAULT_TASK_CONFIG,
        #     help='dict with keys "hit_title", "hit_description", "hit_keywords", '
        #     'determining how task is displayed on MTurk site',
        # )
        group.add_argument(
            "--s1-choice",
            dest="s1_choice",
            type=str,
            default="I would prefer to talk to <Speaker 1>",
            help="text next to speaker 1 radio button",
        )
        group.add_argument(
            "--s2-choice",
            dest="s2_choice",
            type=str,
            default="I would prefer to talk to <Speaker 2>",
            help="text next to speaker 2 radio button",
        )
        group.add_argument(
            "--eval-question",
            dest="eval_question",
            type=str,
            default="Who would you prefer to talk to for a long conversation?",
            help='question to present to turker for comparison (e.g. "Which speaker is better?")',
        )
        group.add_argument(
            "--block-on-onboarding-fail",
            dest="block_on_onboarding_fail",
            type=bool,
            default=True,
            help="whether to block on onboarding failure",
        )
        group.add_argument(
            "--subtasks-per-unit",
            dest="subtasks_per_unit",
            type=int,
            default=5,
            help="number of subtasks/comparisons to do per unit",
        )
        group.add_argument(
            "--onboarding-threshold",
            dest="onboarding_threshold",
            type=float,
            default=0.75,
            help="minimum accuracy on onboarding tasks, as a float 0-1.0",
        )
        group.add_argument(
            "--random-seed",
            dest="random_seed",
            type=int,
            default=42,
            help="seed for random",
        )
        # group.add_argument(
        #     '--softblock-list-path',
        #     dest="softblock_list_path",
        #     type=str,
        #     default=None,
        #     help='Path to list of workers to softblock, separated by line breaks',
        # )
        group.add_argument(
            "--additional-task-description",
            dest="additional_task_description",
            type=str,
            default='',
            help="Additional text to show on the left pane",
        )
        return

    def get_frontend_args(self) -> Dict[str, Any]:
        """
        Specifies what options within a task_config should be forwarded to the client
        for use by the task's frontend.
        """
        return {
            "task_description": "Placeholder Task Description - Javascript failed to load",
            "frame_height": 650,
            "num_subtasks": self.opts["subtasks_per_unit"],
            "question": self.opts["eval_question"],
            "block_mobile": True,
            "get_task_feedback": False,  # TODO(#95) make option
            "additional_task_description": self.opts['additional_task_description'],
        }

    def get_initialization_data(self) -> Iterable["InitializationData"]:
        """
        Return the InitializationData retrieved from the specified stream.
        """
        # TODO(#99) once we can release HITs over time, configure this to
        # release as many as needed thusfar and top off when
        # onboardings fail
        print(self.opts)
        num_conversations = math.ceil(
            self.opts.get("num_matchup_pairs", 8)
            / max((self.opts["subtasks_per_unit"] - 1), 1)
        )  # release enough hits to finish all annotations requested
        return [
            InitializationData(shared={}, unit_data=[{}])
            for d in range(num_conversations)
        ]
```

Review comment (on the commented-out `--task-config` argument): Nit: can remove this arg

Review comment (on the commented-out `--softblock-list-path` argument): Nit: this one as well
Review comment: It may make sense to rename `example_script.py`, or update this comment.

Reply: Whoops - yes, changing