[Per-Turn Eval] Add project page (#4304)
* Add human eval project page

* Reword

* Spacing issues
EricMichaelSmith authored Jan 13, 2022
1 parent e4d59c3 commit 8d868db
Showing 2 changed files with 13 additions and 0 deletions.
2 changes: 2 additions & 0 deletions projects/README.md
@@ -144,3 +144,5 @@ _QA model for answering questions by retrieving and reading knowledge._
- **ACUTE-Eval** [[parlai task]](https://github.com/facebookresearch/ParlAI/tree/main/parlai/crowdsourcing/tasks/acute_eval) [[paper]](https://arxiv.org/abs/1909.03087).
_ACUTE Eval is a sensitive human evaluation method for dialogue which evaluates whole conversations in a pair-wise fashion, and is our recommended method._

- **Human Evaluation Comparison** [[project]](https://parl.ai/projects/humaneval) [paper coming soon!].
_Compares how well different human crowdworker evaluation techniques can detect relative performance differences among dialogue models._
11 changes: 11 additions & 0 deletions projects/humaneval/README.md
@@ -0,0 +1,11 @@
# Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

## Abstract

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known ([Liu et al., 2016](https://arxiv.org/abs/1603.08023)), and human evaluations are still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: different data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that the best method depends on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use each method, and to possible future directions.
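As a rough, hypothetical sketch (not from the paper and not part of the ParlAI codebase), the snippet below illustrates what "statistical sensitivity" means in practice for a pairwise evaluation such as ACUTE-Eval: an exact two-sided binomial (sign) test on "which model was better?" judgments, showing how the same win rate can be insignificant or significant depending on how many crowdworker annotations are collected. All numbers in the example are made up.

```python
# Hypothetical illustration, not the paper's analysis: an exact two-sided
# binomial (sign) test over pairwise "which model was better?" judgments.
from math import comb


def binomial_two_sided_p(wins_a: int, total: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test against the null hypothesis that the
    two models are preferred equally often (p = 0.5)."""
    # Probability of each possible win count under the null.
    probs = [comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(total + 1)]
    observed = probs[wins_a]
    # Two-sided p-value: total probability of outcomes at least as unlikely
    # as the observed one.
    return min(1.0, sum(q for q in probs if q <= observed + 1e-12))


if __name__ == "__main__":
    # Model A preferred in 60% of pairwise matchups (hypothetical counts):
    print(binomial_two_sided_p(wins_a=60, total=100))   # ~0.057, not significant at 0.05
    print(binomial_two_sided_p(wins_a=120, total=200))  # same win rate, well below 0.05
```

Under this framing, a more sensitive evaluation method is one that pushes annotator preferences further from 50/50 (or reduces annotator noise), so fewer annotations, and hence fewer labor hours, are needed to reach significance.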

## Paper

[Link coming soon!]
