[Per-Turn Eval] Add project page (#4304)

* Add human eval project page
* Reword
* Spacing issues
1 parent e4d59c3 · commit 8d868db · 2 changed files with 13 additions and 0 deletions
@@ -0,0 +1,11 @@
# Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

## Abstract
At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known ([Liu et al., 2016](https://arxiv.org/abs/1603.08023)), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: different data collection methods vary in their levels of human agreement and statistical sensitivity, and therefore in the human annotation hours and labor costs they require. In this work we compare five different crowdworker-based human evaluation methods and find that the best method depends on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use each method, and to possible future directions.
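
The abstract's point that statistical sensitivity drives annotation hours and labor cost can be made concrete with a quick power calculation. The sketch below is not from the paper: the method names, win rates, and the use of `statsmodels` are illustrative assumptions. A method that separates two models by a wider margin needs far fewer annotations to reach statistical significance.

```python
# A minimal sketch, not from the paper: estimating how many annotations
# two hypothetical evaluation methods would need to detect the same
# model difference. The win rates below are invented for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = NormalIndPower()

# Suppose a pairwise-comparison method shows annotators preferring
# model A over model B 60% of the time, while a less sensitive
# per-turn rating method (binarized to preferences) shows only 55%.
for name, win_rate in [("pairwise comparison", 0.60), ("per-turn rating", 0.55)]:
    # Cohen's h for the observed win rate against the 50% chance level.
    effect = proportion_effectsize(win_rate, 0.50)
    # Annotations per condition for a two-sided test at 80% power.
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                             alternative="two-sided")
    print(f"{name}: ~{n:.0f} annotations per condition")
```

Under these invented numbers, the more sensitive method needs roughly a quarter as many annotations, which is the kind of labor-cost difference at stake when choosing among evaluation methods.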
## Paper

[Link coming soon!]