<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="./static/index.css">
<title> The Effect of Natural Distribution Shift on Question Answering Models </title>
</head>
<body>
<section id='header'>
<h1 id='title'>
<a class='no-underline-link' href="https://arxiv.org/abs/2004.14444">
The Effect of Natural Distribution Shift <br/>
on Question Answering Models
</a>
</h1>
<h2 id='authors' class='text-muted'>
<ul>
<li><a href='https://people.eecs.berkeley.edu/~miller_john/'>John Miller</a></li>
<li><a href='https://www.karlk.net'>Karl Krauth</a></li>
<li><a href='https://people.eecs.berkeley.edu/~brecht/'>Benjamin Recht</a></li>
<li><a href='https://people.csail.mit.edu/ludwigs/'>Ludwig Schmidt</a></li>
</ul>
</h2>
</section>
<section id='abstract'>
<h3 class='heading'><span>Abstract</span></h3>
<div class='content'>
We build four new test sets for the
<a href="https://rajpurkar.github.io/SQuAD-explorer/">Stanford Question Answering Dataset (SQuAD)</a>
and evaluate the ability of question-answering systems to generalize to new
data. In the original Wikipedia domain, we find no evidence of adaptive
overfitting despite several years of test set re-use. On datasets derived
from New York Times articles, Reddit posts, and Amazon product reviews, we
observe average performance drops of 3.8, 14.0, and 17.4 F1, respectively,
across a broad range of models. In contrast, a strong human baseline matches
or exceeds the performance of SQuAD models on the original domain and
exhibits little to no drop in new domains. Taken together, our results
confirm the surprising resilience of the holdout method and emphasize the
need to move towards evaluation metrics that incorporate robustness to
natural distribution shifts.
</div>
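The F1 numbers above are token-level F1 scores in the style of the
official SQuAD evaluation script. The sketch below illustrates the
metric on a single prediction; it is a simplified illustration that
omits the answer normalization (lowercasing, removing punctuation and
articles) performed by the official script.
<pre><code>from collections import Counter

def token_f1(prediction, ground_truth):
    # Token-level F1 between a predicted and a reference answer string.
    # Simplified sketch: the official SQuAD script also normalizes both
    # strings (lowercasing, removing punctuation and articles) first.
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) &amp; Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the New York Times", "New York Times articles"))  # 0.75
</code></pre>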
</section>
<section id='teaser'>
<figure class='figure'>
<img id='teaser-img' class='figure-img img-fluid' src='./images/figure1.svg'
alt='Model and human F1 scores on the original SQuAD v1.1 test set versus the four new test sets' />
<figcaption class='figure-caption text-left' id='figure1-caption'>
Model and human F1 scores on the original SQuAD v1.1 test set compared
to our new test sets for a broad set of more than 100 models. Each
point corresponds to a model evaluation, shown with 95%
Student's-t confidence intervals (mostly covered by the point
markers). The plots reveal three main phenomena: (i) There is no
evidence of adaptive overfitting on SQuAD, (ii) all of the
models suffer F1 drops on the new datasets, with the magnitude
of the drop strongly depending on the corpus, and (iii) humans
are substantially more robust to natural distribution shifts
than the models. The slopes of the linear fits are 0.92, 1.02,
1.19, and 1.36, respectively, so every point of F1 improvement
on the original dataset translates into roughly 0.9 to 1.4
points of improvement on the new datasets, depending on the
corpus.
</figcaption>
</figure>
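As a worked example of how the slopes relate scores across
distributions, the sketch below fits an ordinary least-squares line to
made-up (original F1, new-domain F1) pairs; the actual fits in the
figure use the full set of more than 100 model evaluations.
<pre><code>import numpy as np

# Illustrative only: made-up (original F1, new-domain F1) pairs for a
# few hypothetical models; the real fits use 100+ model evaluations.
orig_f1 = np.array([70.0, 75.0, 80.0, 85.0, 90.0])
new_f1 = np.array([55.0, 61.0, 67.0, 73.0, 79.0])

slope, intercept = np.polyfit(orig_f1, new_f1, deg=1)
print(f"slope = {slope:.2f}")  # 1.20: each original-domain F1 point
                               # corresponds to 1.2 points on the shift
</code></pre>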
</section>
<section id="leaderboards">
<h3 class="heading"> <span> Leaderboards </span></h3>
<div class="content">
<div id="links">
<a href="squad.html"> <span> New Wikipedia Dataset </span> </a>
<a href="nyt.html"> <span> New York Times </span> </a>
<a href="reddit.html"> <span> Reddit Comments </span> </a>
<a href="amazon.html"> <span> Amazon Reviews </span> </a>
</div>
</div>
</section>
<section id="downloads">
<h3 class="heading"> <span> Download Datasets </span></h3>
<div class="content">
<div id="links">
<a href="https://ndownloader.figshare.com/files/28472799?private_link=2f119bea3e8d711047ec">
<span> New Wikipedia Dataset </span>
</a>
<a href="https://ndownloader.figshare.com/files/28472796?private_link=2f119bea3e8d711047ec">
<span> New York Times </span>
</a>
<a href="https://ndownloader.figshare.com/files/28472805?private_link=2f119bea3e8d711047ec">
<span> Reddit Comments </span>
</a>
<a href="https://ndownloader.figshare.com/files/28472802?private_link=2f119bea3e8d711047ec">
<span> Amazon Reviews </span>
</a>
</div>
</div>
<br>
All the datasets are distributed under the <a href="https://creativecommons.org/licenses/by/4.0/legalcode"> CC BY 4.0 </a> license.
<br>
Datasets are also available via <a href="https://github.com/huggingface/datasets"> huggingface/datasets </a>.
<pre><code># Install the library first: pip install datasets
from datasets import load_dataset

# The second argument selects the corpus:
# one of 'new-wiki', 'nyt', 'reddit', 'amazon'
dataset = load_dataset('squadshifts', 'reddit')
</code></pre>
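The snippet below sketches how to inspect a loaded example. It assumes
the published datasets expose a single 'test' split with SQuAD-style
fields ('question', 'context', 'answers').
<pre><code># Usage sketch: the 'test' split name and the SQuAD-style fields are
# assumptions about the published Hugging Face dataset.
example = dataset['test'][0]
print(example['question'])
print(example['context'][:200])
print(example['answers']['text'])  # list of reference answer strings
</code></pre>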
</section>
<section id="explore">
<h3 class="heading">
<span> <a href="https://huggingface.co/datasets/squadshifts"> Explore Datasets</a> </span>
</h3>
</section>
<section id="paper">
<h3 class="heading"> <a href="https://arxiv.org/abs/2004.14444"> <span> Paper </span> </a></h3>
</section>
<section id='acknowledgements'>
<h3 class='heading'><span>Acknowledgements</span></h3>
We thank
<a href="https://rajpurkar.github.io">Pranav Rajpurkar</a>,
<a href="http://stanford.edu/~robinjia/">Robin Jia</a>,
and
<a href="https://cs.stanford.edu/~pliang/">Percy Liang</a>
for providing us with the
original SQuAD data generation pipeline and answering our many questions about
the SQuAD dataset. We thank
<a href="https://cs.stanford.edu/~nfliu/">Nelson Liu</a>
for generously providing a large number of the SQuAD models we evaluated, and we thank
<a href="https://worksheets.codalab.org/home">the Codalab team</a>
for supporting our model evaluation efforts. This research was
generously supported in part by the National Science Foundation Graduate
Research Fellowship Program under Grant No. DGE 1752814, an Amazon
AWS AI Research Award, and a gift from Microsoft Research.
</section>
</body>
</html>