\chapter{Introduction}
Communication is a fundamental part of human society, enabling the transfer of information between people.
Textual language in particular is found throughout our daily lives;
when asking a question, we type it into a search engine and usually read an answer from either Wikipedia or discussions between other people;
to find out what is happening in the world we read newspapers or listen to a newsreader;
for entertainment we read stories, listen to podcasts, or watch TV shows (virtually all of which contain dialogue);
and for personal communication we write electronic messages in a variety of formats.
One of the keys to all of this communication is the use of structure to convey meaning.
Part of this structure is visible--by convention you are looking at this page from top to bottom and left to right, treating each connected component of ink (or light) as a character, combining a sequence of characters into words, and so on to progressively larger levels of structure.
The focus of this dissertation, syntax, is the mostly invisible structure that exists somewhere between characters and sentences.
People are able to identify syntactic structure by drawing on knowledge from a range of sources, and in the process they discard a vast array of alternative interpretations for a given utterance.
In the past century, artificial symbolic processing by machines has advanced dramatically, to the point where computers can start to engage in textual communication, both understanding what people have written and writing back.
Approaching human-level communication will require systems to resolve syntactic ambiguities in text, either explicitly with an interpretable syntactic structure or implicitly with some other intermediate representation.
Over the past two decades there has been rapid development in systems for the syntactic parsing task, where the input is a single sentence and the output is a structure that encodes syntactic relationships between words in the sentence.
This development has largely been driven by new statistical methods for constructing models from resources manually labeled with syntactic structures by linguists.
In addition to variations in statistical and engineering approaches, research has explored various syntactic representations.
These representations are based on different linguistic theories, each with a body of research going into its design.
These theories vary in several ways: what form is used to encode syntax, what distinctions in structure are meaningful or not (and so represented or not), and what distinctions fall within syntax (and so represented here, or in downstream processing).
In this thesis, we present new algorithms that extend the capabilities of systems in three areas related to syntax.
Together, these developments show how we can go beyond the standard approach to parsing research, considering new challenges and new ways to approach them.
The first area of research we consider is how the community evaluates automatic syntactic parsers to understand their strengths and weaknesses.
We propose a two-stage algorithm that searches for a set of transformations that will correct a parse, then classifies each transformation into one of several error types.
Using the algorithm, we compare parse errors for a range of systems, a range of text domains, and two languages.
The second aspect of syntax we consider is conversion between two different linguistic formalisms, which enables comparison of different parsers, and provides greater flexibility for projects where the parser is part of a larger pipeline.
Our system uses a bottom-up approach that is more flexible and more effective than previous work.
Finally, we explore how to extend parsing algorithms beyond tree structured output.
Existing algorithms need the tree constraint to be efficient, but the standard way to apply the constraint has been to discard structure for a range of phenomena, such as wh-movement, passivization, and fronting.
We introduce a new algorithm that can efficiently find the max or sum over all possible graph structures for a sentence, scored with an edge-factored model.
We also define a new parse representation that is compatible with the algorithm and deterministically convertible into the Penn Treebank (\ptb) style.
Together, our algorithm and representation are able to produce parses that cover virtually all forms of structure in the Penn Treebank, providing a more complete expression of the structure of sentences for downstream processing.
\section{Syntax}
\label{sec:syntax}
\begin{figure}
\scalebox{0.7}{
\framebox{
\begin{tikzpicture}
\node (w0) at (0, 0) {\strut Ellen};
\node (root) [left=3ex of w0.west] {\strut ROOT};
\node (w1) [right=3ex of w0.east] {\strut enjoys};
\node (w2) [right=3ex of w1.east] {\strut running};
\draw [-Stealth] (w0.north east) to [out=45,in=135] node[midway,above=-2pt] {\small nsubj} (w1.north west);
\draw [dashed,-Stealth] (w0.north) to [out=60,in=120] node[midway,above=-2pt] {\small nsubj} (w2.north);
\draw [Stealth-] (w1.north east) to [out=45,in=135] node[midway,above=-2pt] {\small xcomp} (w2.north west);
\draw [-Stealth] (w1.north) to [out=120,in=60] node[midway,above=-1pt] {\small root} (root);
\end{tikzpicture}
}
}
\hfill
\scalebox{0.7}{
\framebox{
\begin{tikzpicture}
\node (w0) at (0, 0) {\strut Ellen};
\node (w1) [right=3ex of w0.east] {\strut enjoys};
\node (w2) [right=3ex of w1.east] {\strut};
\node (w3) [right=3ex of w2.east] {\strut running};
\node (v0) [above=2ex of w0.north] {\strut};
\node (v1) [above=2ex of v0.north] {\strut};
\node (v2) [above=2ex of v1.north] {\strut};
\node (v3) [above=2ex of v2.north] {\strut};
\node (v4) [above=2ex of v3.north] {\strut};
\node (nt2) at (v0 -| w2) {\strut *-1};
\node (nt2b) at (v1 -| w2) {\strut NP};
\node (nt3) at (v1 -| w3) {\strut VP};
\path (nt2b) -- node[midway] (h2-3) {} (nt3);
\node (nt2-3) at (v2 -| h2-3) {\strut S};
\node (nt0) at (v3 -| w0) {\strut NP-1};
\path (w1 |- v3) -- node[midway] (h1-3) {} (nt2-3);
\node (nt1-3) at (v3 -| h1-3) {\strut VP};
\path (nt0) -- node[midway] (h0-3) {} (nt1-3);
\node (nt0-3) at (v4 -| h0-3) {\strut S};
\draw (nt2.north) -- (nt2b.south);
\draw (w3.north) -- (nt3.south);
\draw (w0.north) -- (nt0.south);
\draw (nt2b.north) -- (nt2-3.south) -- (nt3.north);
\draw (w1.north) -- (w1 |- nt2-3.north) -- (nt1-3.south) -- (nt2-3.north);
\draw (nt0.north) -- (nt0-3.south) -- (nt1-3.north);
\draw [dashed] (nt2.west) to [out=180,in=0] (nt0.east);
\end{tikzpicture}
}
}
\hfill
\scalebox{0.7}{
\framebox{
\begin{tikzpicture}[every text node part/.style={align=center}]
\node (w0) at (0, 0) {\strut Ellen \\ \strut $\cf{NP}$};
\node (w1) [right=2ex of w0.east] {\strut enjoys \\ \strut $\cf{(S\bs NP)/(S\bs NP)}$};
\node (w2) [right=2ex of w1.east] {\strut running \\ \strut $\cf{S\bs NP}$};
\node (c00L) at (w1.south west) {};
\node (c00R) at (w2.south east) {};
\draw [-{Straight Barb[length=2mm]}] (c00L) -- node[below=0ex] (c00) {\strut $\cf{S\bs NP}$} (c00R);
\node (c10L) at (w0.south west |- c00.south) {};
\node (c10R) at (c00R |- c00.south) {};
\draw [{Straight Barb[length=2mm]}-] (c10L) -- node[below=0ex] (c10) {\strut $\cf{S}$} (c10R);
%%% % Alternative
%%% \node (w0b) at (0, -4) {\strut Ellen \\ \strut $\cf{NP}$};
%%% \node (w1b) [right=2ex of w0b.east] {\strut enjoys \\ \strut $\cf{(S\bs NP)/(S\bs NP)}$};
%%% \node (w2b) [right=2ex of w1b.east] {\strut running \\ \strut $\cf{S\bs NP}$};
%%% \node (c00Lb) at (w0b.south west) {};
%%% \node (c00Rb) at (w0b.south east) {};
%%% \draw [-] (c00Lb) -- node[at end] {\small \textbf{T}} node[below=0ex] (c00b) {\strut $\cf{S/(S\bs NP)}$} (c00Rb);
%%% \node (c10Lb) at (w0b.south west |- c00b.south) {};
%%% \node (c10Rb) at (w1b.south east |- c00b.south) {};
%%% \draw [-{Straight Barb[length=2mm]}] (c10Lb) -- node[pos=1.03] {\small \textbf{B}} node[below=0ex] (c10b) {\strut $\cf{S/(S\bs NP)}$} (c10Rb);
%%% \node (c20Lb) at (w0b.south west |- c10b.south) {};
%%% \node (c20Rb) at (w2b.south east |- c10b.south) {};
%%% \draw [-{Straight Barb[length=2mm]}] (c20Lb) -- node[below=0ex] (c20b) {\strut $\cf{S}$} (c20Rb);
\end{tikzpicture}
}
}
\caption[Dependency, constituency, and categorial forms of the sentence \textexample{Ellen enjoys running}.]{
\label{fig:syntax} Dependency, constituency, and categorial forms of the sentence \textexample{Ellen enjoys running}.
Visualizations used here are in the style of \textcite[][dependency]{ud}, \textcite[][constituency]{ptb-guide} and \textcite[][categorial]{Steedman:2000}.
}
\end{figure}
One of the core research areas in linguistics is syntax, the study of processes that determine word order in sentences.
For the purposes of this thesis, it is important to understand a few general properties of the syntactic theories we consider.
Each theory encodes relationships between words using structure, but the form of those structures varies considerably.
Figure~\ref{fig:syntax} shows how the sentence \textexample{Ellen enjoys running} is represented in the three syntactic formalisms we use.
In the leftmost case, which uses dependency grammar \parencite[\depgr;][]{dependency-grammar}, each edge indicates a relationship between two words.
The label expresses the type of dependency, out of 40--50 types \parencite{ud,sd}.
The arrow points from the word that is the \textit{dependent} / \textit{child} to the word that is the \textit{head} / \textit{parent}\footnote{The direction of the arrow is a convention we are following from linguistics. Note that this is the reverse of the convention in graph theory.}.
The dashed edge is often excluded in order to make the edges form a tree (a structure where every word has exactly one parent and the edges form one connected structure).
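
To make the tree constraint concrete, the following is a minimal sketch (in Python, using the word indices from Figure~\ref{fig:syntax}; the helper is ours, purely for illustration) that checks whether a set of head-dependent edges forms a tree.
\begin{verbatim}
# Minimal sketch: does a set of (head, dependent) edges form a tree?
# Words are numbered Ellen=1, enjoys=2, running=3; ROOT is word 0.
def is_tree(num_words, edges):
    parents = {}
    for head, dep in edges:
        if dep in parents:            # a word with two parents breaks the tree
            return False
        parents[dep] = head
    if len(parents) != num_words:     # every word needs exactly one parent
        return False
    for word in range(1, num_words + 1):
        seen, node = set(), word
        while node != 0:              # follow parents up to ROOT, watching for cycles
            if node in seen or node not in parents:
                return False
            seen.add(node)
            node = parents[node]
    return True

tree_edges = [(2, 1), (0, 2), (2, 3)]
print(is_tree(3, tree_edges))             # True
print(is_tree(3, tree_edges + [(3, 1)]))  # False: the dashed edge gives Ellen two parents
\end{verbatim}
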
The middle case, which uses Government and Binding Theory \parencite[\gb;][]{gb}, has a hierarchical phrase structure.
Each symbol is a \textit{constituent}, capturing the idea that the set of words beneath it constitute a single functional unit.
Constituents are linked together to form larger constituents according to rules.
This figure also shows a null element in between \textexample{enjoys} and \textexample{running}, which is used to encode the relationship between \textexample{Ellen} and \textexample{running}.
As with the dashed edge in the dependency case, the null element is often removed from this structure to make it a tree.
The rightmost case, categorial grammar \parencite{categorial-grammar}, also has a hierarchical structure, but uses complex lexical categories that are then combined according to a small set of inference rules.
This particular example uses Combinatory Categorial Grammar \parencite[\ccg;][]{Steedman:2000}, a variant in which combinatory logic is used to construct both syntactic and semantic forms in parallel.
Immediately beneath each word is its lexical category, which can be either an atomic symbol or a complex structured combination of symbols.
A small set of generic combinators define how pairs of adjacent categories can be combined.
Along with the syntactic derivation, \ccg builds up a lambda expression denoting the semantic structure of the sentence (not shown in this figure).
One interesting property of the formalism is that a sentence can have multiple different derivations with the same semantic representation.
This ability to have different structures encode the same meaning is a form of derivational or spurious ambiguity\footnote{\textcite{Steedman:2000} explores the possibility that this ambiguity could encode variations in prosody.}.
Note that unlike the previous two approaches, here the connection between \textexample{Ellen} and \textexample{running} is retained while keeping the structure a tree.
This is possible because the relation has been threaded through the tree via the categories.
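
To make the role of the combinators concrete, the following is a minimal sketch (in Python, with simplified string categories and no semantics) of the two application rules used in the derivation in Figure~\ref{fig:syntax}; it is an illustration of the idea rather than a formal definition.
\begin{verbatim}
# Minimal sketch of CCG forward and backward application over string categories.
def split_cat(cat):
    """Return (result, slash, argument) at the outermost slash, or None if atomic."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        if cat[i] == ')':
            depth += 1
        elif cat[i] == '(':
            depth -= 1
        elif depth == 0 and cat[i] in '/\\':
            return cat[:i], cat[i], cat[i + 1:]
    return None

def strip(cat):
    return cat[1:-1] if cat.startswith('(') and cat.endswith(')') else cat

def combine(left, right):
    """Try forward application, then backward application; None if neither applies."""
    l, r = split_cat(left), split_cat(right)
    if l and l[1] == '/' and strip(l[2]) == right:    # forward:  X/Y  Y   => X
        return strip(l[0])
    if r and r[1] == '\\' and strip(r[2]) == left:    # backward: Y    X\Y => X
        return strip(r[0])
    return None

vp = combine(r'(S\NP)/(S\NP)', r'S\NP')   # forward application  -> S\NP
print(vp, combine('NP', vp))              # backward application -> S
\end{verbatim}
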
%%%Comparing these three representations of syntax we can see some axes of variation:
%%%\begin{itemize}
%%% \item[Lexical heads]
%%% Explicit in \depgr. Implicitly encoded by \ccg via the direction of composition. Not represented in \gb.
%%% \item[Constituents]
%%% Explicit in \gb and \ccg. Can be inferred in \depgr by considering all descendants of a word (note, not necessarily a continuous span).
%%% \item[Number of rules vs. number of labels]
%%% \ccg has a small set of general combinators, but many categories. \gb has many rules, but only a few labels. \depgr has a small number of both.
%%% \item[Structural properties]
%%% \ccg derivations are always trees, but have spurious ambiguity, while \depgr and \gb do not have spurious ambiguity but cannot represent all relations in a tree.
%%%\end{itemize}
While syntactic formalisms are used in a variety of ways, our focus is on the automatic production of syntactic structures by computer programs.
These programs take a sentence as input, consider possible structures and return the one that is best according to some scoring model\footnote{The sentence is assumed to be grammatical--a parse is always returned.}.
Individually considering every possible structure for a sentence is infeasible as the number of possible parses is exponential in the length of the sentence.
One way to avoid this issue is to perform an approximate search that incrementally builds the parse, maintaining only a few options at any given point.
However, if the part of the optimal structure that is built first scores poorly, then it may drop out of the list of options, and so there is no guarantee of finding the optimal parse.
Alternatively, we can maintain optimality at the cost of model flexibility, constraining the model so that the score for a parse is the sum of scores for each of its components.
This work follows the second approach, which enables dynamic programming methods, where the larger problem (find the optimal parse) is decomposed into independent sub-problems (find the optimal parse for part of the sentence) whose solutions can be used to solve the original problem (take the best solution from one half of the sentence and combine it with the best solution from the other half).
The specific form of dynamic programming applied to the task of finding optimal parse structures is the CKY algorithm \parencite{Cocke:1969,Kasami:1966,Younger:1967}.
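
To make this decomposition concrete, the following is a minimal CKY-style sketch (in Python, with a toy grammar and toy scores invented for illustration) in which the best-scoring analysis of a span is built from the best-scoring analyses of its sub-spans.
\begin{verbatim}
# Minimal CKY sketch: best-scoring parse under a score-per-rule model.
# The grammar and scores below are toy values for illustration only.
binary_rules = {('S', 'NP', 'VP'): 1.0, ('VP', 'V', 'NP'): 1.0}
lexical_rules = {('NP', 'Ellen'): 1.0, ('V', 'enjoys'): 1.0, ('NP', 'running'): 0.5}

def cky(words):
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # chart[i][j]: label -> best score
    for i, w in enumerate(words):                               # lexical layer
        for (tag, word), s in lexical_rules.items():
            if word == w and s > chart[i][i + 1].get(tag, float('-inf')):
                chart[i][i + 1][tag] = s
    for span in range(2, n + 1):                                # build larger spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                           # split point
                for (parent, left, right), s in binary_rules.items():
                    if left in chart[i][k] and right in chart[k][j]:
                        total = chart[i][k][left] + chart[k][j][right] + s
                        if total > chart[i][j].get(parent, float('-inf')):
                            chart[i][j][parent] = total
    return chart[0][n]

print(cky(['Ellen', 'enjoys', 'running']))   # {'S': 3.5}
\end{verbatim}
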
\section{Error Analysis}
The standard resource for parsing research is the Wall Street Journal section of the Penn Treebank \parencite{ptb}, a collection of one million words of text from 1989 issues of the Wall Street Journal that have been annotated by experts with syntactic structure in a \gb style.
The standard measure of constituent parser performance is the F-score, the harmonic mean of precision\footnote{
Number of correct nodes in the output structure, divided by the total number of nodes it has.
} and recall\footnote{
Number of correct nodes in the output structure, divided by the number of nodes in the gold structure.
} on labeled nodes in the parse.
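
As a small illustration (in Python, with invented bracket sets), the metric can be computed as follows.
\begin{verbatim}
# Minimal sketch of labeled bracket precision, recall, and F1.
# Each bracket is (label, start, end); duplicate brackets are ignored here.
def parseval(predicted, gold):
    correct = len(predicted & gold)
    p = correct / len(predicted)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if correct else 0.0
    return p, r, f1

gold = {('S', 0, 3), ('NP', 0, 1), ('VP', 1, 3)}
predicted = {('S', 0, 3), ('NP', 0, 1), ('NP', 2, 3)}
print(parseval(predicted, gold))   # roughly (0.67, 0.67, 0.67)
\end{verbatim}
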
Performance on \wsj section 23 has exceeded $90$ F\textsubscript{1} \parencite{Petrov-Klein:2007}, and $92$ F\textsubscript{1} when using self-training and reranking \parencite{McClosky-Charniak-Johnson:2006,Charniak-Johnson:2005}.
While these results give a useful measure of overall performance, they provide no information about the nature, or relative importance, of the remaining errors.
Broad investigations of parser errors beyond the \parseval metric \parencite{Black-etal:1991} have either focused on specific parsers, \myeg \parencite{Collins:2003}, or have involved conversion to \depgr \parencite{Carroll-etal:1998,King:2003}.
In all of these cases, the analysis has not taken into consideration how a set of errors can have a common cause, \myeg a single mis-attachment can create multiple node errors.
In the first part of the thesis, we propose a new method of error classification using tree transformations.
Errors in the parse tree are repaired using subtree movement, node creation, and node deletion.
Each step in the process is then associated with a linguistically meaningful error type, based on factors such as the node that is moved, its siblings, and parents.
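
The sketch below (in Python) is a deliberately schematic illustration of the classification stage only: a hypothetical repair sequence is mapped to error types using the label of the node involved and its new parent; the categories shown are illustrative rather than the full inventory used in this thesis.
\begin{verbatim}
# Schematic sketch of the classification stage: each repair step found by
# the search stage is mapped to an error type.  The steps and categories
# here are hypothetical, invented for illustration.
def error_type(step):
    kind, label, parent = step
    if kind == 'move' and label == 'PP':
        return 'PP attachment'
    if kind == 'move' and label in ('S', 'SBAR', 'ADJP', 'ADVP'):
        return 'clause or modifier attachment'
    if kind == 'create' and parent == 'NP':
        return 'NP internal structure'
    return '{} {}'.format(kind, label)

# A hypothetical repair sequence: (operation, node label, new parent label).
steps = [('move', 'PP', 'VP'), ('create', 'NP', 'NP'), ('delete', 'UCP', 'VP')]
print([error_type(s) for s in steps])
# ['PP attachment', 'NP internal structure', 'delete UCP']
\end{verbatim}
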
Using our method we analyze the output of thirteen constituency parsers on newswire.
Some of the frequent error types that we identify are widely recognized as challenging, such as prepositional phrase (PP) attachment.
However, other significant types have not received as much attention, such as attachment of clauses, adjective phrases, and adverb phrases.
We also investigate where reranking and self-training improve parsing, and where performance decreases when parsing out-of-domain text.
Previously, these were all analyzed only in terms of their impact on F-score.
\section{Formalism Conversion}
As shown above, there are many ways of expressing syntactic structure.
Extensive work has gone into converting the Penn Treebank to other formalisms, such as \hpsg \parencite{Miyao-Ninomiya-Tsujii:2004}, \lfg \parencite{Cahill-etal:2008}, \ltag \parencite{Xia:1999}, and \ccg \parencite{CCGBank}.
These conversions are complex processes that render linguistic phenomena in formalism-specific ways.
Tools for the reverse process, converting back to the \ptb, enable performance comparisons between parsers based on the two formalisms, and provide more options for researchers and developers building pipelines in which parsing is only one step.
However, the reversal is difficult, as the original conversion may have lost information or smoothed over inconsistencies in the corpus.
\textcite{Clark-Curran:2009} developed a \ccg to \ptb auto-conversion tool that treats the \ccg derivation as a phrase structure tree and applies hand-crafted rules to every pair of categories that combine in the derivation.
Because their approach does not exploit the generalizations inherent in the \ccg formalism, they must resort to ad-hoc rules over non-local features of the \ccg constituents being combined (when a fixed pair of \ccg categories corresponds to multiple \ptb structures).
Even with such rules, they correctly auto-convert only 39.7\% of gold CCGbank derivations.
In the second chapter, we describe an auto-conversion method that assigns a set of bracket instructions to each word based on its \ccg category, then follows the \ccg derivation, applying and combining instructions at each combinatory step to build a phrase structure tree.
This requires specific instructions for each category (not all pairs), and generic operations for each combinator.
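
The sketch below (in Python) is a heavily simplified, hypothetical illustration of this idea for the sentence from Figure~\ref{fig:syntax}; the instruction format is invented for illustration, and the result categories are supplied by hand rather than computed from the combinator as a real implementation would.
\begin{verbatim}
# Hypothetical sketch: each category carries brackets for its word and a rule
# for merging with the constituent built for its argument.  The instruction
# set here is invented and far smaller than the one used in this thesis.
INSTRUCTIONS = {
    'NP': ('(NP {w})', None),
    r'S\NP': ('(VP (VBG {w}))',
              lambda f, a: '(S {} {})'.format(a, f)),        # subject attaches under S
    r'(S\NP)/(S\NP)': ('(VBZ {w})',
              lambda f, a: '(VP {} (S {}))'.format(f, a)),   # complement wrapped in an S
}

def leaf(word, cat):
    return INSTRUCTIONS[cat][0].format(w=word), cat

def combine(functor, argument, result_cat):
    (ftree, fcat), (atree, _) = functor, argument
    merge = INSTRUCTIONS[fcat][1]          # generic step: apply the functor's rule
    return merge(ftree, atree), result_cat

ellen = leaf('Ellen', 'NP')
enjoys = leaf('enjoys', r'(S\NP)/(S\NP)')
running = leaf('running', r'S\NP')
vp = combine(enjoys, running, r'S\NP')
print(combine(vp, ellen, 'S')[0])
# (S (NP Ellen) (VP (VBZ enjoys) (S (VP (VBG running)))))
\end{verbatim}
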
Unlike \textcite{Clark-Curran:2009}'s approach, we require no rules that consider non-local features of constituents, which opens the possibility of simple integration with a CKY-based parser.
Our approach perfectly auto-converts 51.4\% of sentences, an 11.7\% (absolute) improvement over \textcite{Clark-Curran:2009}.
On the remaining sentences our auto-conversion generally handles most of the sentence correctly, but makes mistakes on some clause spans and rare spans such as QPs, NXs, and NACs.
Many of these errors stem from inconsistencies in the original \ptb annotations and are not recoverable.
Applying our tool to the output of several \ccg parsers, we are able to compare them with standard \ptb parsers, showing that their accuracy falls within the middle of the performance range of well known systems.
Our auto-conversion tool is easy to use and more effective than prior work, providing a convenient means of getting \ptb-style output from a \ccg parser.
\section{Graph Parsing} \label{sec:intro-graph}
In Section~\ref{sec:syntax} we saw that while parse structures can be graphs and discontinuous, removing some edges in the dependency parse, or the traces in the Government and Binding structure, can make the structures into projective trees.
Whether these edges are included in our grammar impacts the class of formal languages we are generating \parencite{Chomsky-grammars}.
The trees are entirely within the context-free class, while the graphs are in the context-sensitive class.
There are a range of well-known polynomial-time algorithms for parsing in the context-free class \parencite{Cocke:1969,Younger:1967,Kasami:1966,Earley:1970,Lang:1974}, but parsing of context-sensitive grammars is PSPACE-complete \parencite{Kuroda:1964,Savitch:1970}.
Fortunately, human language appears to fall into a class somewhere in between these two, which researchers have attempted to characterize using mildly-context-sensitive grammars \parencite{weir-joshi:1988:ACL} and range concatenation grammars \parencite{Boullier:1998}.
In general, work on these intermediate classes has been in conjunction with the development of new formalisms\footnote{
For example, \ccg falls into the mildly-context-sensitive class.
}.
In the Penn Treebank, the traces are distinguished from the core projective tree structure, and are used to represent control structures, wh-movement and more.
Traces are indicated using nodes in the parse that do not span any words (null elements) and numbers to indicate a connection with another node in the parse (co-indexation, with the null element getting a reference index and the other node getting an identity index).
By varying the null element used, different forms of movement can be indicated.
The list below describes all of the null elements used in the treebank, though only the first two can be assigned reference indexes \parencite{ptb-guide}:
\begin{itemize}
\item *T* -- Trace of non-argument movement, including parasitic gaps
\item (NP *) -- Trace of argument movement, arbitrary PRO, and controlled PRO
\item 0 -- Null complementizer, including null wh-operator
\item *U* -- Unit
\item *?* -- Placeholder for ellipsed material
\item *NOT* -- Anti-placeholder in template gapping
\item *RNR* -- Pseudo-attach: right node raising
\item *ICH* -- Pseudo-attach: interpret constituent here
\item *EXP* -- Pseudo-attach: expletive
\item *PPA* -- Pseudo-attach: permanent predictable ambiguity
\end{itemize}
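
As a concrete illustration of co-indexation, the sketch below (in Python) holds a schematic bracketing of the middle analysis in Figure~\ref{fig:syntax} with the null element retained, and recovers the link between the null subject of \textexample{running} and \textexample{Ellen}; the bracketing is illustrative rather than a verbatim treebank tree.
\begin{verbatim}
# Schematic bracketing with a controlled PRO: the reference index on the
# null element * links the missing subject of "running" to the
# identity-indexed NP "Ellen".  Illustrative only, not a corpus tree.
import re

tree = ('(S (NP-1 Ellen)'
        ' (VP (VBZ enjoys)'
        ' (S (NP (-NONE- *-1))'
        ' (VP (VBG running)))))')

identity = {i: w for _, i, w in re.findall(r'\((\w+)-(\d+) (\w+)\)', tree)}
for i in re.findall(r'\(-NONE- \*[A-Z]*\*?-(\d+)\)', tree):
    print('null element co-indexed with', identity[i])   # -> Ellen
\end{verbatim}
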
However, most parsers and the standard evaluation metric ignore these edges and all null elements, focusing entirely on the tree structure.
By leaving out parts of the structure, they are not explicitly representing all of the relations in the sentence.
These aspects of syntax are excluded not because of disagreements regarding theory, but rather because of the computational challenge of including them.
Unfortunately, this means that downstream tasks such as question answering have to make do with the more limited structure, \myeg in \textexample{Who enjoys running?} the link between \textexample{who} and \textexample{running} is not present in the tree structure.
While there has been work on capturing some parts of this extra structure, it has generally either been through post-processing on trees \parencite{Johnson:2002,Jijkoun:2003,Campbell:2004,Levy:2004,Gabbard:2006}, or has only captured a limited set of phenomena via grammar augmentation \parencite{Collins:1997,dienes-dubey:2003,schmid:2006,cai-chiang-goldberg:2011}.
In both cases, phenomena such as shared arguments are completely ignored.
Similarly, most work on the Abstract Meaning Representation \parencite{amr} has removed edges to turn all structures into trees.
\begin{figure}
\centering
\scalebox{1.0}{
\input{figures/picture-simple-flat}
}
\caption[Parse representations for graph structures.]{ \label{fig:repr}
Parse representations for graph structures: (a) constituency (b) ours.
}
\end{figure}
In the final chapter, we propose a new parse representation and a new algorithm that can efficiently consider almost all observed syntactic phenomena.
Our representation is an extension of TAG-based tree representations \parencite{cck,Shen:2007}, modified to represent graphs and designed to maximize coverage under a new class of graphs.
Our algorithm extends a non-projective tree parsing algorithm \parencite{ec} to graph structures, with improvements to avoid derivational ambiguity.
Our representation, shown in Figure~\ref{fig:repr}b, consists of complex tags composed of non-terminals, and edges indicating attachment.
In this form, traces can create problematic structures such as directed cycles, but we show how careful choice of head rules can minimize such issues.
Our algorithm runs in time $O(n^4)$ under a first-order model.
We also introduce extensions that ensure parses contain a directed projective tree of non-trace edges.
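
To spell out the objective, though not the algorithm, the sketch below (in Python, with toy scores and a simplified coverage constraint) enumerates candidate edge sets by brute force and keeps the highest-scoring one under an edge-factored model; the dynamic program described in this chapter exists precisely to avoid this exponential enumeration.
\begin{verbatim}
# Brute-force sketch of edge-factored inference over graphs.  Toy scores and
# a simplified constraint (every word needs at least one parent); the point
# is only to spell out what is being maximized.
from itertools import combinations

n = 3                                            # Ellen enjoys running; ROOT is word 0
candidates = [(h, d) for h in range(n + 1) for d in range(1, n + 1) if h != d]
score = {(2, 1): 3.0, (0, 2): 2.0, (2, 3): 2.5, (3, 1): 1.0}   # other edges score -1

def best_graph():
    best, best_score = None, float('-inf')
    for k in range(1, len(candidates) + 1):
        for edges in combinations(candidates, k):
            if any(all(d != w for _, d in edges) for w in range(1, n + 1)):
                continue                         # some word would have no parent
            total = sum(score.get(e, -1.0) for e in edges)
            if total > best_score:
                best, best_score = edges, total
    return best, best_score

print(best_graph())
# (((0, 2), (2, 1), (2, 3), (3, 1)), 8.5): the graph keeps the extra (dashed)
# edge that a tree-constrained parser would be forced to drop.
\end{verbatim}
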
We implemented a proof-of-concept parser with a basic first-order model, which scored $88.3$ on the standard evaluation metric (F\textsubscript{1} on trees), and recovered a range of trace types.
Together, our representation and algorithm form an inference method that can cover $97.7\%$ of sentences, far above the coverage of projective tree algorithms ($46.8\%$).
\section{Contributions of This Dissertation}
Our contributions are a set of novel algorithms, together with experimental results and analysis produced using those algorithms.
First, we define a new algorithm for error analysis of constituency parsing output, which provides a more intuitive breakdown of error types than previous approaches.
We implement the algorithm and use it to give insight into current parsing effectiveness, considering a wide range of systems, multiple domains, and two languages.
Second, we present a new algorithm for transforming \ccg derivations into \gb parses.
Our approach is significantly more accurate than previous work and has desirable algorithmic properties.
Finally, we describe the first algorithm for inference over the space of \gb graph structures.
Previous work compromised by leaving out important aspects of the syntactic representation in order to satisfy the constraints of the parsing algorithms used.
We show how to efficiently implement the algorithm, and discuss results and remaining challenges.