\section{Implementation} \label{sec:impl}
While the core algorithm is fully described in the previous sections and has desirable asymptotic speed properties, a naive implementation would be slow.
Here we describe the additional design decisions needed to enable efficient parsing and modern training methods.
Each section below describes one component of the system, its optimizations and the motivation behind them.
\subsection{Model}
We use a discriminative model, which assigns scores to parses using a linear combination of weights.
Each weight corresponds to some feature of the parse and input data, \myeg the spine being added is NP and the word to the left is \textexample{the}.
Our features are based on the set defined by \textcite{McDonald-etal:2005:Proj}.
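As an illustration of how such a model assigns scores (a minimal sketch, not our implementation; the feature names below are hypothetical):
\begin{verbatim}
# Minimal sketch of scoring with a sparse linear model (illustrative only).
from collections import defaultdict

weights = defaultdict(float)   # feature name -> learned weight

def score(features):
    """Score of a (partial) parse: dot product of its active features and the weights."""
    return sum(value * weights[name] for name, value in features.items())

# Hypothetical features for one decision: adding an NP spine when the
# word to the left is "the".
decision = {"spine=NP": 1.0, "spine=NP,left_word=the": 1.0}
print(score(decision))
\end{verbatim}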
\subsection{Learning}
\renewcommand{\algorithmicrequire}{}
\renewcommand{\algorithmicensure}{}
\begin{algorithm}
\caption{Online Primal Subgradient with $\ell_1$ or $\ell_2$ regularization, sparse updates}
\algcomment{Note: To implement without the sparse update, use SCORE~$= \mathbf{w}^\top \mathbf{f}(y')$, and run the update loop over all features. Also, for comparison, to implement the perceptron, remove the sparse update and use UPDATE-ACTIVE $= \mathbf{return} \; w - g$.}
\label{alg:ops}
\noindent\begin{algorithmic}[1]
\State Parameter: $\mathrm{iters}$ \Comment{Number of iterations}
\State Parameter: C \Comment{Regularization constant ($10^{-1}$ to $10^{-8}$)}
\State Parameter: $\eta$ \Comment{Learning rate ($10^0$ to $10^{-4}$)}
\State Parameter: $\delta$ \Comment{Initializer for $q$ ($10^{-6}$)}
\State $\mathbf{w} = \mathbf{0} $ \Comment{Weight vector}
\State $\mathbf{q} = \boldsymbol{\delta} $ \Comment{Cumulative squared gradient}
\State $\mathbf{u} = \mathbf{0} $ \Comment{Time of last update for each weight [sparse updates only]}
\State $n = 0$ \Comment{Number of updates so far [sparse updates only]}
\For{$\mathrm{iter} \in [1, \mathrm{iters}]$}
\For{$\mathrm{batch} \in \mathrm{data}$}
\State $\mathbf{g} = \mathbf{0}$ \Comment{Sum of gradients from loss-augmented decodes}
\For{$(x_i,y_i) \in \mathrm{batch}$}
\State $y = \argmax_{y' \in Y(x_i)} [${\footnotesize SCORE}$(y') + \mathbf{L}(y', y_i) ]$ \Comment{Loss-augmented decode}
\State $\mathbf{g} \pluseq (\mathbf{f}(y) - \mathbf{f}(y_i))$ \Comment{Update the active features}
\EndFor
\State $\mathbf{q} \pluseq \mathbf{g}^2$ \Comment{Add the element-wise square of the subgradient}
\State $n \pluseq 1$
\For{$f \in \mathbf{g} \; \mathrm{where} \; g_f \neq 0 $} \Comment{Over all features if not doing sparse updates}
\State $w_f =$ {\footnotesize UPDATE-ACTIVE}($w_f$, $g_f$, $q_f$)
\State $u_f = n$
\EndFor
\EndFor
\EndFor
\Function{\footnotesize UPDATE-ACTIVE}{$w$, $g$, $q$} \Comment{The AdaGrad update}
\State $[\ell_2]$ $\mathbf{return} \; \frac{w \sqrt{q} - \eta g}{\eta \mathrm{C} + \sqrt{q}}$
\State $[\ell_1]$ $\mathrm{d} = |w - \frac{\eta}{\sqrt{q}} g| - \frac{\eta}{\sqrt{q}} \mathrm{C}$
\State $[\ell_1]$ $\mathbf{return} \; \mathrm{sign}(w - \frac{\eta}{\sqrt{q}} g) \cdot \max(0, \mathrm{d})$
\EndFunction
\Function{\footnotesize UPDATE-CATCHUP}{$w$, $q$, $t$} \Comment{A single update equivalent to a series of AdaGrad}
\State $[\ell_2]$ $\mathbf{return} \; w \left(\frac{\sqrt{q}}{\eta \mathrm{C} + \sqrt{q}}\right)^t$ \Comment{updates where the weight's subgradient was zero}
\State $[\ell_1]$ $\mathbf{return} \; \mathrm{sign}(w) \cdot \max(0, |w| - \frac{\eta \mathrm{C}}{\sqrt{q}} t)$
\EndFunction
\Function{\footnotesize SCORE}{$y'$} \Comment{Compute $\mathbf{w}^\top \mathbf{f}(y')$, but if doing sparse updates, then}
\State $s = 0$ \Comment{for each weight, apply an update to catch up on the}
\For{$f \in \mathbf{f}(y')$} \Comment{steps in which the gradient for that weight was zero}
\State $nw =$ \label{alg:line:parallel:start} {\footnotesize UPDATE-CATCHUP}($w_f$, $q_f$, $n - u_f$)
\If{$u_f \neq n$} \Comment{For parallel decoding, this enables updates without}
\State $u_f = n$ \Comment{locking. In some cases, an old weight will be used,}
\State $w_f = nw$ \label{alg:line:parallel:end} \Comment{but $w_f$ will not be updated incorrectly.}
\EndIf
\State $s \pluseq w_f$
\EndFor
\State $\mathbf{return} \; s$
\EndFunction
\end{algorithmic}
\end{algorithm}
We train with an online primal subgradient approach \parencite{Ratliff:2007} as described in \textcite{Kummerfeld-etal:2015:EMNLP}.
We compute the subgradient of the margin objective on each instance by performing a structured loss-augmented decode, then use these instance-wise subgradients to optimize the global objective using AdaGrad \parencite{Duchi:2011} with either \Lone or \Ltwo regularization.
To improve speed, we use sparse updates and batch processing, as described below.
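As a concrete reference, the per-weight update in Algorithm~\ref{alg:ops} (UPDATE-ACTIVE) can be written as follows; this is a simplified sketch of the update rule, not our implementation.
\begin{verbatim}
# Sketch of the per-weight AdaGrad update with l2 or l1 regularization
# (UPDATE-ACTIVE in the pseudocode above). eta: learning rate,
# C: regularization constant, q: cumulative squared subgradient.
import math

def update_active_l2(w, g, q, eta, C):
    # Joint AdaGrad step and l2 shrinkage.
    return (w * math.sqrt(q) - eta * g) / (eta * C + math.sqrt(q))

def update_active_l1(w, g, q, eta, C):
    # AdaGrad step followed by an l1 soft-threshold.
    step = w - (eta / math.sqrt(q)) * g
    d = abs(step) - (eta / math.sqrt(q)) * C
    return math.copysign(1.0, step) * max(0.0, d)
\end{verbatim}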
\subsubsection{Batches}
Instead of updating the model with subgradients calculated for single sentences, we consider the sum over a small number of sentences (a batch).
In each pass through the training data we make fewer updates to our model, but each update is based on a more accurate subgradient.
This has the advantage that we can parse all of the sentences in the batch in parallel.
Since most of our time is spent in parsing, this can produce improvements proportional to the number of CPU cores available.
In practice, memory is also a constraint, and CPU utilization is limited by memory bandwidth and possibly memory management by the Java Virtual Machine.
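The structure of this parallelism is simple, as in the sketch below; \texttt{loss\_augmented\_decode} is a hypothetical stand-in for our decoder, and the real system runs on the JVM with threads sharing the weight vector.
\begin{verbatim}
# Sketch of decoding a batch in parallel before a single weight update.
# loss_augmented_decode is a hypothetical stand-in for our decoder.
from concurrent.futures import ThreadPoolExecutor

def decode_batch(batch, loss_augmented_decode, n_threads=8):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # One loss-augmented decode per sentence; the subgradients are
        # summed afterwards to produce a single update for the batch.
        return list(pool.map(loss_augmented_decode, batch))
\end{verbatim}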
\subsubsection{Sparse Updates}
Not every weight contributes to the score of every parse, but the simplest implementation of AdaGrad modifies every weight in the model when doing an update.
To save time, we distinguish between two different types of update.
When the subgradient for a weight is nonzero, we apply the usual update.
When the subgradient for a weight is zero, we defer a numerically equivalent update until the next time the weight is queried.
This saves time in two ways: during an update we only touch the weights corresponding to the (usually sparse) nonzero coordinates of the current batch's subgradient, and a series of deferred updates can be collapsed into a single closed-form calculation.
Algorithm~\ref{alg:ops} gives pseudocode for our implementation.
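The deferred update has a closed form because a weight's cumulative squared gradient $q_f$ does not change while its subgradient is zero; a sketch of UPDATE-CATCHUP from Algorithm~\ref{alg:ops}:
\begin{verbatim}
# Sketch of the deferred update (UPDATE-CATCHUP in the pseudocode above):
# a weight whose subgradient was zero for t updates is regularized in one
# step the next time it is read. q is unchanged over those t steps.
import math

def update_catchup_l2(w, q, t, eta, C):
    # t applications of the l2 shrinkage factor, collapsed into one power.
    return w * (math.sqrt(q) / (eta * C + math.sqrt(q))) ** t

def update_catchup_l1(w, q, t, eta, C):
    # t applications of the l1 soft-threshold, collapsed into one subtraction.
    return math.copysign(1.0, w) * max(0.0, abs(w) - (eta * C / math.sqrt(q)) * t)
\end{verbatim}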
Sparse updates do add some memory and computation costs.
First, when accessing weights and applying the delayed updates, we need to use synchronization to ensure an exact update.
Second, we need to use memory to track the last time each weight was updated and to provide a lock for each weight (used for synchronization).
If we are willing to give up exactness we can avoid the synchronization delay by applying a lock-free approach similar to \textcite{hogwild}.
Our approach is shown in lines~\ref{alg:line:parallel:start} to \ref{alg:line:parallel:end} of Algorithm~\ref{alg:ops}.
To perform an update we first read all the relevant variables (weight, gradient sum, update time), then calculate the new weight.
Before saving the new weight, we check to see if the time of the last update ($u_f$) is now equal to the current time ($n$).
If it is, we do nothing and continue.
If it is not, we first set the new update time and then update the weight (the order is important).
If multiple threads are applying an update simultaneously then the race to do the update can play out in a few ways.
Since the update time is changed before the weight, and all threads write the same new time, any thread that gets through the time check is guaranteed to perform the correct update (the concern is that a thread might have read an inconsistent set of values).
If the time has changed then either another thread already did the work, in which case all is well, or another thread has set $u_f$, but not $w_f$, in which case we may continue with an old version of the weight.
That slight possibility of using an old weight turns out to not be an issue.
We found that avoiding locks did not impact accuracy, but also did not substantially impact speed.
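In code, the ordering looks like the following sketch (building on the catch-up functions above; the dictionaries \texttt{w}, \texttt{q}, and \texttt{u} are hypothetical stand-ins for our weight, gradient-sum, and update-time arrays):
\begin{verbatim}
# Sketch of the lock-free catch-up applied when a weight is read.
# The update time u[f] is written before the weight w[f]; a thread that
# loses the race may read a slightly stale weight, but w[f] is never
# updated incorrectly.
def read_weight(f, w, q, u, n, eta, C):
    new_w = update_catchup_l2(w[f], q[f], n - u[f], eta, C)
    if u[f] != n:      # another thread may already have caught this weight up
        u[f] = n       # publish the update time first ...
        w[f] = new_w   # ... then the weight (the order matters)
    return w[f]
\end{verbatim}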
\subsubsection{Loss Function}
In loss-augmented decoding, we find the parse with the highest score when adding the loss of the parse to the model score.
In this context, loss is a measure of the difference between a given parse and the correct parse.
To efficiently find the top scoring parse in this case we need the loss function to decompose in a way that matches our dynamic program.
The performance metric we want to optimize, F-score, cannot be decomposed.
The simplest alternative would be to use the number of incorrect arcs, as we can adjust the score when adding an arc based on whether the arc is correct or not.
For parsing without traces that would be fine, as we are making a fixed number of decisions: one arc and spine per word (\myie the denominator of the metric we are optimizing is constant).
However, when parsing with traces, if there is supposed to be a trace and we leave it out, the mistake is not penalized by the number-of-incorrect-arcs metric.
Instead we use the Hamming distance: the number of incorrect arcs plus the number of missing arcs.
As above, incorrect arcs are easy to account for.
For missing arcs, we must be careful to count each mistake exactly once.
We can be certain an arc will not be created when a deduction rule is applied that leaves one of its endpoints strictly inside a span (\myeg when two items are combined, the shared midpoint ends up in the interior of the resulting span).
To avoid counting twice (once for each end) we have a few options:
\begin{itemize}
\item When combining two halves would lead to double counting, subtract one count to compensate (at the point of combination we have all the information needed to do so)
\item Assign half to each end
\item Only count on one end (options include the left end, the right end, the parent, or the child)
\end{itemize}
\noindent
In our system we use the first approach.
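The quantity being decomposed is simple to state over whole parses (a minimal sketch; arcs are represented here as hypothetical child--parent pairs):
\begin{verbatim}
# Sketch of the Hamming distance over arcs that the dynamic program
# decomposes: incorrect predicted arcs plus gold arcs that are missing.
def hamming_loss(predicted_arcs, gold_arcs):
    predicted, gold = set(predicted_arcs), set(gold_arcs)
    incorrect = len(predicted - gold)  # arcs we added that are wrong
    missing = len(gold - predicted)    # gold arcs we failed to add (e.g. traces)
    return incorrect + missing
\end{verbatim}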
\subsection{Inference} \label{sec:inference}
The core contribution of this chapter is the inference algorithm presented in Section~\ref{sec:full-algorithm}.
The key idea behind the algorithm is to constrain the search space so that it can be explored efficiently, while still covering the structures observed in language.
Here we describe modifications at various scales that improve speed by pruning the space further.
\subsubsection{State beams}
In each cell of the chart we use a beam, discarding items based on their Viterbi inside score.
We ensure diversity by dividing each beam into a collection of sub-beams.
In all three passes, the sub-beams separate items based on their type ($N$, $L$, etc.) and the parents of each position in the item.
This subdivision enables us to avoid considering most incompatible items.
The pass with spines also includes one of the spines in the sub-beam definition for the same reason.
We tuned the beam size to prevent the gold structure from being pruned during training, with values falling between 100 and 2000.
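A sketch of the beam structure (the item fields \texttt{kind} and \texttt{parents} are hypothetical stand-ins for the properties described above):
\begin{verbatim}
# Sketch of a chart-cell beam divided into sub-beams keyed by item
# properties; only items in the same sub-beam compete with each other.
from collections import defaultdict

class CellBeam:
    def __init__(self, max_per_subbeam):
        self.max_per_subbeam = max_per_subbeam
        self.subbeams = defaultdict(list)  # key -> list of (score, item)

    def add(self, item, inside_score):
        # e.g. ("N", (3, 7)); in the final pass one spine is added to the key.
        key = (item.kind, item.parents)
        beam = self.subbeams[key]
        beam.append((inside_score, item))
        beam.sort(key=lambda entry: -entry[0])  # best Viterbi inside score first
        del beam[self.max_per_subbeam:]         # discard items beyond the beam size
\end{verbatim}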
\subsubsection{Cube pruning}
We apply the standard cube pruning approach when doing binary and ternary compositions \parencite{Chiang:2007}.
Since we are using sub-beams to determine which items are compatible, we use a heap of sub-cubes during composition.
Using fine sub-beams to avoid comparing incompatible items means that there are many of these sub-cubes, and so we also prune entire sub-cubes based on the score of their top item.
Again, we tuned how far through the cube to go when combining items, increasing the value until gold structures were no longer pruned during training; the resulting values range from 500 to 8000.
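For reference, the standard cube-pruning pattern for a binary composition looks like the sketch below; our implementation additionally maintains a heap of sub-cubes, one per compatible pair of sub-beams, and prunes whole sub-cubes by the score of their top item.
\begin{verbatim}
# Minimal cube-pruning sketch for a binary composition. left and right are
# lists of (score, item) sorted by decreasing score; combine(l, r) returns
# the score contribution of joining the two items; k is the cube limit.
import heapq

def cube_prune(left, right, combine, k):
    if not left or not right:
        return []
    seen = {(0, 0)}
    first = left[0][0] + right[0][0] + combine(left[0][1], right[0][1])
    heap = [(-first, 0, 0)]
    results = []
    while heap and len(results) < k:
        neg_score, i, j = heapq.heappop(heap)
        results.append((-neg_score, left[i][1], right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # push the two neighbours
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                s = left[ni][0] + right[nj][0] + combine(left[ni][1], right[nj][1])
                heapq.heappush(heap, (-s, ni, nj))
    return results
\end{verbatim}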
\subsubsection{Coarse to Fine Pruning}
Rather than parsing immediately with the full model we use several passes with progressively richer structure \parencite{Goodman:1997}:
\begin{enumerate}
\item Projective parser without traces or spines, and simultaneously a trace classifier
\item Non-projective parser without spines, and simultaneously a spine classifier
\item Full structure
\end{enumerate}
Each pass prunes using max-marginals from the preceding pass and scores from the preceding classifier.
The third pass also prunes spines that are not consistent with at least one unpruned edge from the second pass.
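The pruning criterion itself is simple, as in the sketch below; \texttt{max\_marginal} is a hypothetical function giving the score of the best parse containing an item, and the margin is tuned in the same way as the beams.
\begin{verbatim}
# Sketch of coarse-to-fine pruning: keep an item from the coarser pass for
# the next pass only if the best parse containing it (its max-marginal)
# scores within a tuned margin of the best parse overall.
def prune(items, max_marginal, best_score, margin):
    return [item for item in items if max_marginal(item) >= best_score - margin]
\end{verbatim}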
\subsubsection{Inside--Outside Calculations}
There are two general algorithm classes for parsing that our algorithm can be used within: Viterbi and Inside--Outside.
The first of these finds the optimal structure for a sentence under a given model and in the process determines the optimal substructure for every span of the sentence.
While that is sufficient for parsing, it does not provide the information we need for pruning.
Instead we use the inside--outside algorithm, which computes for every item either the sum or max over all parses that include that item.
When using our algorithm with a model that only places weights on the edges, calculation of the scores is straightforward.
Each edge exists in only one place in the derivation, between the item without it and the item with it.
Once spines are introduced the situation changes because we would like to score them in all of the items they appear in.
This scoring is important for the beams and cube pruning to be effective; if we only scored spines in one of the items they appear in there would be many ties.
For the projective algorithm case each spine is introduced in exactly two items in the derivation, and so we can simply assign half the score to each.
For the non-projective version the spine may appear in more locations because it needs to be introduced when we add external points.
To correctly calculate the score, and also have effective pruning, we add the complete score every time the spine is introduced and then subtract the score when two items with a spine in common are combined.
\subsubsection{Algorithm rule pruning}
\begin{algorithm}
\vspace{-2mm}
\input{rules-pruned}
\vspace{-10mm}
\caption{\label{fig:rules-to-prune}
Full dynamic program with rules unseen in training boxed and colored.
}
\end{algorithm}
\begin{algorithm}
\vspace{-2mm}
\input{rules-subset}
\vspace{-8mm}
\caption{\label{fig:rules-pruned}
Pruned dynamic program, containing only the rules observed in the training set, with tighter parent constraints.
}
\end{algorithm}
%%% 0 356191
%%% 1 423587
%%% 2 1230
%%% 3 493
%%% 4 3756
%%% 6 1
%%% 7 87825
%%% 8 1367
%%% 9 1976
%%% 10 2197
%%% 21 369
%%% 22 4626
%%% 29 2756
%%% 31 1160
%%% 33 5539
%%% 34 800
%%% 35 6185
%%% 36 342
%%% 45 1250
%%% 46 123
Many structures that can be generated by our dynamic program are not seen in the data we consider.
To improve speed, we leave out all rules that are not used in the derivations of sentences in the training set\footnote{
Note that this pruning occurs after the parses have been modified to be \oneEC structures, \myie trace edges that created cycles or made non-\oneEC crossings have been removed.
}.
Of the 49,292 specific rules in the algorithm, only 158 are needed to generate all sentences in the training set.
Narrowing down to these does have implications for coverage, but looking at the development set we found only one rule in one parse that was not in the set of 158.
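Operationally, this pruning amounts to recording which deduction rules fire in the gold derivations of the training set and discarding the rest (a sketch; \texttt{rule\_id} is a hypothetical identifier for an instantiated rule):
\begin{verbatim}
# Sketch of rule pruning: keep only deduction rules used in some gold
# derivation in the training data (158 of the 49,292 rules in our case).
def collect_used_rules(gold_derivations):
    used = set()
    for derivation in gold_derivations:
        for step in derivation:
            used.add(step.rule_id)
    return used

# At parse time, a rule instantiation is applied only if it is in `used`.
\end{verbatim}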
Figure~\ref{fig:rules-to-prune} shows the complete dynamic program from Figure~\ref{fig:complete-dp}, but with rules that can be completely eliminated boxed and colored blue.
We can see several properties:
\begin{itemize}
\item $B$ items are never created.
\item $L$ and $R$ items are always immediately combined with other items to create an $I$.
\item The external point can be a parent or child when an $X$ is created, but additional edges always have it as the parent, and are only added in $N$ items, not $L$ or $R$ items.
\end{itemize}
Pruning the rules in this way further constrains the space of structures that can be formed to a subset of one-endpoint crossing graphs.
This subset more closely describes the structures observed in language.
Unfortunately, it is difficult to characterize this space.
Figure~\ref{fig:rules-pruned} shows the remaining rules, with further constraints on parents based on the observed rules (these were hard to show in Figure~\ref{fig:rules-to-prune}).
One description of the space is that it is the set of structures generated by the deduction rules in Figure~\ref{fig:rules-pruned}, but that is not particularly satisfying.
In particular, it may be the case that this set of rules is impacted by variations such as the choice of head rules.
While this pruning opens new questions about the space of structures in parsing, for our purposes it is primarily a way of improving speed.
The drastic reduction in rules leads to a substantial improvement in the time required for parsing.
\section{Results}
\subsection{Algorithm Coverage} \label{sec:results-coverage}
\begin{table}
\centering
\begin{tabular}{|lrr|}
\hline
Structure Type & Acyclic & Has a cycle \\
\hline
\hline
%%% 46.74 18617 tree
%%% 26.37 10504 projective
%%% 0.87 345 projective cycle
%%% 24.06 9585 1ec
%%% 0.37 147 1ec cycle
%%% 1.55 618 other
%%% 0.04 16 other cycle
Projective Tree & \textbf{46.74\%} & - \\
Projective Graph & \textbf{26.37\%} & 0.87\% \\
One-Endpoint Crossing & \textbf{24.06\%} & 0.37\% \\
Other Graph & 1.57\% & 0.04\% \\
\hline
\end{tabular}
\caption[Number of sentences in the training set that are of each structure type.]{ \label{tab:structures}
Number of sentences in the training set that are of each structure type.
Structures that are recoverable using our algorithm are in bold.
The acyclic / has-a-cycle distinction is mutually exclusive, but the four structure types are not.
The values shown are the percentage of parses that fall into that class and not the classes above it, \myie the total for Projective Graphs is 73.11\%, but 26.37\% is shown because that is the percentage of Projective Graphs that are not also Projective Trees.
}
\end{table}
\begin{table}
\centering
\vspace{2mm}
\begin{tabular}{|lrr|}
\hline
& \multicolumn{2}{c|}{Coverage (\%)} \\
Approach & Sentences & Edges \\
\hline
\hline
Projective trees without null elements & 26.59 & 96.27 \\
\hline
Core representation (\ref{sec:rep-core}) & 76.69 & 98.47 \\
+ Head rule changes (\ref{sec:rep-head}) & 95.76 & 99.53 \\
+ Null reversal (\ref{sec:rep-other}) & 97.17 & 99.59 \\
+ Gapping shift (\ref{sec:rep-other}) & 97.66 & 99.60 \\
\hline
\end{tabular}
\caption[Coverage improvements for parts of our graph representation.]{ \label{tab:coverage}
Coverage improvements for parts of our representation.
Core uses the representation proposed in this work, with the head rules from \textcite{cck}.
Edge results are computed after removing only the edges necessary to make a parse representable (\myeg removing one edge to break a cycle).
}
\end{table}
Table~\ref{tab:structures} divides sentences in the training set of the treebank by structure type and whether a directed cycle is present or not.
Structures that are recoverable using our algorithm are bolded.
The structures we consider cover almost all sentences, while projective trees, the standard output of parsers, account for less than half of sentences.
In Table~\ref{tab:coverage} we show the impact of design decisions for our representation.
The percentages indicate how many sentences in the training set are completely recoverable by our algorithm.
Each row shows the outcome of an addition to the previous row, starting from no traces at all, going to our representation with the head rules of \textcite{cck}, then changing the head rules, reversing null-null edges, and changing the target of edges in gapping.
The largest gain comes from changing the head rules, which is unsurprising since \textcite{cck}'s rules were designed for trees, where any set of rules produces valid structures.
\subsection{Problematic Structures}
To understand what structures are still not covered by our approach we manually inspected twenty examples that contained a cycle and twenty examples where the structure did not satisfy the one-endpoint-crossing property.
The reason these are a problem is that our algorithm can only generate graphs that do not contain cycles and that satisfy the \oneEC property.
Adapting to cycles may be possible by adjusting the algorithm, further separating the enforcement of parent relations and the enforcement of connectivity and crossing properties, but this is beyond the scope of this work.
Meanwhile, the \oneEC property is a fundamental assumption that we use to construct the algorithm, in the same way that the tree property is an assumption enabling \textcite{eisner:1996}'s algorithm.
For the cycles, eleven of the cases related to sentences containing variations of NP \emph{said} interposed between two parts of a single quote.
A cycle was present because the top node of the parse was co-indexed with a null argument of \emph{said}, while \emph{said} was itself an argument of the head word of the quote; together these edges form a cycle.
The remaining cases were all instances of pseudo-attachment, which the treebank uses to show that non-adjacent constituents are related \parencite{ptb-guide}.
These cases were split between use of Expletive (5) and Interpret Constituent Here (4) traces\footnote{
For Expletive: ``When a clausal subject is postposed, expletive it appears in the structural subject position.
Characteristic of it-extraposition is that the final clause can replace it.'' -- Section 17 of the PTB annotation guidelines \parencite{ptb-guide}. \\
For Interpret Constituent Here: ``Used to indicate a relationship of constituency between elements separated by intervening material.
For instance, *ICH*-attach is used in `heavy shift' constructions when the movement results in a configuration in which it is impossible to attach the constituent to the phrase it belongs with.'' -- Section 5.4 of the PTB annotation guidelines \parencite{ptb-guide}.
}.
%%% 3 EXP
%%% 5 ICH
%%% 2 NP says
%%% 1 Gapping
%%% 1 list
%%% 8 other
For the cases where the parse structure did not satisfy the one-endpoint-crossing property, it was more difficult to identify trends.
The same three phenomena, Expletive, Interpret Constituent Here, and NP \emph{said}, accounted for half of the issues.
Of the rest, most involved a set of crossing arcs with no clear way to avoid the crossings by adjusting head rules.
\subsection{Parsing Performance}
We implemented a proof-of-concept system to get preliminary results using this algorithm and representation.
We used only first-order features, \myie features do not consider pairs of edges.
Code for both the algorithm and conversion to and from our representation is available (see Appendix~\ref{chp:resources}).
First we considered the standard parsing metric for trees.
After one training pass through sections $2$--$21$ of the PTB on sentences up to length $40$, we get an F-score of $88.26$ on section $22$.
This is lower than other systems, including the \textcite{cck} parser, which scores $92.0$ on all sentences.
However, our result does show that even with simple features and limited training our algorithm can parse at a non-trivial level of accuracy.
\paragraph{Speed}
Pruning thresholds for the arc pass and the trace pruner were tuned to retain $99\%$ of the gold non-trace edges.
With that setting, the first pass prunes all but $0.302\%$ of possible edges, and $53.0\%$ of chart cells.
The trace pruner prunes all but $6.0\%$ of traces.
The threshold for the spine pruner is tuned to retain $99.5\%$ of the gold spines, at which level it prunes all but $6.5\%$ of all spines.
%%%The second pass prunes all but TODO trace edges, and TODO chart cells. % TODO: use new results
%%%The second pass prunes all but $0.2\%$ of trace edges, and $64.3\%$ of chart cells. % TODO: use new results
%%%However, at this threshold while $99\%$ of non-trace edges are retained, only $88.8\%$ of trace edges are retained. % TODO
With these settings, parsing took $4.3$ seconds per sentence for sentences up to length $40$.
\paragraph{Accuracy}
For full graph parsing we considered sentences up to length $40$ ($92.3\%$ of the treebank).
On section 22, for unlabeled trace edges, we obtain a precision of $88.3\%$ and a recall of $50.4\%$.
Using \textcite{Johnson:2002}'s metric, which requires the label and span of the source and target nodes in the parse to be correct, we get precision $64\%$ and recall $48\%$.
This is lower than \textcite{Johnson:2002}'s results ($73$ and $63$ on all sentences in section 23).
However, given the non-local properties considered by Johnson's patterns, it would be difficult for our model to do as well.
One potential future direction is to use a forest reranker to incorporate such features, an option that is feasible for our approach and would avoid the constrained parser output Johnson's approach relies on.
Nonetheless, we have shown that our algorithm can recover trace edges, and we expect that accuracy will improve with feature development and longer training.
% TODO: get section 23 results