\chapter{Introduction}\label{introduction}
\ac{TRA} is a supervised learning problem which aims to identify the events and their participants in a sentence in order to determine \textit{``Who did what to whom"}. In other words, it assigns roles to the words (arguments) in a sentence with respect to a verb (predicate). The roles typically include an agent, an object, a recipient, etc. For example, in the sentence \textit{``the dog that gave the rat to the cat was hit by the man"}, the first noun \textit{`dog'} is the agent of the verb \textit{`gave'} and the object of the verb \textit{`hit'}. In \ac{NLP}, a similar problem is studied under the name of \ac{SRL}. SRL is a form of shallow semantic parsing which aims to determine the predicate-argument structure for each verb (predicate) in a given sentence \cite{end-to-end}. Understanding the semantics of a text is an important intermediate step in a wide range of real-world applications such as machine translation \cite{srl:machine_translation}, information extraction \cite{srl:info_extraction:hri}, sentiment analysis \cite{srl:sentiment:wang}, document categorization \cite{srl:text_categorization:persson} and human-robot interaction \cite{tra:xavier_hri,srl:info_extraction:hri}.
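For illustration, the predicate-argument structure of this example sentence can be summarized as follows (the notation is chosen only for readability; the actual output encoding used in this work is introduced later):
\begin{verbatim}
gave(agent=dog, object=rat, recipient=cat)
hit(agent=man, object=dog)
\end{verbatim}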
\section{Previous Work}
Many successful traditional systems treat \acs{SRL} as a multiclass classification problem and use linear classifiers such as Support Vector Machines (SVMs) to tackle it \cite{Koomen:2005,srl:pradhan:2004,pradhan:2005}. These systems rely on pre-defined feature templates derived from syntactic information obtained by parsing the sentences of a training corpus. However, it was found that relying on a syntactic parser can degrade the prediction of thematic roles \cite{pradhan:2005}, because the parsers themselves are not always accurate. Moreover, designing feature templates requires many heuristics and is a time-consuming process. The pre-defined features often have to be modified iteratively depending on how the system performs with the selected features, and the templates typically have to be re-designed when the task is carried out on other languages or corpora, or when the data distribution changes \cite{end-to-end}.
To avoid manual feature engineering, the \acs{SRL} task has also been tackled with neural network models. Collobert et al. \cite{srl:collobert:2011} first attempted to build an end-to-end system without parsing, using a neural model composed of a word embedding layer, a \ac{CNN} and a \ac{CRF}. The model was less successful, however, because a \acs{CNN} cannot capture long-term dependencies within a sentence, as it only takes into account words in a limited context \cite{end-to-end}. To increase the model's performance, they therefore also resorted to syntactic features derived from the parse trees of the Charniak parser \cite{charniak_parser:2000}.
The \ac{RNN} has also been used for a wide range of NLP tasks. Recently, the \ac{ESN}, a variant of the \acs{RNN}, has been applied to such tasks as well. These tasks are diverse, ranging from predicting the next word given the previous words to learning grammatical structures, and, recently, the TRA task \cite{esn:learn_gs}. An \acs{RNN} makes use of sequential information and acts as a memory that captures the information processed in the past \cite{rnn:elman:1990}. \acs{ESN}s, on the other hand, have several advantages over simple \acs{RNN}s. First, when processing long sequences, gradients do not vanish or explode in \acs{ESN}s \cite{esn:practical_guide}, whereas simple \acs{RNN}s suffer from the vanishing gradient problem \cite{rnn:gradiant_problem:bengio}. Second, unlike simple \acs{RNN}s, \acs{ESN}s are computationally cheap, as the recurrent layer (reservoir) of an \acs{ESN} is randomly initialized and only the connections from the recurrent layer to the read-out layer are learned \cite{esn:NIPS:2003, esn:practical_guide}. These advantages make the \acs{ESN} an excellent choice for the TRA task.
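To make this concrete, a typical leaky-integrator \acs{ESN} formulation (following the practical guide of \cite{esn:practical_guide}; the exact variant used in this work is described later) updates its reservoir state $\mathbf{x}(t)$ for an input $\mathbf{u}(t)$ as
\[
\mathbf{x}(t) = (1-\alpha)\,\mathbf{x}(t-1) + \alpha \tanh\!\left(\mathbf{W}^{in}\,\mathbf{u}(t) + \mathbf{W}\,\mathbf{x}(t-1)\right), \qquad \mathbf{y}(t) = \mathbf{W}^{out}\,\mathbf{x}(t),
\]
where $\alpha$ is the leak rate, the input weights $\mathbf{W}^{in}$ and the recurrent reservoir weights $\mathbf{W}$ are generated randomly and kept fixed, and only the readout weights $\mathbf{W}^{out}$ are learned, typically by ridge regression, $\mathbf{W}^{out} = \mathbf{Y}\mathbf{X}^{\top}(\mathbf{X}\mathbf{X}^{\top} + \beta\mathbf{I})^{-1}$, with $\mathbf{X}$ and $\mathbf{Y}$ collecting the reservoir states and target outputs over time and $\beta$ being the regularization coefficient.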
Hinaut et al. \cite{xavier:2013:RT} proposed a generic neural network architecture, the $\theta$RARes model, based on the reservoir computing approach, namely the Echo State Network, to solve the TRA task. The proposed architecture models language acquisition in the brain and provides a robust and scalable implementation on a robotic architecture \cite{xavier:2013:RT,tra:xavier_hri}. The model is based on the notion of grammatical construction: the mapping of word order (surface form) to meaning. The raw sentences are first transformed by replacing the semantic words (nouns, verbs, etc.) with a unique token (`SW' in this case). The transformed sentences, encoded in a localist fashion, are then input to the model sequentially, word by word over time, along with the coded meaning (i.e. the thematic roles of the semantic words) of the input sentence for training. The model learns the thematic roles of all the semantic words in the input sentences during training; during testing, it predicts the coded meaning of previously unseen sentences. The $\theta$RARes model is discussed in more detail in Chapter \ref{issues}.
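For illustration, a minimal sketch of this preprocessing is given below (in Python, using NumPy). The list of function words and the helper names are assumptions made for the example only and do not reproduce the actual implementation of Hinaut et al. \cite{xavier:2013:RT}.
\begin{verbatim}
import numpy as np

# Closed-class (function) words are kept; every other word is treated as a
# semantic word and replaced by the token 'SW'.  The word list below is an
# illustrative assumption.
FUNCTION_WORDS = {"the", "that", "to", "was", "by"}

def to_grammatical_form(sentence):
    return ["SW" if w not in FUNCTION_WORDS else w
            for w in sentence.lower().split()]

def localist_encoding(tokens, vocabulary):
    """One-hot (localist) vector per token: one active unit per word type."""
    vectors = np.zeros((len(tokens), len(vocabulary)))
    for t, token in enumerate(tokens):
        vectors[t, vocabulary.index(token)] = 1.0
    return vectors

sentence = "the dog that gave the rat to the cat was hit by the man"
tokens = to_grammatical_form(sentence)
# ['the', 'SW', 'that', 'SW', 'the', 'SW', 'to', 'the', 'SW', 'was', 'SW',
#  'by', 'the', 'SW']
X = localist_encoding(tokens, sorted(set(tokens)))  # one row per input word
\end{verbatim}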
\section{Motivation and Hypothesis}
As described earlier, the $\theta$RARes model does not process raw sentences but sentences transformed into an abstract grammatical form. This transformation, replacing the semantic words with an `SW' token, makes it possible to train the $\theta$RARes model on a small corpus, as the `SW' token can be replaced with different semantic words (nouns, verbs, etc.) to form several sentences \cite{xavier:2013:RT}. However, the `SW' token itself does not carry any semantics and thus does not allow the model to take the semantics of the words into account. Like many other traditional NLP systems, Hinaut et al. \cite{xavier:2013:RT} treated the words as discrete atomic symbols and used a localist vector representation of words as the input to the $\theta$RARes model. Treating each word as a discrete symbol provides the model with no information about the relations that may exist between words. For example, if the words \textit{`pink'} and \textit{`red'} are represented with the localist vectors [1,0] and [0,1] respectively, then the semantic relationship between these two words (i.e. both are colors) is lost \cite{w2v:tensor_flow}. Thus, transforming a sentence into its grammatical form and encoding it with a localist representation deprives a system of the semantic information present in the words. This could make it difficult for the model to learn the thematic roles of sentences. The possible limitations of this model are discussed in more detail in Chapter \ref{issues}.
Mikolov et al. \cite{w2v:mikolov_2013_distributed} proposed a neural model, widely known as the Word2Vec model, to learn distributed representations of words. The Word2Vec model learns high-quality, low-dimensional vector representations of words from a large corpus in an unsupervised way. It learns the word embeddings by taking into account the context (neighboring) words. The resulting word vectors encapsulate syntactic and semantic information about the words. The obtained word embeddings also capture several language regularities and patterns \cite{w2v:language_similarities, w2v:mikolov_2013_distributed}, which can be observed by performing linear operations on the word vectors. For example, \textit{vector(`king') - vector(`man') + vector(`woman') $\approx$ vector(`queen')}. Unlike other neural network models for obtaining word embeddings \cite{w2v:glove, srl:collobert:2008, word_vec:turian:2010, word_vec:hinton:2009}, training Word2Vec is fast, computationally cheap and efficient \cite{w2v:mikolov_2013_distributed}. The training of the Word2Vec model and the properties of the resulting distributed word embeddings are discussed in more detail in Chapter \ref{basics}.
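As an illustration, the listing below sketches how such embeddings can be obtained with the gensim implementation of Word2Vec; the gensim~4 API and the toy corpus are assumptions made for the example, whereas in this work the model is trained on a large corpus such as Wikipedia.
\begin{verbatim}
from gensim.models import Word2Vec

# Toy corpus standing in for a large unlabeled corpus (e.g. Wikipedia).
corpus = [["the", "dog", "gave", "the", "rat", "to", "the", "cat"],
          ["the", "man", "hit", "the", "dog"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1)      # sg=1 selects the skip-gram model

vec_dog = model.wv["dog"]                # 100-dimensional vector for 'dog'

# With a model trained on a sufficiently large corpus, regularities such as
# king - man + woman ~ queen can be probed with:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
\end{verbatim}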
This work is motivated by the limitations of localist word representations and of the transformation of raw sentences into an abstract grammatical form described above. To overcome these limitations, this thesis hypothesizes that processing raw sentences and using distributed representations of words as the input to the ESN could enhance performance on the TRA task. The distributed word vectors are learned from the context words and encapsulate the semantic and syntactic relationships between words \cite{w2v:mikolov_2013_distributed}. Exposing the ESN to this syntactic and semantic information could allow the network to learn these regularities and hence improve its performance on the TRA task.
\section{Proposed Models}
To test the hypothesis, this thesis proposes an end-to-end neural language model, the Word2Vec-$\theta$RARes model, as an extension of the $\theta$RARes model for thematic role assignment. The Word2Vec-$\theta$RARes model is a combination of the Word2Vec model and an \acs{ESN}. The Word2Vec model is trained on a general-purpose unlabeled dataset (e.g. Wikipedia) before the Word2Vec-$\theta$RARes model is used for the TRA task. The model receives the raw sentences sequentially and processes a sentence word by word over time. The Word2Vec unit, the first unit in the model, receives the input words and generates the distributed word embeddings. The generated word vectors are then used by the \acs{ESN} to learn the thematic roles of the input sentences. Unlike the $\theta$RARes model, the Word2Vec-$\theta$RARes model processes raw sentences, and distributed word vectors are used instead of a localist word representation as the input to the \acs{ESN}.
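A highly simplified sketch of this data flow is given below; the function and variable names are hypothetical and only illustrate the intended pipeline (embedding lookup followed by a leaky-integrator reservoir update and a linear readout), not the actual implementation described in Chapter \ref{approach}.
\begin{verbatim}
import numpy as np

def run_sentence(sentence, w2v, W_in, W, W_out, alpha=0.3):
    """Feed a raw sentence word by word through the reservoir and read out
    the role activations at every time step (illustrative sketch only)."""
    x = np.zeros(W.shape[0])             # reservoir state
    outputs = []
    for word in sentence.lower().split():
        u = w2v.wv[word]                 # distributed word embedding
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ u + W @ x)
        outputs.append(W_out @ x)        # readout: role activations
    return np.array(outputs)
\end{verbatim}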
Apart from the Word2Vec-$\theta$RARes model, this thesis also proposes a Word2Vec-ESN classifier. The classifier differs from the Word2Vec-$\theta$RARes model only in its input and output and in the way the sentences are processed and the results are evaluated. Like the Word2Vec-$\theta$RARes model, the classifier processes the sentences sequentially. The input features of the Word2Vec-ESN classifier are the current word (argument) and the predicate with respect to which it is processed; a sentence is therefore processed as many times as there are verbs in it. Unlike the $\theta$RARes model, whose output units encode the possible thematic roles of all semantic words in the input sentence, the output units of the Word2Vec-ESN classifier encode the possible role (e.g. Predicate, Agent, Object, Recipient or No Role) of the input argument-predicate pair. For the evaluation of the Word2Vec-ESN classifier, the metric proposed for the CoNLL-2004 \cite{conll:2004} and CoNLL-2005 \cite{conll:2005} \acs{SRL} shared tasks is used. Both the Word2Vec-$\theta$RARes language model and the Word2Vec-ESN classifier are discussed in more detail in Chapter \ref{approach}.
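The construction of the classifier's input can be sketched as follows; the names and the simple feature layout (concatenation of the two word vectors) are assumptions made for the illustration, while the actual feature encoding is given in Chapter \ref{approach}.
\begin{verbatim}
import numpy as np

ROLES = ["Predicate", "Agent", "Object", "Recipient", "No Role"]

def argument_predicate_pairs(words, verbs, w2v):
    """The sentence is processed once per verb; each (word, predicate) pair
    yields one input vector for the classifier (illustrative sketch)."""
    pairs = []
    for verb in verbs:
        for word in words:
            features = np.concatenate([w2v.wv[word], w2v.wv[verb]])
            pairs.append(((word, verb), features))
    return pairs
\end{verbatim}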
Using the proposed models, this thesis will address the following questions:
\begin{enumerate}
\item \textbf{Generalization}: How well does a model predict the thematic roles of sentences not used for training?
\item \textbf{Model Size}: How is the performance of a model affected by its size (i.e. the reservoir size)?
\item \textbf{Scalability}: How is the performance of a model affected by an increase in corpus size?
\item \textbf{Robustness}: Can a model perform the TRA task on a corpus that was not used for learning the model's parameters?
\item \textbf{Distributed Vector Dimensions}: How is the performance of a model affected when distributed word vectors of different dimensions are used?
\item \textbf{Real-time Processing}: How do the readout activations of the model evolve while a sentence is being parsed?
\end{enumerate}
\section{Scope and Novelty}
There are several neural-network-based models for obtaining word embeddings \cite{w2v:glove, srl:collobert:2008, word_vec:turian:2010, word_vec:hinton:2009}, but a systematic comparison of them on the TRA task is beyond the scope of this work. To date, no research has been conducted with this combination of the Word2Vec model and an \acs{ESN}, which also makes this study novel.
\section{Outline}
This thesis is organized as follows. Chapter \ref{basics} describes the basics of the Word2Vec model and the Echo State Network and how they are trained. The chapter also describes the properties of the word vectors obtained by training the Word2Vec model, as well as the control parameters of the \acs{ESN}. Chapter \ref{issues} reviews related work with a primary focus on the $\theta$RARes language model and its limitations, and revisits the motivation and hypothesis for the current work. Subsequently, Chapter \ref{approach} proposes the Word2Vec-$\theta$RARes model and the Word2Vec-ESN classifier; it also describes the implementation and the metrics used to evaluate the performance of the proposed models. Chapter \ref{experiments} first describes the corpora used for the experiments and then the necessary initialization of the proposed models before they are used for the TRA task. Chapter \ref{results} describes the experiments performed and the results obtained with the proposed models on the TRA task, and compares these results with those of the $\theta$RARes model. Finally, Chapter \ref{conclusion} concludes this study and makes suggestions for possible future work.