\chapter{Material and Methods}\label{experiments}
This chapter presents the corpora used in this research work and the configuration required to use the Word2Vec-$\theta$RARes model and the Word2Vec-ESN classifier for the TRA task. The chapter is organized as follows. First, we give an overview of the corpora used to carry out the experiments. Then, we describe the initialization of the proposed models, which is necessary before they can be used for the TRA task.
\section{Corpora and Pre-processing}\label{corpora}
This section describes the corpora used to perform the experiments with the proposed models for the TRA task. The section also describes the corpus used to train the Word2Vec model.
\subsection{Corpora for the TRA task}
In order to have a fair comparison between the Word2Vec-$\theta$RARes model and the $\theta$RARes model on the TRA task, the same corpora\footnote{https://sites.google.com/site/xavierhinaut/downloads} used by Hinaut et al. \cite{xavier:2013:RT, tra:xavier_hri} with the $\theta$RARes model were used. Thus, corpus-45, corpus-373, corpus-462 and corpus-90582, containing 45, 373, 462 and 90582 sentences respectively, were used to perform the experiments with the proposed models.
\paragraph{Corpus-45:} Corpus-45 is a small corpus of 45 sentences. The corpus contains sentences in grammatical form, i.e., `N' and `V' tokens are used to represent the nouns and the verbs, respectively \cite{xavier:2013:RT}. The corpus also includes the coded meaning of each sentence. The sentences have different grammatical structures, i.e., active, passive, object-relative and subject-relative, and are represented in the form:
\begin{enumerate}[noitemsep]
\item the N V the N. \# N1-A1 N2-O1
\item the N was V by the N. \# N1-O1 N2-A1
\end{enumerate}
The first part of the construction (before the `$\#$' token) is the sentence in its grammatical form. The coded meaning of the sentence is given after the `$\#$' token. The code `N1-A1' is interpreted as: the first noun is the agent of the first verb. The symbols `O' and `R' are used for the roles object and recipient, respectively.
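As an illustration, a minimal sketch (in Python; the helper function is hypothetical and not part of the corpus tooling) of how such a construction can be split into the grammatical sentence and its coded meaning:
\begin{verbatim}
# Minimal sketch: split a corpus-45 construction into the grammatical
# sentence and its coded meaning (hypothetical helper).

def parse_construction(line):
    sentence, coded = line.split('#')
    # Each code such as 'N1-A1' reads: the 1st noun is the agent (A)
    # of the 1st verb.
    roles = [tuple(code.split('-')) for code in coded.split()]
    return sentence.strip(), roles

sentence, roles = parse_construction("the N V the N. # N1-A1 N2-O1")
print(sentence)  # the N V the N.
print(roles)     # [('N1', 'A1'), ('N2', 'O1')]
\end{verbatim}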
\paragraph{Corpus-462 and Corpus-90582:} The sentences in corpus-462 and corpus-90582 were generated by Hinaut et al. \cite{xavier:2013:RT} using a context-free grammar for the English language and used for the TRA task. Each sentence in these corpora has verbs that take one, two, or three clause elements. For example, the sentences ``The man \textit{jumps}", ``The boy \textit{cuts} an apple" and ``John \textit{gave} the ball to Marie" have verbs with one (agent), two (agent and object) and three (agent, object and recipient) clause elements, respectively. The sentences in the corpora have a maximum of four nouns and two verbs \cite{xavier:2013:RT}. A maximum of one relative clause can be present in a sentence; the verb in the relative clause can take one or two clause elements (i.e., without a recipient), for example, ``The dog that bit the cat chased the boy". Both corpus-462 and corpus-90582 have constructions of the form:
\begin{enumerate}[noitemsep]
\item walk giraffe $<\!o\!>$ AP $<\!/o\!>$ ; the giraffe walk -s . \# [`the', `X', `X', `-s', `.']
\item cut beaver fish, kiss fish girl $<\!o\!>$ APO , APO $<\!/o\!>$ ; the beaver cut -s the fish that kiss -s the girl. \# ['the', 'X', 'X', '-s', 'the', 'X', 'that', 'X', '-s', 'the', 'X', '.']
\end{enumerate}
Each construction in the corpora is divided into four parts. The first part (before `$<\!o\!>$') describes the meaning of the sentence using the semantic words (or open-class words) in the order predicate, agent, object, recipient. The second part (between `$<\!o\!>$' and `$<\!/o\!>$') gives the order of the thematic roles of the semantic words as they appear in the raw sentence. The third part (between `;' and `\#') contains the raw sentence with verb inflections (e.g., `-s'), and the fourth part is the abstract representation of the sentence, with the semantic words removed and replaced by the token `X' \cite{xavier:2013:RT}.
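As an illustration, the following sketch (Python, using plain \texttt{<o>} delimiters; the helper is hypothetical) splits such a construction into its four parts:
\begin{verbatim}
# Hypothetical helper: split a corpus-462 / corpus-90582 construction into
# its four parts (meaning, role order, raw sentence, abstract form).

def split_construction(line):
    meaning, rest = line.split('<o>')
    role_order, rest = rest.split('</o>')
    raw_sentence, abstract = rest.split('#')
    return (meaning.strip(), role_order.strip(),
            raw_sentence.strip(' ;'), abstract.strip())

parts = split_construction(
    "walk giraffe <o> AP </o> ; the giraffe walk -s . "
    "# ['the', 'X', 'X', '-s', '.']")
print(parts[0])  # walk giraffe          (predicate, agent)
print(parts[2])  # the giraffe walk -s .
\end{verbatim}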
Corpus-90582 has 90582 sentences along with the coded meaning of each sentence. The corpus is redundant: it contains multiple sentences with different grammatical structures but the same coded meaning and sentence meaning (see fig. \ref{fig:meaning_realtions}). In total, there are only 2043 distinct coded meanings \cite{xavier:2013:RT}. This corpus has the additional property that, along with sentences with complete coded meanings, it also includes sentences with incomplete meanings. For example, the sentence ``The Ball was given to the Man" has no \textit{`Agent'}, and thus the meaning of the sentence is ``given(-, ball, man)". The corpus also contains $5268$ pairs and $184$ triplets of ambiguous sentences, i.e., 10536 and 552 sentences respectively. Thus, in total, $12.24 \%$ (i.e., $5268 \times 2 + 184 \times 3 = 11088$) of the sentences are ambiguous, i.e., they share the same grammatical structure but have different coded meanings \cite{xavier:2013:RT}.
\paragraph{Corpus-373:} Apart from corpus-462 and corpus-90582, which were artificially constructed using a context-free grammar, corpus-373, containing real sentences collected from humans in natural language, was also used. Corpus-373 includes the instructions gathered from participants interacting with a humanoid robot (iCub) in a Human-Robot Interaction (HRI) study of language acquisition conducted by Hinaut et al. \cite{tra:xavier_hri}. In the study, the robot first performs one or two actions by placing the available objects at a location (e.g., left, right, middle) while the participant observes the actions. The participant is then asked to instruct the robot, in natural language, to perform the same actions again. The corpus thus contains 373 simple or complex instructions to perform single or double actions with temporal correlation (see the Action Performing task in Hinaut et al. \cite{tra:xavier_hri} for more details). For example, the instruction ``Point to the guitar" is a single-action command, whereas the instruction ``Before you put the guitar on the left put the trumpet on the right" is a complex command with two actions, where the second action is specified before the first. A list of 86 closed-class words, used to filter the semantic words from the sentences, is also provided with this corpus \cite{tra:xavier_hri}. Also, unlike corpus-462 and corpus-90582, this corpus does not contain verb inflections. This corpus is therefore complex enough to test the learnability and generalization ability of the proposed models on the TRA task.
\paragraph{Pre-processing:} As described earlier, unlike the $\theta$RARes model, the proposed models process raw sentences. Thus, the sentences in corpus-45 were manually pre-processed by replacing the tokens `N' and `V' with appropriate nouns and verbs such that the coded meaning of the sentence is not changed. For example, the construction ``the N V the N" was changed to ``the man pushed the ball". Recall that corpus-462 and corpus-90582 contain sentences in which the verbs are represented along with inflections (suffixes ``-s", ``-ed", ``-ing"). Therefore, the sentences in corpus-462 and corpus-90582 were processed to obtain raw sentences without verb inflections. First, all the words were lowercased, and then the verbs with inflections were replaced by conjugated verbs\footnote{Service used to obtain verb conjugations: \text{http://www.scientificpsychic.com/cgi-bin/verbconj2.pl}}. The verb conjugation to be used depends on the inflection attached to the verb. For example, the sentences ``The giraffe walk -s" and ``John eat -ed the apple" were changed to ``The giraffe walks" and ``John ate the apple", respectively. This pre-processing was done because the distributed word representations learned by the Word2Vec model already capture the syntactic relations that were previously imposed using verb inflections, e.g., \textit{vector(`walks') - vector(`walk') $\approx$ vector(`talks') - vector(`talk')}.
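The sketch below illustrates this pre-processing step. The thesis used an external conjugation service; here a small hand-made lookup table stands in for it, so the snippet only illustrates the procedure and is not the actual pipeline.
\begin{verbatim}
# Simplified pre-processing sketch: lowercase the sentence and merge verb
# inflections into conjugated forms. CONJUGATIONS is a stand-in for the
# external conjugation service used in the actual pre-processing.
CONJUGATIONS = {('walk', '-s'): 'walks', ('eat', '-ed'): 'ate',
                ('cut', '-s'): 'cuts', ('kiss', '-s'): 'kisses'}

def remove_inflections(sentence):
    out = []
    for tok in sentence.lower().split():
        if tok in ('-s', '-ed', '-ing') and out:
            # replace "verb -s" by its conjugated form if known
            out[-1] = CONJUGATIONS.get((out[-1], tok), out[-1])
        else:
            out.append(tok)
    return ' '.join(out)

print(remove_inflections("The giraffe walk -s ."))   # the giraffe walks .
print(remove_inflections("John eat -ed the apple"))  # john ate the apple
\end{verbatim}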
\subsection{Corpus for training Word2Vec model}
To train the Word2Vec model, a general-purpose dataset, the Wikipedia corpus\footnote{https://dumps.wikimedia.org/enwiki/latest/} ($\approx$ 14 GB), was used to obtain the low-dimensional distributed embeddings of the words. The corpus contains 2,658,629 unique words. The reason for choosing the Wikipedia dataset to train the model is that it contains most of the words used in the English language, so a vector representation of almost every English word can be obtained. Moreover, the Word2Vec model does not produce good-quality word vectors when trained on a small corpus; a general-purpose dataset with billions of words is required to obtain better word embeddings. Once the vector representations of the words in the Wikipedia dataset are obtained, they can also be updated by training the model further on a domain-specific dataset (corpus-462, corpus-90582, etc.), biasing the embeddings towards the domain-specific data.
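As a rough illustration, assuming a locally downloaded dump file named \texttt{enwiki-latest-pages-articles.xml.bz2} and a recent version of the Gensim library, the Wikipedia dump can be converted into plain text roughly as follows:
\begin{verbatim}
# Sketch: extract plain text from the Wikipedia dump with Gensim's
# WikiCorpus reader (file names are assumptions).
from gensim.corpora import WikiCorpus

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
with open('wiki_text.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():         # one article per iteration
        out.write(' '.join(tokens) + '\n')  # lower-cased, markup-free text
\end{verbatim}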
\section{Experimental Setup}
\subsection{Obtaining distributed word embeddings} \label{get_word_embeddings}
Before using the proposed models for the TRA task, the distributed embeddings of the words must be learned by the Word2Vec model. Thus, the Word2Vec model is first trained on the Wikipedia dataset using skip-gram negative sampling (see Chapter \ref{basics} for more details). Skip-gram negative sampling was used because it has been reported to accelerate the training of the model, and the resulting word vectors perform better on the word analogy task \cite{w2v:mikolov_2013_efficient, w2v:mikolov_2013_distributed}.
To obtain distributed word vectors of different dimensions, six different Word2Vec models were trained, with hidden layers of 20, 30, 50, 100, 200 and 300 neurons respectively; each model thus learns distributed word vectors of a dimensionality equal to the size of its hidden layer. All six models were trained with the same hyperparameters, as follows. A standard context window of $\pm 5$ was used; a bigger context window could improve the quality of the word vectors, but at the cost of computation time \cite{w2v:mikolov_2013_distributed}. A negative-sampling size of $5$ was chosen, i.e., five noise words that do not appear in the context of the current input word are chosen randomly from the vocabulary. Additionally, the most frequent words are discarded with a probability given by equation \ref{eqn:subsampling}, using a subsampling threshold of $t = 10^{-5}$. All words appearing fewer than $5$ times in the corpus were also ignored before training the model. To update the network weights, stochastic gradient descent was used \cite{w2v:parameter_learning, w2v:mikolov_2013_distributed} with an initial learning rate of $0.025$, which decays linearly to $0.0001$ as training progresses. These parameters were chosen following the heuristics discussed in \cite{w2v:mikolov_2013_efficient, w2v:mikolov_2013_distributed}.
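The configuration described above corresponds roughly to the following Gensim call (a sketch; parameter names follow recent Gensim versions, where the embedding size is called \texttt{vector\_size}, and the file names are assumptions):
\begin{verbatim}
# Sketch of the training setup: one skip-gram negative-sampling model per
# embedding size, trained with the hyperparameters listed above.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('wiki_text.txt')   # pre-processed Wikipedia text
for dim in (20, 30, 50, 100, 200, 300):     # one model per hidden-layer size
    model = Word2Vec(sentences,
                     vector_size=dim,            # embedding dimension
                     sg=1, negative=5,           # skip-gram, 5 noise words
                     window=5,                   # +/-5 context window
                     sample=1e-5,                # sub-sampling threshold
                     min_count=5,                # ignore rare words
                     alpha=0.025, min_alpha=0.0001)  # linearly decaying rate
    model.save('w2v_%d.model' % dim)
\end{verbatim}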
The word embeddings obtained by training the Word2Vec model on the Wikipedia dataset are accurate enough to capture semantic relationships between words, e.g., vec(`Paris') - vec(`France') + vec(`Germany') $\approx$ vec(`Berlin'). Before the model is trained on the Wikipedia corpus, a vocabulary of the words in the corpus is created. Once the vocabulary is created and the model is trained, it is not possible to add new words to this vocabulary. However, when a domain-specific corpus (e.g., corpus-462, corpus-373, etc.) is used to train the Word2Vec model further, some of its words may not be present in the previously generated vocabulary, and due to this limitation it would not be possible to obtain distributed embeddings for these new words. To facilitate online training of the Word2Vec model, the vocabulary of the model therefore needs to be updated when words are not already present in it. Unfortunately, neither the C++ implementation\footnote{https://code.google.com/archive/p/word2vec/} nor the Gensim Python API \cite{w2v:gensim_api} of the Word2Vec model supported updating the vocabulary once created. Online training\footnote{The code is adapted from: http://rutumulkar.com/blog/2015/word2vec} of the Word2Vec model was therefore implemented by modifying and extending the Gensim API. New words that were not present in the existing vocabulary can now be added and initialized with random weights, and the model can then be trained in the usual manner to learn their distributed embeddings. Although the vocabulary can now be updated in an online manner, the vector embeddings of the newly added words will be of poor quality if their count in the new corpus is low. Thus, to give the domain-specific corpus more impact on the word embeddings and to improve the quality of the word vectors, the domain-specific dataset was repeated several times before training the model\footnote{Idea discussed at: https://groups.google.com/forum/$\#$!topic/gensim/Z9fr0B88X0w}.
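The thesis implemented this by extending the Gensim API by hand; more recent Gensim versions expose an equivalent mechanism, sketched below (the corpus file name and the repetition factor are assumptions).
\begin{verbatim}
# Sketch of online vocabulary update and continued training with a recent
# Gensim version; build_vocab(..., update=True) adds unseen words and
# initializes their vectors randomly.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec.load('w2v_50.model')
domain = list(LineSentence('corpus462_raw.txt')) * 100  # repeat small corpus

model.build_vocab(domain, update=True)
model.train(domain, total_examples=len(domain), epochs=model.epochs)
\end{verbatim}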
With this online version of Word2Vec training, training of the Word2Vec model can be resumed on the domain-specific corpora (corpus-373, corpus-462 and corpus-90582). While training the model on the new dataset, no word was discarded, irrespective of its count in the corpora; this provides vector embeddings for all the words in the domain-specific corpora. Once the Word2Vec model is trained, the generated word embeddings are normalized using the L2 norm before being used for the TRA task. One important point to remember is that if an already trained model has to be trained further on another corpus, the word vectors should not be normalized, as it is not possible to continue training the model with normalized word vectors \cite{w2v:gensim_api}. The trained Word2Vec model is now ready to be used with the Word2Vec-$\theta$RARes model and the Word2Vec-ESN classifier.
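For completeness, the L2 normalization amounts to dividing each word vector by its Euclidean norm; a small NumPy sketch (with a random matrix standing in for the learned embeddings):
\begin{verbatim}
# Sketch: L2-normalize the final embedding matrix (rows = word vectors).
# This is applied only once all training has finished.
import numpy as np

vectors = np.random.randn(1000, 50)   # stand-in for the learned embeddings
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms        # each row now has unit L2 norm
\end{verbatim}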
\subsection{ESN reservoir initialization}
The size of the reservoir is one of the critical parameters of an ESN, and it is often recommended to use as big a reservoir as can be computationally afforded, provided that an appropriate regularization method is used \cite{esn:practical_guide}. Thus, for the proposed models, a reservoir of 1000 leaky-integrator neurons with a \textit{tanh} activation function was used (unless specified otherwise). The input-to-hidden ($W^{in}$) and hidden-to-hidden ($W^{res}$) weights were randomly and sparsely generated from a normal distribution with mean 0 and variance 1. For the TRA task, the optional output-to-reservoir weights ($W^{back}$) were not used. The reservoir state update equations \ref{eqn:res_update} and \ref{eqn:res_state} thus change to:
\begin{equation} \label{eqn:res_new_update}
x'(n) = \tanh\left( W^{res}x(n-1) + W^{in}u(n) \right)
\end{equation}
\begin{equation} \label{eqn:res_new_state}
x(n) = (1-\alpha)\, x(n-1) + \alpha\, x'(n)
\end{equation}
To generate the sparse weights, fixed fan-out numbers of $F_{hh} = 10$ and $F_{ih} = 2$ were used, i.e., each reservoir neuron was randomly connected to $10$ other reservoir neurons, and each input neuron was randomly connected to only $2$ reservoir neurons. Using a fixed fan-out makes the computational cost of the reservoir state update scale linearly with the reservoir size \cite{esn:practical_guide}. After the weight initialization, there are several other reservoir parameters that have to be optimized and are task-specific. In the next subsection, we will see how these parameters are optimized.
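A NumPy sketch of this initialization and of the state update in equations \ref{eqn:res_new_update} and \ref{eqn:res_new_state} is given below (variable names and the random seed are illustrative only):
\begin{verbatim}
# Sketch: sparse reservoir initialization with fixed fan-out and one
# leaky-integrator state update step.
import numpy as np

N, K = 1000, 50            # reservoir size, input dimension (word vector)
F_hh, F_ih = 10, 2         # fan-out: recurrent / input connections
rng = np.random.default_rng(0)

W_res = np.zeros((N, N))
for i in range(N):         # each reservoir neuron feeds 10 random neurons
    targets = rng.choice(N, F_hh, replace=False)
    W_res[targets, i] = rng.normal(0.0, 1.0, F_hh)

W_in = np.zeros((N, K))
for j in range(K):         # each input neuron feeds only 2 reservoir neurons
    targets = rng.choice(N, F_ih, replace=False)
    W_in[targets, j] = rng.normal(0.0, 1.0, F_ih)

def update(x, u, alpha=0.07):
    # x(n) = (1 - alpha) x(n-1) + alpha tanh(W_res x(n-1) + W_in u(n))
    return (1.0 - alpha) * x + alpha * np.tanh(W_res @ x + W_in @ u)

x = np.zeros(N)
x = update(x, rng.normal(size=K))   # state after one (random) input vector
\end{verbatim}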
\subsection{Learning ESN parameters} \label{grid_search}
Echo State Networks have several parameters that need to be optimized for the proposed models to perform well on the TRA task. Some of the parameters, such as the reservoir size, sparsity and distribution of non-zero weights, are straightforward to set \cite{esn:practical_guide}, whereas others, such as the Spectral Radius (SR), Input Scaling (IS) and Leak Rate (LR or $\alpha$), are task-dependent and often tuned over multiple trials. To identify these parameters for the TRA task, a grid search over the parameter space was performed using five reservoir instances (i.e., reservoir weights initialized with five different random generator seeds). As the parameter search space can be very large, a broad grid with wide parameter ranges was explored first to find the optimal region: the region with low sentence error for the Word2Vec-$\theta$RARes model and high F1-score for the Word2Vec-ESN classifier. The optimal region identified during the broad grid search was then explored with a narrower search to determine the optimal parameters \cite{esn:practical_guide}. As the two proposed models (Word2Vec-$\theta$RARes and Word2Vec-ESN classifier) process sentences differently and have different training objectives, the ESN parameters were optimized separately for each model. Additionally, the Word2Vec-$\theta$RARes model parameters were optimized separately for the SCL and SFL modes. In this section, we describe the parameter optimization on corpus-462, as most of the experiments were performed using this corpus.
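The two-stage search can be summarized by the following sketch, in which \texttt{evaluate\_model} is a placeholder for a cross-validated run of one of the proposed models (it is not part of any library and is stubbed out here with a random score):
\begin{verbatim}
# Sketch of the coarse-then-fine grid search over SR, IS and LR, averaged
# over five reservoir instances (random seeds).
import itertools
import numpy as np

def evaluate_model(sr, i_s, lr, seed):
    # Placeholder: in the real setup this trains/tests one reservoir
    # instance with 10-fold cross-validation and returns the error.
    return np.random.default_rng(seed).random()

def grid_search(sr_range, is_range, lr_range, seeds=range(5)):
    best, best_err = None, np.inf
    for sr, i_s, lr in itertools.product(sr_range, is_range, lr_range):
        err = np.mean([evaluate_model(sr, i_s, lr, s) for s in seeds])
        if err < best_err:
            best, best_err = (sr, i_s, lr), err
    return best, best_err

# broad search first; a narrower grid is then placed around the best region
coarse_best, _ = grid_search(np.arange(0.5, 3.1, 0.5),
                             np.arange(0.5, 3.1, 0.5),
                             np.arange(0.05, 0.55, 0.05))
\end{verbatim}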
\paragraph{Word2Vec-$\theta$RARes model:} To optimize the reservoir parameters of this model, the 462 sentences of corpus-462 with the topologically modified coded meaning (see fig. \ref{fig:w2v_esn_nv}) were used. The choice of the modified coded meaning was deliberate and will be made clear in Experiment-6. To find the optimal parameters of the Word2Vec-$\theta$RARes model (i.e., the parameters resulting in the lowest sentence error), the model was trained and tested over a range of parameters using 10-fold cross-validation. Corpus-462 was randomly split into ten equally sized subsets (i.e., each subset with $\approx$ 46 sentences). The model was trained on the sentences of nine subsets and then tested on the remaining subset. This process was repeated ten times, such that the model was tested on each subset exactly once. A reservoir of 1000 neurons and a fixed regularization coefficient of $\beta = 1\mathrm{e}{-6}$ for ridge regression were used. By exploring the parameter space, the optimal parameters were identified as $SR = 2.4$, $IS = 2.5$ and $LR = 0.07$ in SCL mode, and $SR = 2.2$, $IS = 2.3$ and $LR = 0.13$ in SFL mode.
\paragraph{Word2Vec-ESN classifier:} For the Word2Vec-ESN classifier, the optimal parameters (i.e., the parameters resulting in the highest F1-score) were identified by a grid search on the same corpus-462, again using 10-fold cross-validation. Keeping all other conditions (i.e., reservoir size and regularization coefficient) identical to those of the Word2Vec-$\theta$RARes model described above, the optimal parameters were identified as $SR = 0.7$, $IS = 1.15$ and $LR = 0.1$. In Experiment-2, the performance of the Word2Vec-ESN classifier will also be tested on sentences transformed into grammatical form (i.e., semantic words replaced with the `SW' token) and represented in a localist fashion, so optimal parameters were also determined for this transformed corpus. Keeping all other conditions the same, a grid search over the parameter space identified the optimal parameters as $SR = 1.3$, $IS = 1.0$ and $LR = 0.4$.
\subsection{Input and Output coding}
As mentioned in Chapter \ref{approach}, the Word2Vec-$\theta$RARes model and the Word2Vec-ESN classifier process sentences differently. Therefore, the input and output coding for the two models also differs. However, the initialization of the Word2Vec model and the ESN reservoir weights remains the same for both models.
\paragraph{Word2Vec-$\theta$RARes model:}
A raw sentence is presented to the model, and each word in the sentence is processed, word by word across time, by both the Word2Vec model and the ESN. The Word2Vec model outputs a word vector of dimension $E_{v} = 50$, which is then used as input to the ESN; the input layer of the ESN therefore has $50$ neurons. For all the experiments on corpus-45, corpus-462 and corpus-90582 (Experiments 1-5 in Chapter \ref{results}), an equivalent but topologically modified coded meaning (see section \ref{sec:model_variant}) was used. The readout layer of the ESN contains 24 $(4 \times 3 \times 2)$ neurons, as the corpus contains sentences having a maximum of 4 nouns, each of which can take 3 possible roles (Agent, Object, and Recipient) with respect to a maximum of 2 verbs. Each neuron in the readout layer thus codes for the role of one noun with respect to one verb. Corpus-90582 contains a maximum of 5 nouns, so in that case the size of the readout layer is 30 $(5 \times 3 \times 2)$.
For Experiment-6, the topologically modified coding was not used. The readout layer then contains 36 $(6 \times 3 \times 2)$ neurons, as there is a maximum of 6 semantic words in this corpus and each can take one of 3 possible thematic roles (Object, Predicate, Location) with respect to a maximum of two actions \cite{tra:xavier_hri}. Output neurons have an activation of 1 if the corresponding role is present in the sentence, and -1 otherwise.
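As an illustration of this output coding, the sketch below (the indexing scheme is hypothetical but consistent with the $4 \times 3 \times 2$ coding used for Experiments 1-5) turns a coded meaning into the $\pm 1$ teacher vector of the readout layer:
\begin{verbatim}
# Sketch: build the 24-dimensional teacher vector for the readout layer,
# one neuron per (noun, role, verb) combination.
import numpy as np

ROLES = ('A', 'O', 'R')        # agent, object, recipient
N_NOUNS, N_VERBS = 4, 2

def neuron_index(noun, role, verb):
    # neuron coding "noun #noun has role ROLE w.r.t. verb #verb" (0-based)
    return (noun * len(ROLES) + ROLES.index(role)) * N_VERBS + verb

def target_vector(coded_meaning):
    y = -np.ones(N_NOUNS * len(ROLES) * N_VERBS)
    for noun, role, verb in coded_meaning:
        y[neuron_index(noun, role, verb)] = 1.0
    return y

# "the man pushed the ball": noun 0 is agent of verb 0, noun 1 its object
print(target_vector([(0, 'A', 0), (1, 'O', 0)]))
\end{verbatim}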
\paragraph{Word2Vec-ESN classifier:}
Recall that in the Word2Vec-ESN classifier, a raw sentence is presented to the model, and each word (argument), along with the verb (predicate) with respect to which the word is currently processed, is input to the model across time (see section \ref{sec:model_variant}). A sentence is thus processed as many times as there are verbs in it. The Word2Vec model takes this argument-predicate pair as input and outputs a vector of dimension $2 \times E_{v} = 100$, which is then used as input to the ESN. The input layer of the ESN therefore contains $100$ neurons, where the first $50$ neurons encode the vector representation of the word and the remaining $50$ neurons code for the verb with respect to which the word is being processed. Unlike in the Word2Vec-$\theta$RARes model, the size of the readout layer always remains the same: it contains five neurons, each coding for one role, Predicate (P), Agent (A), Object (O), Recipient (R) or No Role (XX), for both corpus-462 and corpus-90582. Readout neurons of the ESN have an activation of 1 if the input word-verb (argument-predicate) pair has the corresponding role, and -1 otherwise.
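To make this coding concrete, a small sketch of the classifier's input and output coding follows (the model file name matches the hypothetical one used in the earlier sketches, and the example word-verb pair is illustrative):
\begin{verbatim}
# Sketch: concatenate the word vector with the vector of the verb it is
# processed against (100-dimensional ESN input) and build the +/-1 teacher
# vector over the five roles.
import numpy as np
from gensim.models import Word2Vec

ROLES = ('P', 'A', 'O', 'R', 'XX')    # predicate, agent, object,
                                      # recipient, no role
w2v = Word2Vec.load('w2v_50.model').wv

def input_vector(word, verb):
    return np.concatenate([w2v[word], w2v[verb]])

def target_vector(role):
    y = -np.ones(len(ROLES))
    y[ROLES.index(role)] = 1.0
    return y

x = input_vector('ball', 'gave')      # word 'ball' processed w.r.t. 'gave'
y = target_vector('O')                # 'ball' is the object of 'gave'
\end{verbatim}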