\documentclass[thesis.tex]{subfiles}
\begin{document}
\chapter{Implementation}
In this chapter we present the implementation of our algorithm, which can be found in the \textit{GraphGenome} tool (Appendix \ref{sec:tool}). The algorithm is accompanied by an explicit example found in figures \ref{fig:explicit_contexts}-\ref{fig:candidate_nodes} and table \ref{fig:scoring_arrays}. Throughout this chapter, and in the example case, the scoring schema is assumed to be what we have called the \textit{negated edit distance} schema. The schema takes its values from the regular edit distance problem\footnote{match=0, mismatch=1, gap opening and extension penalty=1} and negates them. The reason for the negation is that the tool is implemented for generalized scoring schemas which attempt to maximize alignment scores. Otherwise, the scoring schema is chosen for its intuitive nature: a final score of $-x$ means there are exactly $x$ differences. The schema is also practical for complexity analysis: a gap which is penalized by $y$ means traversing exactly $y$ vertices or indexes.\\
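\par\noindent
As a concrete point of reference, the following is a minimal Python sketch of this schema. The function names are illustrative only and do not correspond to the tool's actual interface.
\begin{verbatim}
# A minimal sketch of the negated edit distance schema: edit distance
# costs are negated so that higher scores are better.
def mapping_score(a: str, b: str) -> int:
    return 0 if a == b else -1      # match = 0, mismatch = -1

def gap_penalty(length: int) -> int:
    return -length                  # opening and extension cost 1 each
\end{verbatim}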
\par\noindent
The chapter is divided into two main sections. The first explains the implementation of the algorithm corresponding to the previous chapter. The second is a brief overview of how the tool merges sequences into the graph after they have been aligned. This section is included to provide a better intuition of how the graphs are built, and thus of what readers can expect when seeing the results in the succeeding chapters and using the tool themselves. Additionally, after the two main sections describing the steps of the algorithm, we briefly present some modifications of the implementation which are referenced in the remaining chapters of the thesis.
\section{Aligning sequences}
The alignment process consists of the two steps described in the previous chapter, each of which has its corresponding subsection. In addition to these, the tool needs to do a precomputation of the graph in order to build a searchable index. This is not counted as a step in the alignment process, as the precomputation depends only on the graph and the index is thus reusable for several alignments.
\subsection{Building the index}
There are two data structures needed for aligning a string against the graph: a suffix trie for left contexts and a suffix trie for right contexts. Before either of the two is built, the algorithm needs to decide a length for the contexts. In the tool there are two ways of setting the context length: a user-given parameter or an approximation based on the probability of sharing contexts (Appendix \ref{sec:birthday_problem}). The length of the contexts does not impact the quality of the alignments found by the algorithm (shown in the proof in Appendix \ref{sec:proof}) but does impact the runtime (calculations can be found in Appendix \ref{sec:complexity_analysis}).\\
\par\noindent
When a context length $|c|$ is set, the algorithm can start building the index. Two sets of strings, a left context set and a right context set, are generated for every vertex in the graph $G$. Because the algorithms for the two are identical, aside from the starting point and the direction of the traversal, the following explanation only describes one of them.\\
\par\noindent
The generation of the left contexts starts by adding an empty context $\epsilon$ to the context set $c(v_i)$ for every vertex $v_i$ which is a neighbour of $s_G$, and inserting these vertices in a regular FIFO queue. The contexts are stored in an array of string sets where the indexes correspond to the indexes of the vertices. The algorithm marks $s_G$ as finished in a boolean table and starts iterating over the elements of the queue.\\
\par\noindent
When a vertex $v_x$ is popped from the front of the queue, the first operation consists of checking whether the context set of the vertex is complete. A context set is complete if every vertex in the incoming neighbour set $n_i(v_x)$ is marked as finished. If the context set is not complete, the vertex reinserts itself at the end of the queue. In the opposite case, when a vertex is deemed ready, it starts generating contexts for its outgoing neighbours $n_o(v_x)$. The vertex takes every context $c$ belonging to its own context set $c(v_x)$ and creates a new context $c'$ by prepending the base $b(v_x)$ to the string. If necessary, when the length of a new context exceeds $|c|$, the trailing character of the string is also removed. Each of these newly created contexts is added to the context set of every outgoing neighbour $v_y \in n_o(v_x)$, and $v_y$ is added to the queue. Every vertex should be enqueued at most once to avoid an exponential growth in queue size; this is enforced through efficient lookup of currently enqueued vertices in a hash set. The final step for the vertex is marking itself as finished.\\
\begin{center}
\begin{algorithm}[!t]
\begin{mdframed}
\While{queue does not consist solely of $t_G$}{
current = queue.pop\\
\If{not all incoming neighbours are finished}{
enqueue current;\\
continue;
}
\For{every context $c \in c(current)$}{
\For{all outgoing neighbours}{
suffixes[outgoing.index].put(trim($b(current)$ + $c$, $|c|$));\\
\If{outgoing not in queue}{
enqueue outgoing;
}
}
}
finished[current.index] = True;
}
\end{mdframed}
\caption{The loop which generates left contexts for a graph}
\end{algorithm}
\end{center}
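\par\noindent
The loop above can be made concrete with a short Python sketch. The graph representation (a map from vertex index to a triple of base, incoming neighbours and outgoing neighbours) is illustrative only, not the tool's actual data model.
\begin{verbatim}
from collections import deque

def generate_left_contexts(graph, s_index, t_index, c_len):
    # graph: {index: (base, incoming, outgoing)}
    contexts = {v: set() for v in graph}
    finished = {v: False for v in graph}
    queue, enqueued = deque(), set()
    finished[s_index] = True
    for v in graph[s_index][2]:        # neighbours of s_G get epsilon
        contexts[v].add("")
        queue.append(v)
        enqueued.add(v)
    while not (len(queue) == 1 and queue[0] == t_index):
        current = queue.popleft()
        enqueued.discard(current)
        if current == t_index or \
           not all(finished[u] for u in graph[current][1]):
            queue.append(current)      # not ready; requeue at the end
            enqueued.add(current)
            continue
        base = graph[current][0]
        for c in contexts[current]:
            new_c = (base + c)[:c_len]  # prepend base, trim to |c|
            for out in graph[current][2]:
                contexts[out].add(new_c)
                if out not in enqueued:
                    queue.append(out)
                    enqueued.add(out)
        finished[current] = True
    return contexts
\end{verbatim}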
\par\noindent
In order to avoid making the end vertex $t_G$ mappable, the algorithm always puts it back at the end of the queue whenever it is seen. When the queue contains a single element, and this element is $t_G$, the algorithm halts. At this point every vertex along every path leading up to $t_G$ has generated its contexts, and as we know every vertex stems from an input sequence and every input sequence ends in $t_G$, we know every vertex has been visited. $s_G$ and $t_G$ then swap places as start and end vertices, the definitions of incoming and outgoing neighbours are swapped, and the algorithm runs again to generate right contexts.\\
\par\noindent
As sets by definition do not allow duplicates, the impact of a branching occurring in the graph will fade away after exactly $|c|$ steps. At this point the difference is trimmed away (Figure \ref{fig:explicit_contexts}), and exponentiality in the context set sizes is avoided. Unlike the method of Paten et al. \cite{mapping_to_a_reference_genome_structure} there is no requirement for contexts to be uniquely mappable to exactly one vertex. Because the last step of the algorithm does an exhaustive search for an incomplete path through all the candidate vertices, this presents no difficulties when finding the alignment. Furthermore, dropping this precondition ensures every node has two valid contexts and is thus present in both suffix trees.\\
\begin{figure}[t]
\begin{center}
\begin{mdframed}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3.5cm]
\node[state,align=center,minimum size=3cm,scale=0.6] (q0) {\footnotesize{\{"$G$", "$T$"\}}\\\Large{$A$}\\\footnotesize{\{"$TA$"\}}\\\scriptsize{$2$}};
\node[state,align=center,minimum size=3cm,scale=0.6] [below left of=q0] (q1) {\footnotesize{$\{\epsilon\}$}\\\Large{$G$}\\\footnotesize{\{"$AT$"\}}\\\scriptsize{$6$}};
\node[state,align=center,minimum size=3cm,scale=0.6] [above left of=q0] (q2) {\footnotesize{$\{\epsilon\}$}\\\Large{$T$}\\\footnotesize{\{"$AT$"\}}\\\scriptsize{$1$}};
\node[state,align=center,minimum size=3cm,scale=0.6] [right of=q0] (q3) {\\\footnotesize{\{"$AG$", "$AT$"\}}\\\Large{$T$}\\\footnotesize{\{"$AG$"\}}\\\scriptsize{$3$}};
\node[state,align=center,minimum size=3cm,scale=0.6] [right of=q3] (q4) {\footnotesize{\{"$TA$"\}}\\\Large{$A$}\\\footnotesize{\{"$G$"\}}\\\scriptsize{$4$}};
\node[state,align=center,minimum size=3cm,scale=0.6] [right of=q4] (q5) {\footnotesize{\{"$AT$"\}}\\\Large{$G$}\\\footnotesize{$\{\epsilon\}$}\\\scriptsize{$5$}};
\node[state,align=center] [left=1.5cm of q0] (q6) {$s$\\\scriptsize{0}};
\node[state,align=center] [right=0.3cm of q5] (q7) {$e$\\\scriptsize{-1}};
\path
(q1) edge node {} (q0)
(q2) edge node {} (q0)
(q0) edge node {} (q3)
(q3) edge node {} (q4)
(q4) edge node {} (q5)
(q6) edge node {} (q1)
edge node {} (q2)
(q5) edge node {} (q7);
\end{tikzpicture}
\end{mdframed}
\end{center}
\caption[A small, explicit reference graph]{A small reference graph with left contexts (top) and right contexts (bottom) of length 2 shown}
\label{fig:explicit_contexts}
\end{figure}
\par\noindent
After generating the two context sets for every node, the elements of each are inserted into the corresponding suffix tree. Every suffix is stored as a key with the index of its originating node as a value (Figure \ref{fig:left_suffix_tree}). In theory every node can have $4^{|c|}$ contexts in each set; in practice a fairer approximation is $b^{|c|}$ where $b$ is the observed branching factor for the graph. If we assume our graphs are to a large degree linear, we can assume $b\approx1$ and approximate $b^{|c|}=1$.\footnote{Actual computations for $b$ can be found in section \ref{sec:runtime_complexity}} The current implementation uses a naive suffix tree implementation where insertion is $O(|c|)$, giving a total number of operations of $b^{|c|}|c|$ per node per context set and $2|G|b^{|c|}|c|$ for the entire graph. The generation process visits $|G|$ vertices once in each direction, which means the entire index can be built in $O(|G||c|)$\footnote{Assuming $b^{|c|}=1$. Explanations of simplifications done in big-O notation can be found \\in the bibliography\cite[Chapter 2]{data_structures_and_algorithm_analysis_in_java}\cite[Section 3.1]{algorithms_sequential_parallell_and_distributed}}.
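\par\noindent
A naive trie of this kind is only a few lines of Python. The sketch below stores each context string as a path of characters, with the set of originating vertex indexes at the final node; the class and function names are illustrative.
\begin{verbatim}
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.indexes = set()   # vertices whose context ends here

def insert(root, context, vertex_index):
    node = root
    for ch in context:         # O(|c|) per insertion
        node = node.children.setdefault(ch, TrieNode())
    node.indexes.add(vertex_index)
\end{verbatim}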
\begin{wrapfigure}{R}{0.49\textwidth}
\begin{mdframed}
\begin{tikzpicture}
\node[state,align=center,minimum size=1.4cm] (q0) {$\epsilon$\\\scriptsize{\{$1,6$\}}};
\node[state,align=center,minimum size=1.4cm] [below left=0.6cm and 0.6cm of q0] (q1) {$A$\\\scriptsize{$\emptyset$}};
\node[state,align=center,minimum size=1.4cm] [below=0.21cm of q0] (q2) {$G$\\\scriptsize{\{$2$\}}};
\node[state,align=center,minimum size=1.4cm] [below right=0.6cm and 0.6cm of q0] (q3) {$T$\\\scriptsize{\{$2$\}}};
\node[state,align=center,minimum size=1.4cm] [below left=0.6cm and 0cm of q1] (q4) {$G$\\\scriptsize{\{$3$\}}};
\node[state,align=center,minimum size=1.4cm] [below right=0.6cm and 0cm of q1] (q5) {$T$\\\scriptsize{\{$3,5$\}}};
\node[state,align=center,minimum size=1.4cm] [below=0.2cm of q3] (q6) {$A$\\\scriptsize{\{$4$\}}};
\path
(q0) edge node {} (q1)
edge node {} (q2)
edge node {} (q3)
(q1) edge node {} (q4)
edge node {} (q5)
(q3) edge node {} (q6);
\end{tikzpicture}
\end{mdframed}
\caption{The left suffix tree corresponding to the graph in \ref{fig:explicit_contexts}}
\label{fig:left_suffix_tree}
\end{wrapfigure}
\subsection{Generating the candidate graph}
Unlike the index built in the previous step, the candidate graph $G'$ is a function of the original graph $G$, the input sequence $s$ and the threshold $T$. For every character $s_x \in s$ a left-context string and a right-context string are generated by looking at the $|c|$ surrounding characters. The two context strings are used as a basis for a fuzzy search in their corresponding suffix tries, a recursive function based on PO-MSA. The root node is supplied with a one-dimensional scoring array $scores$ corresponding to the context string $c$, which is initialized with all zeroes. Then, for every child, a new scoring array $scores_b$ is computed by regular edit distance rules: for each index $i$, take the maximal score for either a gap in the graph, a gap in the string, or matching the character $c_i$ with the character $b$ contained in the child vertex:
\begin{equation}
scores_b[i] = \max \begin{dcases}
scores_b[i - 1] + gapPenalty(1) & \text{\texttt{\# String gap}}\\
scores[i] + gapPenalty(1) & \text{\texttt{\# Graph gap}}\\
scores[i - 1] + mappingScore(c_i, b) & \text{\texttt{\# Mapping}}\\
\end{dcases}
\end{equation}
An important aspect is that this procedure uses the same scoring schema as the one defined for the entire alignment. The newly created array $scores_b$ is supplied to the same recursive function in the child corresponding to the character $b$. When a leaf vertex is reached, the last index of the supplied scoring array corresponds to mapping the entire string $c$ against the entire context obtained by concatenating the characters contained in the path through the tree traversed by the recursion. If the score is higher than the context threshold $T_c$ for the given context string, every index contained in the vertex is stored as a pair of the form $\{index, score\}$. The candidate sets are implemented as maps with the index as a key, which allows us to store the index of every candidate vertex exactly once by only saving the pair which produces the highest value.\\
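\par\noindent
The following Python sketch shows the recursion over the trie from the previous sketch, filling the candidate map as described. Max score inheritance and pruning (both described below) are omitted here for brevity, and the names are again illustrative.
\begin{verbatim}
def fuzzy_search(node, c, scores, threshold, candidates):
    if node.indexes:                   # a stored context ends here
        score = scores[-1]             # whole string vs whole context
        if score >= threshold:
            for idx in node.indexes:   # keep the best score per vertex
                candidates[idx] = max(candidates.get(idx, score), score)
    for b, child in node.children.items():
        scores_b = [scores[0] + gap_penalty(1)]
        for i in range(1, len(scores)):
            scores_b.append(max(
                scores_b[i - 1] + gap_penalty(1),   # string gap
                scores[i] + gap_penalty(1),         # graph gap
                scores[i - 1] + mapping_score(c[i - 1], b)))  # mapping
        fuzzy_search(child, c, scores_b, threshold, candidates)

# usage: candidates = {}; the root array is all zeroes
# fuzzy_search(root, context, [0] * (len(context) + 1), T_c, candidates)
\end{verbatim}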
\par\noindent
In order to also be able to look up contexts which are shorter than the contexts stored in the tree, the suffix tree search has a built-in concept we call \textit{max score inheritance}. This concept allows all suffix trie vertices at a depth greater than the length of the looked-up string to inherit a maximum score from their parent vertex. This avoids deterioration of the scores, as the searched contexts no longer need to introduce gaps to align against the longer, stored contexts.\\
\par\noindent
The suffix search does one more optimization. Whenever the context is short, defined in the tool as \textit{shorter than $|c|$}, there will be a large number of matches, and the algorithm has to iterate over every single one to extract its indexes. In these cases the tool simply returns an empty set, which is treated as a set containing every single index with a maximal score. But even this seemingly simple operation has pitfalls to avoid. Whenever the provided string has a length $|s| < 2|c|+1$, the middle elements will have contexts on either side which are not searched due to their length, which yields two empty sets. To avoid this the suffix tree search has a built-in \textit{force} option, which ensures at least one set of candidate vertices is always found.\\
\par\noindent
In theory every leaf vertex has to be visited in order to check the score for every represented context in the tree. In practice the tree can be pruned by cutting off the search whenever the \textit{maximal potential score} falls below the threshold $T_c$ for the provided context. The maximal potential score for a branch is found by adding together the currently highest score in the scoring array and the maximal matching score for the remainder of the string. This reduces the number of nodes to be searched from $O(4^{|c|})$ to $O(|c|^{\lambda})$\footnote{See calculations in Appendix \ref{sec:complexity_analysis}}.\\
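\par\noindent
As a sketch, the pruning test can be added to the recursion above with a couple of lines. Under the negated edit distance schema the maximal per-character matching score is 0, so the maximal potential score is simply the highest entry in the scoring array.
\begin{verbatim}
def max_potential_score(scores, chars_left, max_match=0):
    # best current score plus a perfect match of the remaining string
    return max(scores) + chars_left * max_match

# inside fuzzy_search, before recursing into a child at depth d:
#     if max_potential_score(scores_b, len(c) - d) < threshold:
#         continue                    # branch can never reach T_c
\end{verbatim}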
\begin{wrapfigure}{L}{0.5\textwidth}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3.5cm]
\node[state,draw=none,align=center] (q0) {$A$\\\scriptsize{LC=$\epsilon$}\\\scriptsize{RC="$TA$"}};
\node[state,draw=none,align=center] [right=-0.15cm of q0] (q1) {$T$\\\scriptsize{LC="$A$"}\\\scriptsize{RC="$A$"}};
\node[state,draw=none,align=center] [right=-0.15cm of q1] (q2) {$A$\\\scriptsize{LC="$TA$"}\\\scriptsize{RC=$\epsilon$}};
\node[state,align=center] [below=0cm of q0] (q3) {$A$\\\scriptsize{$2$}};
\node[state,align=center] [below=0cm of q1] (q4) {$T$\\\scriptsize{$3$}};
\node[state,align=center] [below=0cm of q2] (q5) {$A$\\\scriptsize{$4$}};
\end{tikzpicture}
\end{mdframed}
\caption{$\lambda=0$}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3.5cm,scale=0.6]
\node[state,draw=none,align=center] (q0) {$A$\\\scriptsize{LC=$\epsilon$}\\\scriptsize{RC="$TA$"}};
\node[state,draw=none,align=center] [right=-0.15cm of q0] (q1) {$T$\\\scriptsize{LC="$A$"}\\\scriptsize{RC="$A$"}};
\node[state,draw=none,align=center] [right=-0.15cm of q1] (q2) {$A$\\\scriptsize{LC="$TA$"}\\\scriptsize{RC=$\epsilon$}};
\node[state,align=center] [below=-0.04cm of q0] (q3) {$T$\\\scriptsize{$1$}};
\node[state,align=center] [below=0.3cm of q3] (q4) {$A$\\\scriptsize{$2$}};
\node[state,align=center] [below=0.3cm of q4] (q5) {$T$\\\scriptsize{$3$}};
\node[state,align=center] [below=0.3cm of q5] (q6) {$G$\\\scriptsize{$6$}};
\node[state,align=center] [below=0cm of q1] (q7) {$T$\\\scriptsize{$1$}};
\node[state,align=center] [below=0.3cm of q7] (q9) {$T$\\\scriptsize{$3$}};
\node[state,align=center] [below=0.3cm of q9] (q11) {$G$\\\scriptsize{$5$}};
\node[state,align=center] [below=0.3cm of q11] (q12) {$G$\\\scriptsize{$6$}};
\node[state,align=center] [below=-0.07cm of q2] (q13) {$T$\\\scriptsize{$3$}};
\node[state,align=center] [below=0.3cm of q13] (q14) {$A$\\\scriptsize{$4$}};
\node[state,align=center] [below=0.3cm of q14] (q15) {$G$\\\scriptsize{$5$}};
\end{tikzpicture}
\end{mdframed}
\caption{$\lambda=1$}
\end{subfigure}
\caption[Candidate vertices for aligning a string against \ref{fig:explicit_contexts}]{The resulting candidate sets for mapping the string ``ATA'' against the reference graph from fig.~\ref{fig:explicit_contexts} with varying $\lambda$ values}
\label{fig:candidate_nodes}
\end{wrapfigure}
\par\noindent
After the fuzzy search is concluded there are two maps of candidates for every index, one containing the vertices matching the left context and an equivalent for vertices matching the right context. The key sets of these maps are intersected to produce a final candidate set for the index $i$, where the score is computed by adding together the scores from the two original sets. During the intersection process the final set can again be pruned by removing all vertices which have a combined score lower than the combined threshold $T_c$ for both contexts. If one of the sets is empty, due to pruning in the suffix tree search, we simply emulate the intersection by keeping the non-empty set. Formally we can define the new vertex set $V'$ as an ordered list of sets $V'_i$ for $0 \leq i < |s|$ where
\begin{equation}
V'_i=\{v_x|v_x \in G \land \exists [c \in c(v_x)](contextScore(c, c(s_i)) \geq T_{c(s_i)})\}
\end{equation}
where $contextScore$ is a regular 2-dimensional alignment score.\\
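\par\noindent
A sketch of the intersection in Python, with the candidate maps from the fuzzy search as input; as described above, an empty map is treated as matching everything and is emulated by keeping the non-empty map.
\begin{verbatim}
def combine(left, right, threshold):
    if not left:
        return dict(right)
    if not right:
        return dict(left)
    combined = {}
    for idx in left.keys() & right.keys():   # intersect the key sets
        score = left[idx] + right[idx]
        if score >= threshold:               # combined threshold T_c
            combined[idx] = score
    return combined
\end{verbatim}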
\par\noindent
Intuitively this should be the place where the edges are created in order to finish $G'$. There are two relationships between the vertices which should be represented: the distance between the elements in $s$ and the distance between the vertices in $G$. The first relationship is inherently contained in the indexation of the candidate sets. The second relationship is found through graph searches in $G$. Finding the weights of these edges is, however, not necessary until the involved vertices are scored in the next step, and the searches can therefore be delayed until that point. This avoids spending space on data which is only needed at a single stage of the computation.
\subsection{Searching the candidate graph}
Once we have created the graph we can start the setup for the subsequent search. We move the indexes of the vertices in the candidate sets over to a two-dimensional array $indexes$, where one dimension represents the indexes of the string and the other the individual vertices in the candidate set. This is done in order to have the vertices in a structure where we can reference them solely by an index. In addition to the $indexes$ array we create a floating point array $scores$ with exactly the same dimensions, to contain the scores for the individual vertices. Additionally, in order to backtrack and find the optimal alignment, we create an equally sized $backPointers$ array. Conceptually this should store pairs of integers referencing the other indexes in the array; in the tool this is implemented using strings.
\par\noindent
\begin{wraptable}{l}{0.5\textwidth}
\begin{mdframed}
\begin{subfigure}[t]{0.49\textwidth}
\begin{center}
\begin{tabularx}{\textwidth}{ccc}
A & T & A \\ \cline{1-3}
\multicolumn{1}{|c|}{1} & 1 & \multicolumn{1}{|c|}{3} \\ \cline{1-3}
\multicolumn{1}{|c|}{2} & 3 & \multicolumn{1}{|c|}{4} \\ \cline{1-3}
\multicolumn{1}{|c|}{3} & 5 & \multicolumn{1}{|c|}{5} \\ \cline{1-3}
\multicolumn{1}{|c|}{6} & 6 & \multicolumn{1}{|c}{ } \\ \cline{1-2}
\end{tabularx}
\caption{Indexes}
\end{center}
\end{subfigure}
\begin{subfigure}[t]{0.49\textwidth}
\begin{center}
\begin{tabularx}{\textwidth}{ccc}
A & T & A \\ \cline{1-3}
\multicolumn{1}{|c|}{-1} & -2 & \multicolumn{1}{|c|}{-2} \\ \cline{1-3}
\multicolumn{1}{|c|}{0} & 0 & \multicolumn{1}{|c|}{0} \\ \cline{1-3}
\multicolumn{1}{|c|}{-1} & -3 & \multicolumn{1}{|c|}{-2} \\ \cline{1-3}
\multicolumn{1}{|c|}{-1} & -3 & \multicolumn{1}{|c}{ } \\ \cline{1-2}
\end{tabularx}
\caption{Scores}
\end{center}
\end{subfigure}
\begin{subfigure}[b]{\textwidth}
\begin{center}
\begin{tabularx}{\textwidth}{ccc}
A & T & A \\ \cline{1-3}
\multicolumn{1}{|c|}{-1:-1} & 0:0 & \multicolumn{1}{|c|}{1:1} \\ \cline{1-3}
\multicolumn{1}{|c|}{-1:-1} & 0:1 & \multicolumn{1}{|c|}{1:1} \\ \cline{1-3}
\multicolumn{1}{|c|}{-1:-1} & 0:2 & \multicolumn{1}{|c|}{1:1} \\ \cline{1-3}
\multicolumn{1}{|c|}{-1:-1} & 0:3 & \multicolumn{1}{|c}{ } \\ \cline{1-2}
\end{tabularx}
\caption{Backpointers}
\end{center}
\end{subfigure}
\end{mdframed}
\caption[The three arrays used by the searching algorithm]{The three arrays used by the searching algorithm when using the candidate sets from figure \ref{fig:candidate_nodes}}
\label{fig:scoring_arrays}
\end{wraptable}
The search is initialized by looping over every node $v_x \in V'_0$ with a counter $j$, setting\\
\par
$indexes[0][j]$ = $x$\par
$scores[0][j]$ = $mappingScore(b(v_x), s_0)$\par
$backPointers[0][j]$ = $-1:-1$\\
\par\noindent
The iteration over the elements of the set with the counter $j$ does not follow any particular ordering, as the elements of a set are inherently unordered. This is not important to us, as the elements of a single candidate set have no internal relation which we want to preserve. However, from this point onward, we have given the elements an order. Although the internal ordering does not hold any value, this is important because we now know that the same index in the three separate arrays references the same vertex.\\
\par\noindent
After setting the values in the easily identifiable base cases we start computing scores for the paths in our graph. The nodes $v_x \in V'_i$ for the remaining candidate sets at $1 \leq i < |s|$ are looped over with $j$ as a counter, and $indexes[i][j]$ is set to $x$. For every such entry a list of pairs is made with other indexes $(i', j')$ such that $i'$ is a preceding index, $i'<i$, and $j'$ is another counter variable looping over $indexes[i']$. For every entry pair $((i, j), (i', j'))$ we produce a score by the scoring function $\theta((i, j), (i', j'))$. The scoring function works by combining the score contained in the preceding entry, $scores[i'][j']$, the gap penalties, and a mapping score for the current index, $mappingScore(v_{indexes[i][j]}, s_i)$. The gap penalty is found by combining a gap penalty for a gap of length $i-i'$ and one for a gap of length $distance(v_{indexes[i'][j']}, v_{indexes[i][j]})$. The computation of the last gap penalty corresponds to finding the edges, which up to this point have been without interest to the algorithm. We search for these distances by a regular breadth-first graph search which starts in the vertex $v_{indexes[i'][j']}$ and is concluded when we find $v_{indexes[i][j]}$. Whenever we search for more than $\lambda$ steps without finding the target vertex, we can return the current distance multiplied by 2. When a gap penalty is computed for a gap of this length, the penalty will always be $2\lambda$, which means the edge can never be part of a final alignment and is thus not interesting.\\
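\par\noindent
A sketch of the cut-off distance search in Python, using the same graph representation as the earlier sketches. The sketch returns $2\lambda$ when the target is not found within the budget, matching the guarantee described above: a penalty no legal alignment can absorb.
\begin{verbatim}
from collections import deque

def distance(graph, source, target, lam):
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        v, d = queue.popleft()
        if v == target:
            return d
        if d < lam:                    # only expand within the budget
            for u in graph[v][2]:      # outgoing neighbours
                if u not in seen:
                    seen.add(u)
                    queue.append((u, d + 1))
    return 2 * lam                     # cut off: penalized out of reach
\end{verbatim}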
\par\noindent
The final score stored in $scores[i][j]$ is the maximal achievable score produced by the function $\theta$ for one of the vertex pairs ending in the vertex with index $indexes[i][j]$. $backPointers[i][j]$ is set to the index pair $(i', j')$ responsible for producing this score. The recurrence formulas for the three arrays are thus:\\
\begin{equation}
\begin{array}{ll}
indexes[i][j] = x & v_x \in V'_i\\
scores[i][j] = \underset{i', j'}{\max}\, \theta((i, j), (i', j')) & 0 \leq i' < i, 0 \leq j' < |scores[i']|\\
backPointers[i][j] = \underset{i', j'}{\mathrm{argmax}}\, \theta((i, j), (i', j')) & 0 \leq i' < i, 0 \leq j' < |scores[i']|\\
\end{array}
\end{equation}
where $\theta$ is a scoring function defined as:\\
\begin{equation}
\begin{array}{l}
\theta((x_1, y_1), (x_2, y_2))=scores[x_2][y_2] \\\quad\quad + gapPenalty(x_1-x_2) \\\quad\quad + gapPenalty(distance(v_{indexes[x_2][y_2]}, v_{indexes[x_1][y_1]})) \\\quad\quad + mappingScore(b(v_{indexes[x_1][y_1]}), s_{x_1})
\end{array}
\end{equation}
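\par\noindent
The main loop can then be sketched as below. One assumption is made explicit in the code: adjacent elements (a step of length 1 in the string or the graph) incur no gap penalty, so the penalty is applied to the number of skipped elements.
\begin{verbatim}
def fill_tables(graph, V, s, lam):
    indexes = [list(V[i].keys()) for i in range(len(s))]
    scores = [[0.0] * len(row) for row in indexes]
    back = [[(-1, -1)] * len(row) for row in indexes]
    for j, x in enumerate(indexes[0]):           # base cases
        scores[0][j] = mapping_score(graph[x][0], s[0])
    for i in range(1, len(s)):
        for j, x in enumerate(indexes[i]):
            best, arg = float("-inf"), (-1, -1)
            for ip in range(i):                  # preceding indexes i' < i
                for jp, xp in enumerate(indexes[ip]):
                    theta = (scores[ip][jp]
                             + gap_penalty(i - ip - 1)    # skipped in s
                             + gap_penalty(               # skipped in G
                                   distance(graph, xp, x, lam) - 1)
                             + mapping_score(graph[x][0], s[i]))
                    if theta > best:
                        best, arg = theta, (ip, jp)
            scores[i][j], back[i][j] = best, arg
    return indexes, scores, back
\end{verbatim}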
\par\noindent
When the iteration ends we will have a score for every candidate vertex in every candidate set. We can iterate over the scores corresponding to the last candidate set, $scores[|s|-1]$, to find the highest alignment score. The optimal alignment ends in the vertex whose index is stored in the corresponding cell of the $indexes$ array. We can then backtrack through the $backPointers$ array, storing the corresponding indexes of the vertices from the $indexes$ array along the way, to produce the actual alignment. The entire operation can be done in $O(\dfrac{\lambda^2|s|(|c|^{\lambda}|G|)^2}{4^{|c|^2}})$.\footnote{Precise calculations are done in Appendix \ref{sec:complexity_analysis}} The exponential factor $|G|^2$ is to a large degree cancelled out by $|c|$, which by default is set algorithmically by a function dependent on $|G|$. This leaves an operation which depends chiefly on the error margin $\lambda$.
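\par\noindent
The backtracking itself is a short loop over the arrays from the previous sketch:
\begin{verbatim}
def backtrack(indexes, scores, back):
    i = len(scores) - 1                # start in the last candidate set
    j = max(range(len(scores[i])), key=lambda k: scores[i][k])
    alignment = []
    while i != -1:                     # base cases point at (-1, -1)
        alignment.append((i, indexes[i][j]))  # (string index, vertex)
        i, j = back[i][j]
    return list(reversed(alignment))
\end{verbatim}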
\subsubsection{Finding all optimal alignments}
\label{sec:all_optimal_alignments}
Only a small adjustment is necessary to produce all optimal alignments instead of a single one, but it makes the explanation of and reasoning around the procedure considerably more complex. Briefly, the first step is storing, for every index, a list of backpointers to every preceding index which produces the same, highest score. Then, when backtracking, the algorithm needs to find every maximal score in the last candidate set. For each one of these the algorithm must start a computational branch which corresponds to a final alignment. These branches are further split whenever faced with a backpointer consisting of more than one index. Every resulting branch corresponds to a single, equally scoring, optimal alignment.\\
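\par\noindent
A sketch of this enumeration, where each cell now stores a list of equally scoring backpointers and the base cases store empty lists:
\begin{verbatim}
def all_alignments(indexes, scores, back_lists):
    i = len(scores) - 1
    best = max(scores[i])
    results = []
    for j, sc in enumerate(scores[i]):
        if sc == best:                     # one branch per maximum
            stack = [((i, j), [])]
            while stack:
                (ci, cj), path = stack.pop()
                path = [(ci, indexes[ci][cj])] + path
                if not back_lists[ci][cj]: # reached a base case
                    results.append(path)
                else:                      # split on every backpointer
                    for pi, pj in back_lists[ci][cj]:
                        stack.append(((pi, pj), path))
    return results
\end{verbatim}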
\par\noindent
There exists a third variant of the problem where only uniquely aligned reads are classified as alignable. Any read which has more than one possible optimal alignment can be classified as ambiguous and dropped. Again, there is only a small modification needed to accomplish this within the implementation. Once again we allow for branches while backtracking, by storing lists of optimal backpointers. However, this time around we cut the search whenever we discover a branching in the backtracking process. A branch represents several solutions with the same score, and the read is thus classified as unalignable.
\section{Handling invalid threshold values}
Whenever the algorithm is not able to find any alignment with a score higher than $T$, it classifies the input string $s$ as unalignable. There are two scenarios where this can happen. In the first, the path yielding the highest score consists of a series of steps in which each individual step is considered legal (not penalized by more than $\lambda$), but the combination does not score high enough. The second occurs when the path goes through a step in which there are no legal possibilities for traversal, which happens when every path goes through an edge which was not found due to pruning and has been given the distance $2\lambda$. Both cases are identified through not finding any sufficiently high score, and both result in an \textit{empty alignment}: an alignment where every element of $s$ is unmapped.\\
\par\noindent
\section{Implementing the heuristics}
\label{sec:impl_heuristical}
Section \ref{sec:heuristical_conceptual} in the preceding chapter describes a modification which seeks to utilize the data found in these cases instead of returning an empty alignment. The first step of this process is to remove the validation, and consequent rejection, of alignments which do not meet the requirements of the scoring threshold. Secondly, because we now need to produce a result even when some candidate sets are empty, we need to iterate over all the scores to decide the starting point of the backtracking. This starting point should be decided as a function of the score produced by the alignment and the gap penalty for skipping any empty candidate sets at the end of the string. The most conceptually interesting property of the heuristic modification is how we handle gaps where the graph search for the distance between vertices was cut off before it produced a result. The normal algorithm penalizes these gaps just enough to put them under the threshold. When we discard the threshold, this means they are often preferred due to the low penalty. We have no data to support the choice of penalty for these gaps, but it needs to be defined. In some cases our simple measure of distance does not even hold a meaning, for instance if the two vertices are on separate branches in the graph. Finding a useful metric for such cases could prove useful for \textit{split alignments}\footnote{Input sequences where subsequences align to two distant regions in the graph}, a concept which is further discussed in Chapter \ref{sec:discussion}. We chose the default value of $gapPenalty(|G|)$ to produce alignments which are as coherent as possible. These modifications are available in the tool through the \texttt{-{}-heuristical=true} parameter.
\section{Parallelization}
\label{sec:impl_parallelization}
Of the two steps of the algorithm, one stands out as a textbook case for parallelization: searching for the candidate vertices of an index is done completely separately from the other indexes. A trivial parallelization of this step is implemented by splitting the indexes across separate threads, reachable through the \texttt{-{}-parallellization=true} parameter. Further parallelization is possible, as the recursive searches through the trees only depend on the results from the vertices higher up in the trees; there is no exchange of data between the vertices of a layer. This is not implemented in the tool, to keep the proof-of-concept presentation of the approach as conceptually simple as possible.
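\par\noindent
The implemented variant amounts to little more than mapping the per-index search over a worker pool, as in this sketch, where \texttt{find\_candidates} stands in for the per-index suffix tree search:
\begin{verbatim}
from concurrent.futures import ThreadPoolExecutor

def parallel_candidates(s, find_candidates, workers=4):
    # each string index is searched independently of the others
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(find_candidates, range(len(s))))
\end{verbatim}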
\clearpage
\section{Merging aligned sequences}
\label{sec:merge}
\begin{wrapfigure}{r}{0.5\textwidth}
\begin{mdframed}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tabularx}{\linewidth}{ccccc}
& T & C & A & G \\ \cline{2-5}
T & \multicolumn{1}{|c|}{0} & -1 & \multicolumn{1}{|c|}{-2} & \multicolumn{1}{c|}{-3} \\ \cline{2-5}
C & \multicolumn{1}{|c|}{-1} & 0 & \multicolumn{1}{|c|}{-1} & \multicolumn{1}{c|}{-2} \\ \cline{2-5}
C & \multicolumn{1}{|c|}{-2} & -1 & \multicolumn{1}{|c|}{-1} & \multicolumn{1}{c|}{-2} \\ \cline{2-5}
G & \multicolumn{1}{|c|}{-3} & -2 & \multicolumn{1}{|c|}{-2} & \multicolumn{1}{c|}{-1} \\ \cline{2-5}
\end{tabularx}
\end{center}
\end{mdframed}
\caption{The dynamic programming table for aligning the two strings ``TCAG'' and ``TCCG'' using negated edit distance as a scoring schema}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=1.3cm]
\node[state,scale=0.7] (q0) {$s$};
\node[state,scale=0.7] [above right of=q0] (q1) {$T$};
\node[state,scale=0.7] [right of=q1] (q2) {$C$};
\node[state,scale=0.7] [right of=q2] (q3) {$A$};
\node[state,scale=0.7] [right of=q3] (q4) {$G$};
\node[state,scale=0.7] [below right of=q0] (q5) {$T$};
\node[state,scale=0.7] [right of=q5] (q6) {$C$};
\node[state,scale=0.7] [right of=q6] (q7) {$C$};
\node[state,scale=0.7] [right of=q7] (q8) {$G$};
\node[state,scale=0.7] [above right of=q8] (q9) {$e$};
\path
(q0) edge node {} (q1)
edge node {} (q5)
(q1) edge node {} (q2)
(q2) edge node {} (q3)
(q3) edge node {} (q4)
(q4) edge node {} (q9)
(q5) edge node {} (q6)
(q6) edge node {} (q7)
(q7) edge node {} (q8)
(q8) edge node {} (q9);
\end{tikzpicture}
\end{center}
\end{mdframed}
\caption{The result of aligning and merging the sequences with $T=0$}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=1.3cm]
\node[state,scale=0.7] (q0) {$s$};
\node[state,scale=0.7] [right of=q0] (q1) {$T$};
\node[state,scale=0.7] [right of=q1] (q2) {$C$};
\node[state,scale=0.7] [above right of=q2] (q3) {$A$};
\node[state,scale=0.7] [below right of=q2] (q4) {$C$};
\node[state,scale=0.7] [below right of=q3] (q5) {$G$};
\node[state,scale=0.7] [right of=q5] (q6) {$e$};
\path
(q0) edge node {} (q1)
(q1) edge node {} (q2)
(q2) edge node {} (q3)
edge node {} (q4)
(q3) edge node {} (q5)
(q4) edge node {} (q5)
(q5) edge node {} (q6);
\end{tikzpicture}
\end{center}
\end{mdframed}
\caption{The result of aligning and merging the sequences with $T=-1$}
\end{subfigure}
\end{mdframed}
\caption{Different scoring thresholds $T$ yield different reference graphs}
\label{fig:separate_paths}
\end{wrapfigure}
Whenever a graph is made from a set of genetic sequences there is an expectation of what the graph should look like: where there should be branches, common subsequences, and so on. This is decided by how the input sequences are merged together to produce graphs. Although the merge can be seen as the key element in this process, these decisions are made purely by the alignment process, which in turn relies on the scoring schema and scoring threshold. Varying these two components can create a variety of graphs (Figure \ref{fig:separate_paths}). To avoid confusion we have split the two processes, letting the alignment algorithm do all the heavy lifting and keeping the merge as much of a straightforward procedure as possible.\\
\par\noindent
Whenever we have produced an alignment $A$ for a string $s$ and a graph $G$ we can do a merge, a procedure which intuitively should result in the internal structure of $s$ being present in the merged graph $G^*$. To make reasoning around the procedure as straightforward as possible, we visualize $s$ as a special sequence graph $G_s$ where every individual character is represented by a single vertex and every consecutive pair of characters is connected with an edge. This is a convenient representation for several reasons: for one, the graph has a strict internal direction which we can use to number the vertices according to their position in the string. We let $s_x$ denote the vertex in $G_s$ with position $x$. An important note is that this is a simplified type of the previously defined graphs which does not require start and end vertices, which means the indexation begins at 0.
\clearpage\noindent
The reason for this simplification is that we can use the alignment $A$ to create an equality operator:
\begin{equation}
eq(s_x, v_y) =
\begin{dcases}
True & \text{if } a_x = v_y\\
False & \text{otherwise}
\end{dcases}
\end{equation}
When we have this way of checking equality between the two graphs, we can simply let the vertex set $V^*$ of the merged graph be the union of the original vertex set $V$ and the vertex set of $G_s$, where every ``new'' vertex is given a new index when merged in. For the edges there are four cases to consider: an edge between two existing vertices, an edge from a new vertex to an existing one, its opposite, and an edge between two new vertices. To simplify the procedure, the three latter cases can all be generalized as cases where the edge does not already exist and needs to be created.\\
\par\noindent
The implementation of this procedure is even more streamlined. Because the string and the alignment both have a direction, we can move along this direction and merge or create new vertices iteratively. We let a pointer $prev$ denote the index of the previous element in the graph. Because every input sequence which is represented in the graph should have a full path, we initialize this to $s_G$. We then iterate over the elements of the alignment. For every index $i$ we start by creating a new vertex $curr$ if $a_i = 0$ and thus unmapped. We also create a new vertex if the index is mapped but the characters $s_i$ and $b(v_{a_i})$ are not equal. This vertex contains the character $s_i$ and is given a unique index $x$ by the graph. If the index $i$ is mapped and the characters are equal, we let $curr$ be the vertex $v_{a_i}$. At this point we have a pair of vertices, $prev$ and $curr$, which represent consecutive characters in $s$ and should thus have an edge between them. We create this edge by inserting $curr$ in the neighbour set of $prev$, and set $prev=curr$ to move the pointer. When the iteration finishes, the pointer contains the last vertex in the alignment, and we create an edge from $prev$ to $t_G$ to finish off the full path.\\
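\par\noindent
A sketch of the iterative merge in Python, continuing the graph representation from the earlier sketches (a map from vertex index to a triple of base, incoming and outgoing neighbour sets); as before, the names illustrate the idea rather than the tool's actual data model.
\begin{verbatim}
def merge(graph, s, alignment, s_index, t_index):
    prev = s_index                         # full paths start in s_G
    for i, a in enumerate(alignment):
        if a == 0 or graph[a][0] != s[i]:  # unmapped, or bases differ
            curr = max(graph) + 1          # new unique vertex index
            graph[curr] = (s[i], set(), set())
        else:
            curr = a                       # reuse the matched vertex
        graph[prev][2].add(curr)           # edge prev -> curr
        graph[curr][1].add(prev)
        prev = curr
    graph[prev][2].add(t_index)            # finish off the full path
    graph[t_index][1].add(prev)
\end{verbatim}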
\par\noindent
There is one important aspect to be considered when using the merge procedure: every sequence which is merged in is considered a stand-alone sequence which starts at the beginning of the graph and ends at the end. This is important because it divides the applications of the tool into two: the alignment of short reads, and a combined operation of aligning and merging complete sequences. This is important to point out because in the previous chapter we state that the algorithm is tuned for aligning reads shorter than the reference. As mentioned, this does not obstruct the validity of the approach. It is, however, apparent that the tool is not optimized for these kinds of operations. We envision the approach as part of a larger process, where smaller reads are aligned, combined and eventually merged. This larger process can for instance be done by overlap techniques, as the results from an alignment are a set of unique IDs.
\end{document}