\documentclass[thesis.tex]{subfiles}
\begin{document}
\chapter{Background}
This chapter builds the foundation for understanding the theory necessary for the remainder of the thesis. We start with a brief introduction to general biology. Because of the vastness of this field, we cover only the small set of elements necessary for understanding the motivation behind the structure and approach presented later. A more thorough presentation can be found in the bibliography \cite[Chapter 1 \& 2]{introduction_to_bioinformatics}\cite{introduction_to_genomics}. We then step into the realm of technology and bioinformatics, presenting techniques and concepts necessary for the description of our algorithm. We finish with a more specific presentation of graph-based reference genomes, discussing a set of articles, the solutions they provide, and what we see as unresolved problems.
\section{Genetics}
\textit{Deoxyribonucleic acid} (DNA) is a molecular structure in which living organisms store genetic information. The information is encoded by \textit{nucleotides} bound together by a sugar-phosphate backbone into strands. The nucleotides are smaller molecules which vary based on the nitrogenous base they contain: \textit{Adenine} (A), \textit{Cytosine} (C), \textit{Guanine} (G) or \textit{Thymine} (T). Each nucleotide has a \textit{complementary base}, A with T and C with G, with which it can bind to form a \textit{base pair} (bp). Larger numbers of base pairs are typically written with standard SI prefixes\footnote{1.000bp=1kb, 1.000.000bp=1Mb, 1.000.000.000bp=1Gb}. Due to the chemical structure of the nucleotides, a DNA strand can be said to have a direction: upstream towards the 5' end or downstream towards the 3' end. The DNA molecule is composed of two reverse complementary strands connected in a double helix structure. The two strands have opposing directions, and every base in one strand is joined in a base pair with its complement in the other. Because either strand can easily be deduced from the other, DNA is usually represented by only one of them. We can thus view DNA as a linear sequence of discrete units and represent it using text strings over the four letters representing the nucleotides. These string representations often also contain the letter N, referencing \textit{aNy base}. The genetic sequence of an individual is called the \textit{genotype}; its observable traits are called the \textit{phenotype}.
\subsection{The central dogma}
The process of transforming genetic information into large functional biomolecules is described by \textit{the central dogma} of molecular biology. The central dogma states that DNA is transcribed into \textit{messenger RNA} (mRNA), which in turn is translated into proteins. mRNA is, like DNA, a sequence of nucleotides, consisting of the bases A, C and G, with \textit{Uracil} (U) in place of T. The mRNA can be divided into triplets of nucleotides called \textit{codons}. The cell decodes the mRNA codons and creates strings of amino acids which are folded into functional proteins. The relationship between codons and amino acids can be looked up in a table called \textit{The standard genetic code} \cite[Chapter 1, p. 6]{introduction_to_genomics}. Only a portion of the nucleotides in DNA act as \textit{coding regions}, which make it through the transcription process and code for actual protein sequences. These are also called \textit{exons}. The remaining \textit{non-coding regions} of the genetic sequence are known as \textit{introns}. In humans, about 1.3\% of the genome consists of coding regions \cite[Chapter 4]{introduction_to_genomics}; the rest used to be referred to as \textit{junk DNA}. We now know that the non-coding regions also hold important information \cite{encyclopedia}.
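As an illustration, translation by table lookup can be sketched as follows. This is a minimal sketch: only a handful of the 64 codons of the standard genetic code are included, and the three-letter amino acid abbreviations are used for readability.

```python
# A fragment of the standard genetic code, mapping mRNA codons to
# amino acids. The full table covers all 64 codons.
GENETIC_CODE = {
    "AUG": "Met",  # Methionine, also the start codon
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate(mrna):
    """Translate an mRNA string into a list of amino acids,
    codon by codon, stopping at the first stop codon."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        residue = GENETIC_CODE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids

print(translate("AUGUUUGGAUAA"))  # ['Met', 'Phe', 'Gly']
```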
\subsection{Variation}
\label{sec:genetic_variation}
Genetic information is prone to mutations, either as a result of environmental influence or as a consequence of imperfections during DNA replication. The simplest mutations are \textit{point mutations}, which affect a single nucleotide base. Point mutations can either be \textit{single-nucleotide polymorphisms} (SNPs), where a single base is substituted for another, or \textit{insertions} or \textit{deletions} (indels), where a single nucleotide is inserted into or removed from the genetic sequence. Mutations can also occur over larger areas of the genome, where longer subsequences can be deleted, inserted, moved or reversed. A final type of mutation is \textit{copy number variations}, or \textit{repeats}, where a longer sequence of DNA, typically at least 1 kb \cite{copy_number_variation_new_insights_in_genome_diversity}, is repeated a variable number of times.\\
\par\noindent
As mutations happen randomly to individuals in a population, a diversity of genotypes emerges and creates variability within a \textit{gene pool}. These different genotypes give rise to a variety of phenotypes. A subset of these phenotypes can make an individual better suited for survival and reproduction than others. Given enough time and scarcity of resources, the best suited individuals will survive and pass on their genes to the next generation. This is the process of \textit{natural selection}, which is the main driving force behind evolution. Another mechanism in play is \textit{genetic drift}, which affects gene frequencies in a gene pool through non-selective, random processes. \\
\par\noindent
Because there are more possible nucleotide triplets than there are amino acids, several codons can encode the same amino acid. For instance, the DNA triplets ``CGA'', ``CGC'', ``CGG'', ``CGT'', ``AGA'' and ``AGG'' all encode the amino acid Arginine. In these cases, point mutations can occur without affecting the resulting protein. Such mutations are called \textit{synonymous}; mutations which alter the amino acid sequence are called \textit{non-synonymous}.
\section{Genetic data, sequencing and string algorithms}
As genetic information is vital for determining the function of an individual, there is an obvious wish to understand this data. A large set of technologies and methods have been developed to collect and analyze the information. This section will present some important concepts for interacting with this data and techniques which play essential roles in the algorithm presented in the subsequent chapters of the thesis. We will also present the MHC region, the genetic region which provided the data we used in testing our approach.
\subsection{Reference genomes}
A \textit{reference genome} is a data structure which contains genetic information for a population, typically for a given species. The reference genome has a set of continuous nucleotide sequences, called \textit{contigs}, combined into larger \textit{scaffolds}, which again are combined to form the \textit{genome} of the species. The first reference genomes collapsed samples from several individuals into a linear \textit{consensus sequence} representative of the species as a whole. Later reference genomes have been built more flexibly, allowing positions on the genome, called \textit{loci}, to have several variants, termed \textit{alternate loci}. A specific variant of a gene is called an \textit{allele}. A \textit{haplotype} is a set of alleles which tend to be inherited together. Reference genomes form what can be seen as an index for the genome of a species and can be used as a reference map when sequencing new genomes\footnote{Covered in section \ref{sec:sequencing}}. The structure of a reference genome can increase computational tractability compared to storing a set of individual genomes, by avoiding duplicate storage of equivalent regions. The reference also provides a mosaic representing genetic variation, which can be useful when doing genetic analysis.
\subsection{The human genome}
\label{sec:human_genome}
The human genome consists of roughly 3Gb. The base pairs are spread over 46 chromosomes (23 pairs), and the genome is assumed to contain about 23 000 genes \cite{introduction_to_genomics}. The human reference genome is developed and maintained by the \textit{Genome Reference Consortium} \cite{genome_reference_consortium}; the current version is called GRCh38 \cite{grch38}. GRCh38 contains 261 alternate loci, spread over 178 of a total of 238 regions. An average human is estimated to deviate from the reference genome in 10.000-11.000 synonymous sites and 10.000-12.000 non-synonymous sites \cite{a_map_of_human_genome_variation_from_population_scale_sequencing}.
\subsubsection{Major Histocompatibility Complex}
\label{sec:mhc}
The \textit{Major Histocompatibility Complex} (MHC) is a genetic region spanning approximately 4.5-5Mb \cite{improved_genome_inference_in_the_mhc_using_a_population_reference_graph}\cite{canonical_stable_general_mapping_using_context_schemes}. In humans, it is located on chromosome 6 and contains roughly 200 genes \cite{retroelements_in_mhc}. MHC is a region known to contain genes which affect the functionality of the immune system \cite{the_importance_of_immune_gene_variability_in_evolutionary_ecology_and_conservation}. Moreover, MHC is known to be a highly variable region, containing variants that are directly associated with disease \cite{variation_analysis_and_gene_annotation_of_eight_mhc_haplotypes}. The high variability creates difficulties when comparing DNA sequences to determine genetic causes of observed disorders, and when determining the origin of a sequence during \textit{sequencing}.
\subsection{Sequencing}
\label{sec:sequencing}
In sequencing, a \textit{sequencing machine} is used on a physical DNA fragment to find the underlying nucleotide sequence. The machines produce short \textit{reads}, typically in the order of a hundred bp \cite{sequencing_platforms}, which are combined into longer sequences through a process called \textit{assembly}. When the sequenced individual belongs to a species which has a reference genome, reads are typically aligned against it to determine their position. The result is a set of variants, positions where the sequenced genome differs from the reference. These variants are often stored in a \textit{Variant Call Format} (VCF) file, a file format constructed for this exact purpose. In cases where no reference exists, overlap techniques \cite{an_eulerian_path_approach_to_dna_fragment_assembly} or de Bruijn graphs\footnote{The concept is presented in section \ref{sec:de_bruijn_graphs}} are often used in what is known as \textit{de novo assembly} \cite[Chapter 1, p. 19]{introduction_to_genomics}.\\
\par\noindent
The different sequencing technologies introduce varying degrees of errors in their reads \cite{sequencing_platforms}. The errors can take the form of both point mutations and larger structural variations. Reads produced by some high-throughput sequencing machines typically contain more errors towards the end of the read \cite{errors_start_end}. There exist efficient strategies for both estimating error rates \cite{estimation_of_sequencing_error_rates_in_short_reads} and correcting the reads \cite{error_correction_of_datasets_with_non_uniform_coverage}. These techniques can simplify the process of finding the origin of a read, which in a mapping assembly is done by an \textit{alignment} algorithm.
\subsection{Alignment}
\label{sec:alignment}
Sequence alignment is the process of determining a correspondence between text strings, in this case representing DNA, by mapping the elements of one to the elements of the other according to a \textit{substitution matrix} (Table \ref{fig:substitution_matrix}), producing a \textit{mapping score}. Throughout this thesis, we will let the notation $mappingScore(c_1,c_2)$ denote the score for mapping two characters $c_1, c_2$ against each other. Alignment of DNA strings has several important applications: as previously mentioned, it is used as a tool in assembly, as well as in genetic analysis and comparison.\\
\par\noindent
The alignment procedure is never allowed to change the order of the elements in the two strings, but it can introduce \textit{gaps}. A gap occurs when a character in one of the strings is not mapped to a counterpart in the opposing string (Figure \ref{fig:gap_example}). When a gap occurs, the resulting alignment is penalized according to the length of the gap by a \textit{gap penalty}. We will similarly let the notation $gapPenalty(distance)$ denote the penalty incurred for a gap of length $distance$. Gap penalties come in different shapes, often chosen according to the origin of the data involved. A \textit{linear gap penalty} grows linearly with the gap length. An \textit{affine gap penalty} distinguishes between opening and continuing a gap. A \textit{logarithmic gap penalty} lets the increase in penalty fade as the gap expands. Which of these is used, along with the choice of substitution matrix, is defined by a \textit{scoring schema}. A scoring schema can thus be seen as any structure which provides the $mappingScore$ and $gapPenalty$ functions. Scoring schemas can be based on simple match/mismatch scores, which corresponds to the mathematical \textit{edit distance problem}\footnote{Covered in detail in the subsequent section}, or on more complex scores (as depicted in Table \ref{fig:substitution_matrix}). The complex models typically try to model the probabilities behind the physical processes responsible for change. The computational sequence alignment problem consists of finding the highest scoring alignment of any two strings.\\
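A scoring schema in the sense above can be sketched as a pair of functions. This is a minimal sketch following the thesis notation $mappingScore$ and $gapPenalty$; the concrete match, mismatch and gap costs are illustrative placeholders, not values taken from any particular substitution matrix.

```python
# A minimal scoring schema: a simple match/mismatch substitution
# score together with an affine gap penalty. All numeric values
# here are assumptions chosen for illustration.
def mapping_score(c1, c2, match=1, mismatch=-1):
    """Score for mapping character c1 against c2."""
    return match if c1 == c2 else mismatch

def gap_penalty(distance, open_cost=2, extend_cost=1):
    """Affine gap penalty: a cost for opening a gap plus a cost
    per additional position the gap extends."""
    return open_cost + extend_cost * (distance - 1)

print(mapping_score("A", "A"), mapping_score("A", "C"))  # 1 -1
print(gap_penalty(1), gap_penalty(3))                    # 2 4
```

A linear gap penalty would simply be `distance * cost`; the affine variant above penalizes opening a gap more than extending one, which often matches the biology better since a single mutation event can insert or delete several bases at once.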
\par\noindent
\setlength{\intextsep}{0mm}
\begin{wrapfigure}[15]{l}{0.3\textwidth}
\begin{mdframed}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\texttt{ACGGGCCTA}\\
\texttt{||||\space||||}\\
\texttt{ACGGACCTA}
\end{center}
\end{mdframed}
\captionsetup{skip=-8pt}
\subcaption{An alignment with no gaps, but one mismatch}
\end{subfigure}
\begin{subfigure}[b]{\textwidth}
\begin{mdframed}
\begin{center}
\texttt{ACGGGCCTA}\\
\texttt{||||\space\space|||}\\
\texttt{ACGG---CTA}
\end{center}
\end{mdframed}
\captionsetup{skip=-8pt}
\subcaption{An alignment with a single gap of length 2}
\label{fig:gap_example}
\end{subfigure}
\end{mdframed}
\vspace*{-5mm}
\caption[Examples of aligned text strings]{}
\label{fig:alignments}
\end{wrapfigure}
If more than two sequences are aligned, the result is a \textit{multiple sequence alignment} (MSA). This is typically done on sequences which are expected to share a common ancestor. The goal is to determine which traits in the individuals arose from the same origins and how the involved species have diverged genetically over time. A final variant of the alignment problem involves large databases of sequences, where the algorithms must not only find the best alignment between two sequences, but also determine which sequence should be chosen in order to maximize the result. Algorithms for both of the preceding variants typically utilize heuristic methods to decrease the computational complexity.\\
\par\noindent
There exist two distinct versions of the generalized problem: finding \textit{global alignments}, where two entire strings are aligned against each other, and finding \textit{local alignments}, where one string is aligned against a substring of the other. The two are traditionally solved by the Needleman-Wunsch and Smith-Waterman algorithms, respectively, both of which are based on \textit{dynamic programming}.
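The global variant can be sketched with the following dynamic programming formulation in the style of Needleman-Wunsch. The sketch computes only the optimal score, not the alignment itself, and assumes a simple match/mismatch scheme with a linear gap penalty; any scoring schema could be substituted.

```python
# Needleman-Wunsch-style global alignment scoring with assumed
# match/mismatch/gap values and a linear gap penalty.
def needleman_wunsch_score(s, p, match=1, mismatch=-1, gap=-1):
    n, m = len(s), len(p)
    # D[i][j] holds the best score for aligning s[:i] against p[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap  # s[:i] aligned against an empty prefix
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == p[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,  # match/mismatch
                          D[i - 1][j] + gap,      # gap in p
                          D[i][j - 1] + gap)      # gap in s
    return D[n][m]

print(needleman_wunsch_score("ACGT", "ACCT"))  # 2: three matches, one mismatch
```

Smith-Waterman follows the same recurrence but clamps every cell at zero and takes the maximum over the whole table, yielding the best local alignment instead.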
\begin{table}[H]
\centering
\caption[The HOXD70 substitution matrix]{The HOXD70 substitution matrix \cite{scoring_pairwise}}
\label{fig:substitution_matrix}
\makebox[\linewidth]{
\begin{tabular}{ccccc}
&\texttt{A}&\texttt{C}&\texttt{G}&\texttt{T}\\ \cline{2-5}
\texttt{A}&\multicolumn{1}{|c|}{\texttt{91}}&\texttt{-114}&\multicolumn{1}{|c|}{\texttt{-31}}&\multicolumn{1}{c|}{\texttt{-123}}\\ \cline{2-5}
\texttt{C}&\multicolumn{1}{|c|}{\texttt{-114}}&\texttt{100}&\multicolumn{1}{|c|}{\texttt{-125}}&\multicolumn{1}{c|}{\texttt{-31}}\\ \cline{2-5}
\texttt{G}&\multicolumn{1}{|c|}{\texttt{-31}}&\texttt{-125}&\multicolumn{1}{|c|}{\texttt{100}}&\multicolumn{1}{c|}{\texttt{-114}}\\ \cline{2-5}
\texttt{T}&\multicolumn{1}{|c|}{\texttt{-123}}&\texttt{-31}&\multicolumn{1}{|c|}{\texttt{-114}}&\multicolumn{1}{c|}{\texttt{91}}\\ \cline{2-5}
\end{tabular}}
\end{table}
\subsection{Dynamic programming}
\label{sec:dynamic_programming}
\textit{Dynamic programming} (DP) is a problem-solving technique where a problem instance is solved by breaking it into smaller subproblems and combining their results. DP is similar to recursion in that every instance is solved by a \textit{recurrence relation} (an example can be seen in equation \ref{eq:ed_recurrence_relation}) which recurses on smaller and smaller problems until a \textit{base case} is reached. A base case represents the bottom of the recursion and is a value which can easily be computed without further recursive calls. The main difference between recursion and DP is that the latter usually stores its intermediate results to allow fast lookups for recurring instances. DP is often used as an approach for optimization problems in order to minimize complexity while guaranteeing optimal results \cite[Chapter 9]{algorithms_sequential_parallell_and_distributed}.
\begin{table}[!b]
\begin{tabularx}{0.65\textwidth}{ccccccccccc}
&&a&l&g&o&r&i&t&h&m \\ \cline{2-11}
&\multicolumn{1}{|c|}{0}&1&\multicolumn{1}{|c|}{2}&3&\multicolumn{1}{|c|}{4}&5&\multicolumn{1}{|c|}{6}&7&\multicolumn{1}{|c|}{8}&\multicolumn{1}{c|}{9}\\ \cline{2-11}
l &\multicolumn{1}{|c|}{1}&1&\multicolumn{1}{|c|}{1}&2&\multicolumn{1}{|c|}{3}&4&\multicolumn{1}{|c|}{5}&6&\multicolumn{1}{|c|}{7}&\multicolumn{1}{c|}{8}\\ \cline{2-11}
o &\multicolumn{1}{|c|}{2}&2&\multicolumn{1}{|c|}{2}&2&\multicolumn{1}{|c|}{2}&3&\multicolumn{1}{|c|}{4}&5&\multicolumn{1}{|c|}{6}&\multicolumn{1}{c|}{7}\\ \cline{2-11}
g &\multicolumn{1}{|c|}{3}&3&\multicolumn{1}{|c|}{3}&2&\multicolumn{1}{|c|}{3}&4&\multicolumn{1}{|c|}{5}&6&\multicolumn{1}{|c|}{7}&\multicolumn{1}{c|}{8}\\ \cline{2-11}
a &\multicolumn{1}{|c|}{4}&3&\multicolumn{1}{|c|}{4}&3&\multicolumn{1}{|c|}{3}&4&\multicolumn{1}{|c|}{5}&6&\multicolumn{1}{|c|}{7}&\multicolumn{1}{c|}{8}\\ \cline{2-11}
r &\multicolumn{1}{|c|}{5}&4&\multicolumn{1}{|c|}{4}&4&\multicolumn{1}{|c|}{4}&3&\multicolumn{1}{|c|}{4}&5&\multicolumn{1}{|c|}{6}&\multicolumn{1}{c|}{7}\\ \cline{2-11}
i &\multicolumn{1}{|c|}{6}&5&\multicolumn{1}{|c|}{5}&5&\multicolumn{1}{|c|}{5}&4&\multicolumn{1}{|c|}{3}&4&\multicolumn{1}{|c|}{5}&\multicolumn{1}{c|}{6}\\ \cline{2-11}
t &\multicolumn{1}{|c|}{7}&6&\multicolumn{1}{|c|}{6}&6&\multicolumn{1}{|c|}{6}&5&\multicolumn{1}{|c|}{4}&3&\multicolumn{1}{|c|}{4}&\multicolumn{1}{c|}{5}\\ \cline{2-11}
h &\multicolumn{1}{|c|}{8}&7&\multicolumn{1}{|c|}{7}&7&\multicolumn{1}{|c|}{7}&6&\multicolumn{1}{|c|}{5}&4&\multicolumn{1}{|c|}{3}&\multicolumn{1}{c|}{4}\\ \cline{2-11}
m &\multicolumn{1}{|c|}{9}&8&\multicolumn{1}{|c|}{8}&8&\multicolumn{1}{|c|}{8}&7&\multicolumn{1}{|c|}{6}&5&\multicolumn{1}{|c|}{4}&\multicolumn{1}{c|}{3}\\ \cline{2-11}
\end{tabularx}
\caption[An array used to solve the edit distance problem]{The two-dimensional array used for solving the edit distance problem for the strings S=``algorithm'' and P=``logarithm''}
\label{fig:edit_distance_array}
\end{table}\\
\par\noindent
A problem which is typically solved by dynamic programming is the previously mentioned edit distance problem (ED). ED is concerned with finding the minimum number of substitutions, deletions and insertions needed to transform one string into another. The algorithm utilizes a two-dimensional array to store the computed values. For two strings $S$ and $P$, every index $[i,j]$ in the table represents the subproblem of finding the edit distance of the substrings $S[0:i]$ and $P[0:j]$.
\clearpage\noindent
In table \ref{fig:edit_distance_array}, the base cases can be found in the first row and column. These are often dropped from the table itself due to the simple nature of their computations. The remainder of the table is filled out with the following recurrence relation:
\begin{equation}
D[i,j] = min
\begin{cases}
D[i-1,j] + 1\\
D[i,j-1] + 1\\
D[i-1,j-1] + score(S[i], P[j])
\end{cases}
\label{eq:ed_recurrence_relation}
\end{equation}
where $score(x, y)$ is an equality function returning 0 if the two elements are equal and 1 otherwise. The solution to the given instance of the problem can be found in the cell with the highest indices, in the bottom right corner.\\
\par\noindent
There are two distinct ways of utilizing dynamic programming. A \textit{bottom-up} approach starts at the smallest cases and computes everything until it reaches the actual problem instance. This corresponds to starting in the top left corner of the edit distance array and computing the cells iteratively, moving downwards and to the right. A \textit{top-down} procedure starts at the given problem instance and recursively computes every subproblem that is needed. This means starting in the bottom right corner of the two-dimensional array and recursing upwards and to the left. For the edit distance problem the choice of approach is of little significance, as every cell has to be computed either way, but there are problems where a top-down approach can avoid computations which are irrelevant to the final result. The latter can also be efficient for heuristic methods where an area of the search space can be skipped.
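The top-down approach for edit distance can be sketched as follows, using memoization so that each subproblem, corresponding to one cell of the table, is computed at most once. The recurrence is the one in equation \ref{eq:ed_recurrence_relation}.

```python
from functools import lru_cache

# Top-down (memoized) edit distance. d(i, j) is the edit distance
# of the prefixes s[:i] and p[:j]; lru_cache stores each computed
# subproblem, mirroring the cells of the bottom-up table.
def edit_distance(s, p):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:          # base cases: first row and column
            return j
        if j == 0:
            return i
        score = 0 if s[i - 1] == p[j - 1] else 1
        return min(d(i - 1, j) + 1,          # deletion
                   d(i, j - 1) + 1,          # insertion
                   d(i - 1, j - 1) + score)  # substitution or match
    return d(len(s), len(p))

print(edit_distance("algorithm", "logarithm"))  # 3, as in the table
```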
\subsection{Suffix trees}
A \textit{suffix trie} is a special tree constructed specifically for strings of text, containing vertices representing characters (Figure \ref{fig:suffix_trie}). When creating a suffix trie for a given string, every suffix has a corresponding leaf vertex such that the vertices along the path from root to leaf contain the characters of that suffix. Consequently, every substring has a path starting in the root vertex. A \textit{compressed suffix trie}, or \textit{suffix tree}, is a suffix trie in which every linear path is compressed into a single vertex (Figure \ref{fig:suffix_tree})\footnote{The name ``Suffix tree'' is usually used as a general description of the concept. These are naming conventions from \cite{data_structures_and_algorithm_analysis_in_java}}. Both suffix tries and suffix trees can easily be extended to hold collections of strings \cite[Chapter 20]{algorithms_sequential_parallell_and_distributed}. A naive implementation has a space complexity of $O(s)$, where $s$ is the length of the string (or the total length of all strings if the tree is built from a collection), and a string of length $m$ can be looked up in $O(km)$ time for an alphabet of size $k$ \cite[Section 20.6.1]{algorithms_sequential_parallell_and_distributed}. The tree can be constructed in linear time \cite{online_construction_of_suffix_trees}.
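A naive suffix trie can be sketched with nested dictionaries as vertices. This is an illustration of the lookup property only; practical implementations use compressed suffix trees built in linear time.

```python
# Naive suffix trie: each vertex is a dict mapping a character to
# the child vertex. Every suffix of s is inserted as a root path.
def build_suffix_trie(s):
    root = {}
    for start in range(len(s)):       # insert the suffix s[start:]
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, pattern):
    """Every substring is a prefix of some suffix, hence a path
    starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("bananas")
print(is_substring(trie, "nana"))  # True
print(is_substring(trie, "nab"))   # False
```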
\clearpage\noindent
\begin{figure}[htpb]
\begin{subfigure}[t]{0.49\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}
\node[state,scale=0.8] (q0) {};
\node[state,scale=0.8] [below left=0.6cm and -0.05cm of q0] (q1) {$a$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q0] (q2) {$n$};
\node[state,scale=0.8] [left=0.25cm of q1] (q3) {$b$};
\node[state,scale=0.8] [right=0.25cm of q2] (q4) {$s$};
\node[state,scale=0.8] [below left=0.6 and -0.05cm of q3] (q5) {$a$};
\node[state,scale=0.8] [below left=0.6 and -0.05cm of q1] (q6) {$n$};
\node[state,scale=0.8] [below right=0.6 and -0.05cm of q1] (q7) {$s$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q2] (q8) {$a$};
\node[state,scale=0.8] [below left=0.6cm and -0.05cm of q5] (q9) {$n$};
\node[state,scale=0.8] [below=0.36cm of q6] (q10) {$a$};
\node[state,scale=0.8] [below left=0.6cm and -0.05cm of q8] (q11) {$n$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q8] (q12) {$s$};
\node[state,scale=0.8] [below=0.36cm of q9] (q13) {$a$};
\node[state,scale=0.8] [below left=0.6cm and -0.05cm of q10] (q14) {$n$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q10] (q15) {$s$};
\node[state,scale=0.8] [below=0.36cm of q11] (q16) {$a$};
\node[state,scale=0.8] [below=0.36cm of q13] (q17) {$n$};
\node[state,scale=0.8] [below=0.36cm of q14] (q18) {$a$};
\node[state,scale=0.8] [below=0.36cm of q16] (q19) {$s$};
\node[state,scale=0.8] [below=0.36cm of q17] (q20) {$a$};
\node[state,scale=0.8] [below=0.36cm of q18] (q21) {$s$};
\node[state,scale=0.8] [below=0.36cm of q20] (q22) {$s$};
\path
(q0) edge node {} (q1)
edge node {} (q2)
edge node {} (q3)
edge node {} (q4)
(q3) edge node {} (q5)
(q1) edge node {} (q6)
edge node {} (q7)
(q2) edge node {} (q8)
(q5) edge node {} (q9)
(q6) edge node {} (q10)
(q8) edge node {} (q11)
edge node {} (q12)
(q9) edge node {} (q13)
(q10) edge node {} (q14)
edge node {} (q15)
(q11) edge node {} (q16)
(q13) edge node {} (q17)
(q14) edge node {} (q18)
(q16) edge node {} (q19)
(q17) edge node {} (q20)
(q18) edge node {} (q21)
(q20) edge node {} (q22);
\end{tikzpicture}
\end{center}
\end{mdframed}
\caption{}
\label{fig:suffix_trie}
\end{subfigure}
\begin{minipage}[t]{0.49\textwidth}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}
\node[state,scale=0.8] (q0) {};
\node[state,scale=0.8] [below left=0.6cm and -0.05cm of q0] (q1) {$a$};
\node[rectangle,draw,inner sep=0.2cm,scale=0.8] [left=0.25cm of q1] (q3) {$bananas$};
\node[state,scale=0.8] (q3) [below right=0.6cm and -0.05cm of q0] (q4) {$s$};
\node[rectangle,draw,inner sep=0.2cm,scale=0.8] [right=0.5cm of q4] (q2) {$na$};
\node[rectangle,draw,inner sep=0.2cm,scale=0.8] [below left=0.67cm and -0.05cm of q1] (q5) {$na$};
\node[state,scale=0.8] (q6) [below right=0.6cm and -0.05cm of q1]{$s$};
\node[rectangle,draw,inner sep=0.2cm,scale=0.8] [below left=0.67cm and -0.05cm of q2] (q7) {$nas$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q2] (q8) {$s$};
\node[rectangle,draw,inner sep=0.2cm,scale=0.8] [below left=0.67cm and -0.05cm of q5] (q9) {$nas$};
\node[state,scale=0.8] [below right=0.6cm and -0.05cm of q5] (q10) {$s$};
\path
(q0) edge node {} (q1)
edge node {} (q2)
edge node {} (q3)
edge node {} (q4)
(q1) edge node {} (q5)
edge node {} (q6)
(q2) edge node {} (q7)
edge node {} (q8)
(q5) edge node {} (q9)
edge node {} (q10);
\end{tikzpicture}
\end{center}
\end{mdframed}
\caption{}
\label{fig:suffix_tree}
\end{subfigure}
\begin{minipage}[t]{0.49\textwidth}
\caption[Examples of suffix trees]{The suffix trie (a) and suffix tree (b) of the string ``bananas''}
\end{minipage}
\end{minipage}
\label{fig:suffix_trees}
\end{figure}
\subsection{Approximate string searching}
Given a dictionary and a query string, approximate string searching is the process of finding the elements in the dictionary which are most similar to the query string by some distance measure\footnote{ED is an example of such a distance metric} \cite{approx_string_search}. This concept is also referred to as \textit{fuzzy searching}. When storage space is practically unlimited and time efficiency is the main priority, the problem is often solved through efficient storage structures \cite[Chapter 3]{information_retrieval}. Suffix trees and \textit{k-gram indexes} are examples of data structures used for this purpose. If the input data is user-generated search queries, these can be combined with statistical language models for enhanced performance. In applications where storage space is limited, dynamic programming methods can provide efficient solutions \cite{evolutionary_distance}.
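A k-gram index can be sketched as follows. This is a simplified illustration with assumed example words: candidates are ranked by the number of k-grams they share with the query, after which a real system would verify the top candidates with a distance measure such as ED.

```python
from collections import defaultdict

def kgrams(word, k=2):
    """The set of length-k substrings of a word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def build_kgram_index(dictionary, k=2):
    """Map each k-gram to the set of dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index

def candidates(index, query, k=2):
    """Rank dictionary words by the number of k-grams shared
    with the query (most similar first)."""
    counts = defaultdict(int)
    for gram in kgrams(query, k):
        for word in index[gram]:
            counts[word] += 1
    return sorted(counts, key=counts.get, reverse=True)

index = build_kgram_index(["banana", "bandage", "analog"])
print(candidates(index, "banan")[0])  # banana: shares the most 2-grams
```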
\clearpage
\section{Graph based genome representations}
In the ``Genetics'' section we were introduced to the variable nature of genetic data. A linear model provided by text strings seems suboptimal for representing this variation due to its innate lack of flexibility. In this section, we will present graphs as an alternative model for genetic data. Graphs are far more expressive, and thus able to represent more complex relationships between the elements involved. Additionally, if we can rephrase biological questions in graph theoretical terms, we can benefit from the extensive mathematical field of graph theory when searching for solutions to arising problems. However, when changing the underlying structure, a major problem appears: to avoid a decline in functionality, the more complex model calls for more sophisticated variants of the existing methods for interacting with the data. Graph-based approaches have been used for some time in the assembly process, and more recently in relation to reference genomes. We will present the work done on both these subjects alongside what we see as some of the remaining unsolved problems. The content of the section is presented in a way which should not require any prior knowledge of graph theory beyond elementary terms, but readers interested in a complete introduction are referred to the bibliography \cite[Chapter 9]{data_structures_and_algorithm_analysis_in_java}\cite[Chapter 11]{algorithms_sequential_parallell_and_distributed} \cite[Chapter 0]{introduction_to_the_theory_of_computation}. The complexity of the graphs and their operations will throughout the thesis be discussed using \textit{big-O} notation \cite[Chapter 2]{data_structures_and_algorithm_analysis_in_java}\cite[Section 3.1]{algorithms_sequential_parallell_and_distributed}.
\subsection{Model}
\setlength{\intextsep}{0mm}
\begin{wrapfigure}{r}{0.65\textwidth}
\begin{mdframed}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[-,>=stealth',shorten >=1pt,auto,node distance=1.4cm]
\node[state,scale=0.7] (q0) {$start$};
\node[state,scale=0.7] [above right=0.2cm and 1cm of q0] (q1) {$A$};
\node[state,scale=0.7] [below right=0.2cm and 1cm of q0] (q2) {$C$};
\node[state,scale=0.7] [right=0.8cm of q1] (q3) {$G$};
\node[state,scale=0.7] [right=0.8cm of q2] (q4) {$T$};
\node[state,scale=0.7] [below right=0.2cm and 1cm of q3] (q5) {$end$};
\path (q0) edge[->] node {} (q1)
edge[->] node {} (q2)
edge[->,bend right=10] node {} (q3)
edge[->,bend left=10] node {} (q4)
edge[->] node {} (q5)
(q1) edge[<->] node {} (q2)
edge[loop above] node {} (q1)
edge[<->] node {} (q3)
edge[<->] node {} (q4)
edge[->, bend right=10] node {} (q5)
(q2) edge[<->] node {} (q3)
edge[loop below] node {} (q2)
edge[<->] node {} (q4)
edge[->, bend left=10] node {} (q5)
(q3) edge[<->] node {} (q4)
edge[loop above] node {} (q3)
edge[->] node {} (q5)
(q4) edge[->] node {} (q5)
edge[loop below] node {} (q4);
\end{tikzpicture}
\end{center}
\end{mdframed}
\captionsetup{skip=-8pt}
\subcaption{A graph with paths corresponding to every possible DNA string}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=1.25cm]
\node[state,scale=0.7] (q0) {$start$};
\node[state,scale=0.7] [right =0.3cm of q0] (q1) {$A$};
\node[state,scale=0.7] [right of=q1] (q2) {$G$};
\node[state,scale=0.7] [right of=q2] (q3) {$C$};
\node[state,scale=0.7] [right of=q3] (q4) {$T$};
\node[state,scale=0.7] [right of=q4] (q5) {$C$};
\node[state,scale=0.7] [right=0.3cm of q5] (q6) {$end$};
\node[state,scale=0.7] [above of=q1] (q7) {$A$};
\node[state,scale=0.7] [right of=q7] (q8) {$G$};
\node[state,scale=0.7] [right of=q8] (q9) {$G$};
\node[state,scale=0.7] [right of=q9] (q10) {$T$};
\node[state,scale=0.7] [right of=q10] (q11) {$C$};
\node[state,scale=0.7] [below of=q1] (q12) {$A$};
\node[state,scale=0.7] [right of=q12] (q13) {$T$};
\node[state,scale=0.7] [right of=q13] (q14) {$C$};
\node[state,scale=0.7] [right of=q14] (q15) {$T$};
\node[state,scale=0.7] [right of=q15] (q16) {$C$};
\path (q0) edge node {} (q1)
edge node {} (q7)
edge node {} (q12)
(q1) edge node {} (q2)
(q2) edge node {} (q3)
(q3) edge node {} (q4)
(q4) edge node {} (q5)
(q5) edge node {} (q6)
(q7) edge node {} (q8)
(q8) edge node {} (q9)
(q9) edge node {} (q10)
(q10) edge node {} (q11)
(q11) edge node {} (q6)
(q12) edge node {} (q13)
(q13) edge node {} (q14)
(q14) edge node {} (q15)
(q15) edge node {} (q16)
(q16) edge node {} (q6);
\end{tikzpicture}
\end{center}
\end{mdframed}
\captionsetup{skip=-8pt}
\subcaption{A graph modelling three distinct genetic sequences}
\end{subfigure}
\end{mdframed}
\caption[Two proposed graph models]{Two proposed graph models displaying flexibility (a) and rigidity (b)}
\label{fig:flexibility_rigidness_tradeoff}
\end{wrapfigure}
Deciding upon the representation of the graph consists of defining the structure of the elements involved, namely the vertices and edges. As the graphs are built from genetic information, the basic building blocks, the nucleotides, should obviously be represented. If the input data is more complex than single nucleotides, we must also represent the relationships between them. Because the input data contains variation, the structure needs to allow flexibility. There is, however, a risk of making the structures so flexible that they present no consistency, and a flexibility/rigidity trade-off becomes apparent (visualized in Figure \ref{fig:flexibility_rigidness_tradeoff}). How the structures are defined in detail should be determined through a requirement specification based on the operations which are desirable to perform on them.
\subsubsection{De Bruijn graphs}
\label{sec:de_bruijn_graphs}
In the article ``An Eulerian path approach to DNA fragment assembly'' \cite{an_eulerian_path_approach_to_dna_fragment_assembly}, Pevzner, Tang and Waterman propose \textit{de Bruijn} graphs as a solution to the problem of finding the correct origin of repeats during fragment assembly. A de Bruijn graph is a structure where vertices represent \textit{k-mers} from an alphabet and edges represent relationships between the k-mers of two vertices (Figure \ref{fig:de_bruijn_graph_example}). Pevzner et al. let the vertices contain strings of length $l-1$ and connect two vertices with an edge wherever there exists a read of length $l$ containing both of the substrings. Formulating the problem in this fashion turns it into an \textit{Eulerian path} problem, solvable in polynomial time, rather than the traditional ``overlap-layout-consensus'' method, which is equivalent to the NP-complete problem of finding a \textit{Hamiltonian path} \cite[Section 11.1]{algorithms_sequential_parallell_and_distributed}. A great benefit of de Bruijn graphs is that there is no ambiguity: any legal k-mer is at no point represented by more than one vertex.\\
\par\noindent
A more detailed variant of the de Bruijn graph is the colored de Bruijn graph, where the origins of edges and vertices are visualized as colors. The entire sequence originating from a single individual sample can be recovered by following a path with a given color. Similarities between samples appear as multicolored stretches; variation takes the form of bubbles. Colored de Bruijn graphs can be used for \textit{de novo} assembly as a more powerful method for detecting variation than traditional assembly techniques \cite{de_novo_assembly_and_genotyping_of_variants_using_colored_de_bruijn_graphs}.
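To make the construction concrete, the scheme of Pevzner et al. can be sketched in a few lines of Python. This is our own illustrative code, not taken from the article: vertices are the $(l-1)$-mers of the reads, and an edge is drawn wherever some read contains the $l$-mer overlapping both.

```python
from collections import defaultdict

def de_bruijn(reads, l):
    """Build a de Bruijn graph: vertices are (l-1)-mers, and an edge joins
    two vertices whenever some read contains the l-mer overlapping both."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - l + 1):
            lmer = read[i:i + l]
            edges[lmer[:-1]].add(lmer[1:])  # left (l-1)-mer -> right (l-1)-mer
    return edges

# The three example strings collapse onto shared vertices:
graph = de_bruijn(["ACGTTTACG", "ACGTGTACG", "ACGTTACG"], l=4)
```

With $l=4$ the vertices are 3-mers; because the graph is keyed on the k-mer strings themselves, a repeated 3-mer such as ACG is necessarily represented by a single vertex, which is exactly the uniqueness property discussed above.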
\subsubsection{Sequence graphs}
\label{sec:sequence_graphs}
The relationship between a de Bruijn graph and the sequences it represents is not immediately apparent. A more intuitive representation is a graph where every vertex contains exactly one nucleotide (Figure \ref{fig:sequence_graph_example}), a concept called \textit{partial order graphs} by Lee et al. \cite{multiple_sequence_alignment_using_partial_order_graphs} and \textit{sequence graphs} by Paten et al. \cite{mapping_to_a_reference_genome_structure}. In this representation, the underlying connection between the characters of a text string and the vertices of the graph is more apparent. The representation does, however, have a major disadvantage compared to de Bruijn graphs: the loss of uniqueness. A vertex can no longer be identified solely by the data it contains. To solve this problem the vertices can be given IDs, for instance UUIDs as proposed by Paten et al. Even though these IDs can be used to identify a vertex, they contain no information regarding the relationships between elements. The difficulties presented by this problem will be the basis for the subsequent section on \textit{mapping}.
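A minimal sequence-graph structure along these lines can be sketched as follows. This is a hypothetical illustration in Python; the class and method names are our own, not from the cited articles. Each vertex holds one base and a generated UUID which is unique but, as noted above, carries no relational information.

```python
import uuid

class SequenceGraph:
    """A minimal sequence graph: one nucleotide per vertex, identified
    by a generated UUID as proposed by Paten et al."""
    def __init__(self):
        self.base = {}   # vertex id -> nucleotide
        self.edges = {}  # vertex id -> set of successor ids

    def add_vertex(self, nucleotide):
        vid = uuid.uuid4().hex
        self.base[vid] = nucleotide
        self.edges[vid] = set()
        return vid

    def thread(self, sequence):
        """Add a sequence as a new linear path. No merging of shared
        stretches is attempted in this sketch, so common substrings
        are duplicated rather than reused."""
        prev = None
        for nucleotide in sequence:
            vid = self.add_vertex(nucleotide)
            if prev is not None:
                self.edges[prev].add(vid)
            prev = vid

g = SequenceGraph()
for s in ["ACGTTTACG", "ACGTGTACG", "ACGTTACG"]:
    g.thread(s)
```

The duplication in this naive sketch illustrates the uniqueness problem directly: the graph now holds many vertices labelled A, and an ID alone reveals nothing about which occurrence it denotes.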
\begin{figure}[!ht]
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,auto,node distance=1.5cm]
\node[state] (q0) {$ACG$};
\node[state] [right of=q0] (q1) {$CGT$};
\node[state] [right of=q1] (q2) {$GTT$};
\node[state] [right of=q2] (q3) {$TTT$};
\node[state] [right of=q3] (q4) {$TTA$};
\node[state] [right of=q4] (q5) {$TAC$};
\node[state] [right of=q5] (q6) {$ACG$};
\node[state] [below of=q2] (q7) {$GTG$};
\node[state] [below of=q3] (q8) {$TGT$};
\node[state] [below of=q4] (q9) {$GTA$};
\path (q0) edge node {} (q1)
(q1) edge node {} (q2)
edge node {} (q7)
(q2) edge node {} (q3)
edge[bend left=50,->] node {} (q4)
(q3) edge node {} (q4)
(q4) edge node {} (q5)
(q5) edge node {} (q6)
(q7) edge node {} (q8)
(q8) edge node {} (q9)
(q9) edge node {} (q5);
\end{tikzpicture}
\end{center}
\end{mdframed}
\subcaption{A de Bruijn graph with $k=3$}
\label{fig:de_bruijn_graph_example}
\end{subfigure}
\begin{subfigure}[t]{\textwidth}
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,auto,node distance=1.3cm]
\node[state] (q0) {$A$};
\node[state] [right of=q0] (q1) {$C$};
\node[state] [right of=q1] (q2) {$G$};
\node[state] [right of=q2] (q3) {$T$};
\node[state] [right of=q3] (q4) {$T$};
\node[state] [right of=q4] (q5) {$T$};
\node[state] [right of=q5] (q6) {$A$};
\node[state] [right of=q6] (q7) {$C$};
\node[state] [right of=q7] (q8) {$G$};
\node[state] [below of=q4] (q9) {$G$};
\path (q0) edge node {} (q1)
(q1) edge node {} (q2)
(q2) edge node {} (q3)
(q3) edge node {} (q4)
edge[bend left=50,->] node {} (q5)
edge node {} (q9)
(q4) edge node {} (q5)
(q5) edge node {} (q6)
(q6) edge node {} (q7)
(q7) edge node {} (q8)
(q9) edge node {} (q5);
\end{tikzpicture}
\end{center}
\end{mdframed}
\subcaption{A Sequence graph}
\label{fig:sequence_graph_example}
\end{subfigure}
\caption[A de Bruijn graph and a sequence graph]{A de Bruijn graph and a sequence graph representation of the strings ``ACGTTTACG'', ``ACGTGTACG'' and ``ACGTTACG''}
\end{figure}
\subsection{Mapping}
\label{sec:mapping}
One of the operations which is necessary to perform, and therefore should be part of a requirement specification for the model, is mapping. We will in this thesis define mapping and alignment as two separate concepts, although the two terms are often used interchangeably. We let mapping be the process of finding relationships between single characters of a string and single elements of a reference genome. We define alignment as the process of finding relationships between consecutive elements of an input string and substructures in the reference genome. In the context of linear strings, mapping is easy: every string has the same underlying coordinate system, represented by the positions of the characters. This means two elements from two separate sequences are either in the same position or they are not. When they are not, the difference in position can be derived from the difference between the indexes, a measure of the distance from the beginning of the string to the position.\\
\par\noindent
If we assume the indexing system proposed in the last section, the indexes of a graph have only one property: uniqueness. They do not hold the intrinsic value of describing relations between vertices. A rigid mapping system using fixed coordinates would face problems when dealing with a fluid graph that can merge in new information, as the internal relationships are bound to change. In de Bruijn graphs the problem is solved by moving the mappable quality away from positions and into the data: for any possible k-mer there either is a corresponding vertex or there is not. In sequence graphs, where the single nucleotide is the most basic building block, every base has as many identically scoring candidate positions as there are vertices containing that base in the graph.\\
\par\noindent
Paten et al. \cite{mapping_to_a_reference_genome_structure} introduce the concept of \textit{context-based mapping} as a solution to the mapping problem when the reference is modeled as a sequence graph. Context-based mapping is an approach where a vertex is identified by its surrounding environment in the graph. More technically, a vertex has a set of \textit{contexts}, tuples $(L, B, R)$ where $L$ is a path entering the vertex from the left, $B$ is the base contained in the vertex itself, and $R$ is a path leaving the vertex to the right (Figure \ref{fig:contexts_example}). More conceptually, contexts can be seen as paths which pass through a given vertex. Because these paths are linear and pass through vertices containing characters, the contexts can be treated as text strings. The article presents two concrete example schemes: the \textit{general left-right exact match mapping scheme} and the \textit{central exact match mapping scheme}. The keywords left-right and central refer to how a vertex defines its contexts based on its surroundings. The former defines separate contexts for incoming and outgoing paths, whereas the latter defines the vertex as the center of a path where the difference between the lengths of the two contexts is minimized. The \textit{balanced central exact match mapping scheme} is a special case of the central scheme where both contexts have the same length, making the vertex the center of a k-mer, a concept closely related to de Bruijn graphs.\\
\par\noindent
\begin{figure}[!hb]
\begin{mdframed}
\begin{center}
\begin{tikzpicture}[->,auto,node distance=2.7cm]
\node[state] (q0) {$A$};
\node[draw=none,align=center] [below=0cm of q0] (x0) {$L=\emptyset$\\$R=\{"TG"\}$};
\node[state] [right of=q0](q1) {$T$};
\node[draw=none,align=center] [above=0cm of q1] (x1) {$L=\{"A"\}$\\$R=\{"GT","GC"\}$};
\node[state] [right of=q1](q2) {$G$};
\node[draw=none,align=center] [below=0cm of q2] (x2) {$L=\{"TA"\}$\\$R=\{"TG", "CG"\}$};
\node[state] [above right of=q2] (q3) {$T$};
\node[draw=none,align=center] [above=0cm of q3] (x3) {$L=\{"GT"\}$\\$R=\{"G"\}$};
\node[state] [below right of=q2] (q4) {$C$};
\node[draw=none,align=center] [below=0cm of q4] (x4) {$L=\{"GT"\}$\\$R=\{"G"\}$};
\node[state] [below right of=q3] (q5) {$G$};
\node[draw=none,align=center] [below=0cm of q5] (x5) {$L=\{"CG", "TG"\}$\\$R=\emptyset$};
\path (q0) edge node {} (q1)
(q1) edge node {} (q2)
(q2) edge node {} (q3)
(q2) edge[bend left=40] node {} (q4)
(q3) edge node {} (q5)
(q4) edge[bend left=40] node {} (q5);
\end{tikzpicture}
\end{center}
\end{mdframed}
\caption[A sequence graph with contexts]{A sequence graph with contexts of length 2 explicitly shown next to the vertices. All contexts are seen in the direction of traversal away from the vertex}
\label{fig:contexts_example}
\end{figure}
\clearpage\noindent
Both of the examples use the word \textit{exact} in their definitions. The term refers to the fact that every context is \textit{unique} to a single vertex, meaning every possible context either maps unambiguously to a single vertex or does not map at all. As graphs can branch, a vertex may have several contexts, collected in a \textit{context set}. Because every context is unique, a collection of contexts will also be unique. The result is a two-way unique mapping scheme. This is an even stronger notion of mappability than positions in strings, as a character of a string does not necessarily map uniquely back to its position. Being this precise in the definition has a drawback: if a vertex does not have a unique context set, it is no longer mappable. In spite of this, the context-based approach presents a precise and efficient solution to the mapping problem. This is knowledge we will bring with us when we move on to the more complex alignment problem.
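The contexts of Figure \ref{fig:contexts_example} can be computed mechanically. The following Python sketch, with our own hypothetical function names, collects for every vertex the paths of length up to $k$ leading in and out, and maps a query context exactly when it belongs to precisely one vertex. The empty string plays the role of $\emptyset$ at the graph boundary.

```python
def paths_from(edges, v, k, base):
    """All strings of length up to k readable away from v (exclusive),
    truncated where the graph ends."""
    if k == 0 or not edges[v]:
        return {""}
    return {base[w] + tail
            for w in edges[v]
            for tail in paths_from(edges, w, k - 1, base)}

def contexts(edges, redges, base, k):
    """The context set of each vertex: tuples (L, B, R), with L read
    leftwards away from the vertex, as in the figure."""
    return {v: {(l, base[v], r)
                for l in paths_from(redges, v, k, base)
                for r in paths_from(edges, v, k, base)}
            for v in base}

def map_context(ctx, query):
    """Exact-match mapping: succeed only on a unique hit."""
    hits = [v for v, cs in ctx.items() if query in cs]
    return hits[0] if len(hits) == 1 else None

# The graph of the figure: A-T-G branching into T/C, rejoining in G.
base = {0: "A", 1: "T", 2: "G", 3: "T", 4: "C", 5: "G"}
edges = {0: {1}, 1: {2}, 2: {3, 4}, 3: {5}, 4: {5}, 5: set()}
redges = {0: set(), 1: {0}, 2: {1}, 3: {2}, 4: {2}, 5: {3, 4}}
ctx = contexts(edges, redges, base, 2)
```

With $k=2$, the query context $("GT", "T", "G")$ maps uniquely to the upper branch vertex and $("GT", "C", "G")$ to the lower, while a context occurring at no vertex, or at several, maps nowhere.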
\subsection{Alignment}
As previously discussed, alignment of text strings has for some time been considered solved. We let each of the two strings represent its own dimension in a two-dimensional space and search for a path through the space which yields an optimal score. When one of the strings is replaced with a graph, a simple two-dimensional representation is no longer sufficient. The ``3 steps before'' or ``11 steps after'' relationships found in strings are no longer as easily derivable. One solution is to treat alignment against a graph like alignment against the sequences in a database: there exist several possible sequences which can each be aligned two-dimensionally, and we pick the one yielding the highest score. But, unlike individual sequences in a database, the paths through a graph can have overlapping regions, and enumerating all possible paths would result in an exponential number of candidates which does not necessarily portray a fair picture of the underlying structure.
\subsubsection{Dynamic programming on graphs}
\label{sec:dp_on_graphs}
The article ``Multiple sequence alignment using partial order graphs'' \cite{multiple_sequence_alignment_using_partial_order_graphs} proposes a direct adaptation of the regular two-dimensional dynamic programming solution to graphs through the \textit{Partial Order Multiple Sequence Alignment} (PO-MSA) algorithm. Every vertex contains a one-dimensional array representing the string which is to be aligned. Just like a single array index in the edit distance problem, the vertex looks at smaller subproblems to decide the values of its array. However, because this is a graph and not a string, it is no longer sufficient to look up the preceding index: the vertex has to treat every preceding vertex as a single instance of the two-dimensional problem, to determine which of the incoming vertices represents the linear path yielding the highest score. After filling out every index $i$ of the array in vertex $v$ in this fashion, the array holds the highest score possible for the substring $S[0:i]$ over all paths ending in $v$.\\
\par\noindent
Using an approach this closely related to the known approaches for regular string alignment has its advantages. Alignments and scores are verifiable through existing tools, and the principle of optimality is preserved through the dynamic programming formulation. Techniques for handling the different types of alignment, for instance local or global, can be inherited from the domain of strings. The algorithm is, however, a crude adaptation and thus susceptible to the inherent complexity of graphs.
\label{sec:po_msa}
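The recurrence can be sketched in Python as follows. This is our own simplified restatement, phrased as edit distance (lower is better) rather than the similarity scores of the PO-MSA article, and all names are hypothetical. Each vertex, visited in topological order, fills its array by minimizing over the arrays of all its predecessors.

```python
def align_to_graph(order, base, preds, s):
    """d[v][i] = fewest edits needed to align s[:i] against some path
    from a source vertex up to and including v."""
    n = len(s)
    start = list(range(n + 1))  # virtual empty path preceding the sources
    d = {}
    for v in order:             # vertices in topological order
        rows = [d[p] for p in preds[v]] or [start]
        row = [min(r[0] for r in rows) + 1]       # v consumed, no character
        for i in range(1, n + 1):
            match = min(r[i - 1] for r in rows) + (s[i - 1] != base[v])
            gap_s = min(r[i] for r in rows) + 1   # v consumed, s[i-1] not
            gap_g = row[i - 1] + 1                # s[i-1] consumed, v not
            row.append(min(match, gap_s, gap_g))
        d[v] = row
    return d

# A diamond graph spelling "AGC" along one branch and "ATC" along the other:
base = {0: "A", 1: "G", 2: "T", 3: "C"}
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
d = align_to_graph([0, 1, 2, 3], base, preds, "ATC")
```

The final entry of the sink vertex, here d[3][-1], is the distance between the full string and the best full path; in this example it is 0, since ``ATC'' is spelled exactly by one branch.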
\subsubsection{Context-based alignment}
\label{sec:canonical}
In the article ``Canonical, Stable, General Mapping using Context Schemes'' \cite{canonical_stable_general_mapping_using_context_schemes}, the previously mentioned concept of context-based mapping is used for aligning entire strings. The algorithm works by identifying substrings of the input string which align uniquely to a context in the reference. Overlapping contexts are combined into longer \textit{Maximal Unique Matches} (MUMs) which uniquely align to a region of the reference. Finally, the aligned substrings are chained into $\beta$-synteny blocks, paths along the graph where exactly $\beta$ mismatches are allowed between the uniquely mapped elements. Any remaining bases are mapped \textit{on credit}, for instance as a regular graph search through the region represented by the gap between the end and start of two consecutive uniquely mapped subsequences. The conceptual idea is that any string mapping to a region of the graph should share a number of unique paths with it, which can be combined into a larger result. The authors name their heuristic approach the \textit{$\alpha$-$\beta$-Natural Context-Driven Mapping Scheme}. The introduction of the $\alpha$ and $\beta$ variables allows the strictness of uniqueness to be regulated, and presents a powerful approach for alignment against complex reference structures. It is, however, still a heuristic, resting on assumptions about the input data which do not always hold.
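The first stage, finding uniquely mapping substrings and combining overlaps into longer unique matches, can be illustrated on a linearized reference. This is a simplified Python sketch with hypothetical names; the real algorithm matches against graph contexts rather than a flat string, and also records where in the reference each match lands.

```python
from collections import Counter

def unique_anchors(query, reference, k):
    """Substrings of length k in the query occurring exactly once in
    the reference; ambiguous k-mers (such as a repeat) are rejected."""
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    return [(i, query[i:i + k])
            for i in range(len(query) - k + 1)
            if counts[query[i:i + k]] == 1]

def merge_overlapping(anchors, k):
    """Chain runs of consecutive unique k-mers into maximal uniquely
    mapped substrings; the gaps between the merged blocks are what
    would be aligned 'on credit'."""
    merged = []
    for i, kmer in anchors:
        if merged and i == merged[-1][0] + len(merged[-1][1]) - k + 1:
            prev_i, prev_s = merged[-1]
            merged[-1] = (prev_i, prev_s + kmer[-1])  # extend by one base
        else:
            merged.append((i, kmer))
    return merged

# "ACG" repeats in the reference, so only the interior anchors survive:
anchors = unique_anchors("ACGTGT", "ACGTGTACG", k=3)
blocks = merge_overlapping(anchors, k=3)
```

Against the reference ``ACGTGTACG'' the prefix ``ACG'' of the query is ambiguous (it occurs twice), so the chain of unique anchors starts one position in and merges into the single block ``CGTGT''.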
\end{document}