
\documentclass[thesis.tex]{subfiles}

\begin{document}
\chapter{Introduction}
With the initial sequencing of the human genome in 2001 \cite{human_genome}, the blueprint of our species was made publicly available. Researchers have used this data to understand the causes of genetic diseases and disorders and have made significant progress in the development of more accurate diagnoses and treatments. The current human reference genome, GRCh38 \cite{grch38}, combines genetic information from several individuals to represent the human genome as a whole. Because DNA is prone to mutations, sequencing data which originates from a set of individuals is bound to contain variation. GRCh38 embeds the concept of variation by allowing alternate paths through highly variable regions.\\
\par\noindent
The transition from linear references to a structure which allows alternate paths is an effort to depict the variable nature of genetic information more accurately. As sequencing technology has progressed, the cost of sequencing has decreased dramatically. This has in turn led to a larger number of individuals being sequenced, which results in the discovery of even more variation. Incorporating this variation into the reference genome gives a higher level of precision and a better basis for mapping newly sequenced data. To achieve this goal, more general graphs have been proposed as a new standard for modeling reference genomes \cite{extending_reference_assembly_models}. Graphs can by nature represent relationships which are more complex than those a linear model can represent. The level of detail achieved through this richer representation can be of great value when mapping new data against the reference.\\
\par\noindent
In order to understand genetic information, bioinformaticians have developed methods and tools for interacting with the data. To avoid a loss of functionality when transitioning to a new model, these methods and tools need to be adapted so that they continue to provide a usable interface for researchers. This process starts with formulating the problems in terms suited to the new domain and culminates in finding solutions to these problems.
\section{Aims of the thesis}
The master project itself had a clear goal: to develop an algorithm for aligning text strings against graph-based reference genomes. This also implies defining the problem in the context of graphs. The thesis presents both our formulation of the problem and the approach we propose as a solution. Interesting design choices made throughout the process are presented together with the algorithm itself, with the reasoning behind these choices given as formal arguments along the way. Additionally, the thesis has two smaller goals:
\begin{itemize}
  \item Validate the correctness of the approach
  \item Test performance and compare against other tools on larger datasets
\end{itemize}
To achieve these two smaller goals, we implemented the algorithm in the \textit{GraphGenome} tool. This tool is available online; instructions for retrieving and using it can be found in Appendix \ref{sec:tool}.\\
\par\noindent
Throughout the development process, we were faced with several decisions regarding the specificity of the problem. In many of these situations, we chose to put an upper bound on the complexity in order to end up with a simple, general, formally strict proof of concept, which should work as a foundation for expansions into more specialized applications. Whenever we introduce one of these simplifications, we discuss its impact. In the later parts of the thesis, we reintroduce many of the concepts we have simplified away when discussing the feasibility of the approach in relation to more specific biological problems.\\
\par\noindent
During the master project, Novak et al. published the article ``Canonical, Stable, General Mapping using Context Schemes'' \cite{canonical_stable_general_mapping_using_context_schemes}, which discusses an approach to alignment similar to the one we present in this thesis. The similarities and differences between the two approaches take up a large part of the discussion section.
\end{document}