\newpage

Summary of results

The taxonomic annotation program taxator-tk was shown to achieve very high precision on a number of synthetic and real metagenomes by applying phylogenetic principles. It requires similar reference genome sequences to calculate a phylogenetic neighborhood for annotation. In its initial stage, the provided example workflow offers a choice between two search programs, and the local aligner can be exchanged to adapt to sequence data from different experimental procedures. The core algorithm (RPA), which is based on pairwise alignment of partial sequences (segments), relies neither on exact scores from the local aligner nor on a complete set of retrieved homologs, and it has no related parameters to be set. The RPA was recently adapted to amino acid sequences, so that direct protein alignment can be used for the similarity search without the need to back-translate similarity matches to the nucleotide level. For example, several alternative local alignment programs for identifying similar sequences have recently been presented, which are claimed to reduce the search time through fast protein alignment with a reduced alphabet [@ZhaoRapsearch22011; @HusonPoor2014; @BuchfinkFast2014; @HauswedellLambda2014]. Another advantage of taxator-tk is its independence from curated reference data, in contrast to standard procedures in phylogenetic analysis that use precomputed HMMs or gene families. This comes at the cost of increased computation time for de novo phylogenetic structure detection but allows taxator-tk to be applied in less common, non-standard situations, for example to analyze communities with eukaryotic content such as algae or fungi.

The probabilistic model for metagenome binning and its software implementation MGLEX make use of many available sequence features to assign contigs to genomes or genome bins, and we exemplified alternative applications such as genome enrichment and bin analysis. We also showed on benchmark data that applying the model improved on the results of recent automatic binning procedures, which confirmed our initial motivation to make better use of the available data to recover individual genomes. The model itself is very generic, so that it can, in theory, also be applied to non-metagenomic datasets. We designed MGLEX as a subroutine for use in other software to maximize the benefits resulting from future improvements. It should be integrated into more user-friendly applications for genome recovery.

In the conception stage of both methods, we took care that the algorithms scale to large datasets and that they solve well-defined problems. As a commitment to open science, we released the program source code to the public and used simple and well-specified data formats wherever possible. The software should be flexible enough to keep pace with future progress in both experimental protocols and sequencing technologies.

The two methods in this thesis extend the available software for analyzing metagenomes. From a methodological perspective, they cover several algorithmic fields, including sequence alignment, phylogenetics and probabilistic modeling. Each of the articles published in the course of the thesis aims to improve the understanding of metagenomic data. While the binning review [@DrogeTaxonomic2012] (see Appendix C) gave an extensive introduction to the different metagenome binning and analysis approaches, the first method article in @sec:full_taxator-tk [@DrogeTaxatortk2014] presented the program taxator-tk, which enables precise taxonomic annotation of entire metagenomes by fast calculation of phylogenetic neighborhoods. The second method article in @sec:full_mglex [@DrogeProbabilistic2017] proposed a statistical classification framework to recover genomes from shotgun-sequenced metagenomes. Applied studies used taxator-tk and demonstrated its utility for characterizing taxonomic composition [@BulgarelliStructure2015] and for reconstructing near-complete genomes from a simple community [@DongReconstructing2017]. Finally, a comprehensive comparison of metagenome processing software was conducted as a challenge [@SczyrbaCritical2017] to improve the overall interpretation of metagenome studies.
