Johannes Dröge^1^, Alexander Schönhuth^2^, Alice C. McHardy^1*^
^1^Helmholtz Centre for Infection Research, Braunschweig, Germany
^2^Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
This is an author-produced version of an article under revision in PeerJ Computer Science. This article version has been adapted to the thesis layout. The original open-access article is accessible by DOI 10.7287/peerj.preprints.2626.
Shotgun metagenomics of microbial communities reveals information about strains of relevance for applications in medicine, biotechnology and ecology. Recovering their genomes is a crucial but very challenging step, due to the complexity of the underlying biological system and technical factors. Microbial communities are heterogeneous, often containing hundreds of genomes from different species or strains, at varying abundances and with different degrees of similarity to each other and to reference data. We present a versatile probabilistic model for genome recovery and analysis, which aggregates three types of information that are commonly used for genome recovery from metagenomes. As potential applications we showcase metagenome contig classification, genome sample enrichment and genome bin comparisons. The open source implementation MGLEX is available via the Python Package Index and on GitHub and can be embedded into metagenome analysis workflows and programs.
Shotgun sequencing of DNA extracted from a microbial community recovers genomic data from different community members while bypassing the need to obtain pure isolate cultures. It thus enables novel insights into ecosystems, especially for those genomes which are inaccessible by cultivation techniques and isolate sequencing. However, current metagenome assemblies are oftentimes highly fragmented, including unassembled reads, and require further processing to separate data according to the underlying genomes. Assembled sequences, called contigs, that originate from the same genome are placed together in this process, which is known as metagenome binning [@TysonCommunity2004; @DrogeTaxonomic2012] and for which many programs have been developed. Some of these are trained on reference sequences, using contig features such as nucleotide composition or similarity to known genomes, whereas others cluster contigs without supervision.
Recently, multiple biological or technical samples of the same environment are often sequenced to produce distinct genome copy numbers across samples, sometimes using different sequencing protocols and technologies, such as Illumina and PacBio sequencing [@HagenQuantitative2016]. The genome copy numbers are reflected by corresponding read coverage variation in the assemblies, which makes it possible to resolve samples with many genomes. The combination of experimental techniques helps to overcome platform-specific shortcomings, such as short reads or high error rates, in the data analysis. However, reconstructing high-quality bins of individual strains remains difficult without very high numbers of replicates. Often, genome reconstruction can be improved by manual intervention and iterative analysis ([@fig:binning_workflow]) or by additional sequencing experiments.
Genome bins can be constructed by considering genome-wide sequence properties. Currently, the following types of information are commonly used:
- Read contig coverage: sequencing read coverage of assembled contigs, which reflects the genome copy number (organismal abundance) in the community. Abundances can vary across biological or technical replicates, and co-vary for contigs from the same genome, supplying more information to resolve individual genomes [@BaranJoint2012; @AlbertsenGenome2013].
- Nucleotide sequence composition: the frequencies of short nucleotide subsequences of length $k$, called $k$-mers. The genomes of different species have a characteristic $k$-mer spectrum [@KarlinCompositional1997; @MchardyAccurate2007].
- Sequence similarity to reference sequences: a proxy for the phylogenetic relationship to species which have already been sequenced. The similarity is usually inferred by alignment to a reference collection and can be expressed using taxonomy [@MchardyAccurate2007].
Probabilities are a convenient and efficient way to represent and combine information that is uncertain by nature. Here, we
- propose a probabilistic aggregate model for binning based on three commonly used information sources, which can easily be extended to include new features.
- outline the features and submodels for each information type. As the feature types listed above derive from distinct processes, we define for each of them independently a suitable probabilistic submodel.
- showcase several applications related to the binning problem.
A model with data-specific structure poses an advantage for genome recovery in metagenomes because it uses the data more efficiently, which matters for fragmented assemblies with short contigs or a low number of samples for differential coverage binning. Being probabilistic, it generates probabilities instead of hard labels, so that a contig can be assigned to several related genome bins and the uncertainty can easily be assessed. The models can be applied in different ways, not just classification, which we show in our application examples. Most importantly, there is a rich repertoire of higher-level procedures based on probabilistic models, including Expectation Maximization (EM) and Markov Chain Monte Carlo (MCMC) methods, for clustering with little or no prior knowledge of the modeled genomes.
We focus on defining explicit probabilistic models for each feature type and their combination into an aggregate model. In contrast, binning methods often concatenate and transform features [@ChatterjiCompostbin2008; @AlnebergBinning2014; @ImelfortGroopm2014] before clustering. Specific models for the individual data types can be better tailored to the data generation process and therefore generally enable a better use of the information and a more robust fit of the aggregate model while requiring less data. We propose a model that is flexible with regard to both the included features and the feature extraction methods. Parametric likelihood models have been used in the context of clustering before, for a limited set of features. For instance, @KislyukUnsupervised2009 use a model for nucleotide composition and @WuMaxbin2014 integrated distance-based probabilities for 4-mers and absolute contig coverage using a Poisson model. We extend and generalize this work so that the model can be used in different contexts such as classification, clustering, genome enrichment and binning analysis. Importantly, we are not providing an automatic solution to binning but present a flexible framework to target problems associated with binning. This functionality can be used in custom workflows or programs for the steps illustrated in [@fig:binning_workflow]. As input, the model incorporates genome abundance, nucleotide composition and additionally sequence similarity (via taxonomic annotation). The latter is common as taxonomic binning output [@WoodKraken2014; @DrogeTaxatortk2014; @GregorPhylopythias2016] and for quality assessment, but has rarely been used systematically as a feature in binning [@ChatterjiCompostbin2008; @LuCocacola2016]. We show that taxonomic annotation is valuable information that can improve binning considerably.
Classification is a common concept in machine learning. Usually, such algorithms use training data for different classes to construct a model which then contains the condensed information about the important properties that distinguish the data of the classes. In probabilistic modeling, we describe these properties as parameters of likelihood functions, often written as $\mathcal{L}(\bm{\theta} \mid D) = P(D \mid \bm{\theta})$, the likelihood of the model parameters given the observed data.
Let every contig $i$ be described by a joint feature vector $\mathbf{F_i}$, which comprises
- a weight $w_i$ (contig length)
- sample abundance feature vectors $\bm{a_i}$ and $\bm{r_i}$, one entry per sample
- a compositional feature vector $\bm{c_i}$, one entry per compositional feature (e.g., a $k$-mer)
- a taxonomic feature vector $\bm{t_i}$, one entry per taxon
We define the individual feature vectors in the corresponding sections. As mentioned before, each of the feature types is described by its own submodel.
For the aggregate model, the likelihood of the parameters $\mathbf{\Theta_g}$ of genome $g$, given the features $\mathbf{F_i}$ of contig $i$, is a weighted product over the $M$ submodel likelihoods:
$$ \mathcal{L}(\mathbf{\Theta_g} \mid \mathbf{F_i}) = \left( \prod_{k=1}^M \mathcal{L}(\bm{\mathit{\Theta_{gk}}} \mid \bm{F_{ik}})^{\alpha_k} \right)^\beta $$ {#eq:likelihood_aggregate}
We assume statistical independence of the feature subtypes and multiply likelihood values from the corresponding submodels. This is a simplified but reasonable assumption: e.g., the species abundance in a community can be altered by external factors without impacting the nucleotide composition of the genome or its taxonomic position. Also, there is no direct functional relation between a genome's nucleotide composition and its alignment-based taxonomic annotation.
All model parameters, that is the submodel parameters $\mathbf{\Theta_g}$ for every genome as well as the weights $\bm{\alpha}$ and $\beta$, are inferred from training sequences.
In the following, we denote the parameter vector of a single submodel and genome by $\bm{\theta}$ and the number of contigs used to estimate it by $N$.
We derive the average number of reads covering each contig position from the assembler output or by mapping the reads back onto the contigs. This mean coverage is a proxy for the genome abundance in the sample because it is roughly proportional to the genome copy number. Given careful library preparation, genome copy numbers vary across samples in a genome-specific way, so that each genome has a distinct relative read distribution. Depending on the number of reads in each sample that are associated with a genome, we obtain for every contig a coverage vector $\bm{a_i}$ with one mean coverage value per sample.
Random sequencing followed by perfect read assembly theoretically produces positional read counts which are Poisson distributed, as described in @LanderGenomic1988. In [@eq:likelihood_poisson], we derive a similar likelihood using mean coverage values (see Supplementary Methods for details). The likelihood function is a normalized product over the independent Poisson functions for the individual samples:
$$ \mathcal{L}(\bm{\theta} \mid \bm{a_i}) = \sqrt[len(\bm{a_i})]{\prod_{j=1}^{len(\bm{a_i})} P_{\theta_j}(a_{i,j})} = \sqrt[len(\bm{a_i})]{\prod_{j=1}^{len(\bm{a_i})} \frac{\theta_j^{a_{i,j}}}{a_{i,j} !} e^{-\theta_j}} $$ {#eq:likelihood_poisson}
The Poisson explicitly accounts for low and zero counts, unlike a Gaussian model. Low counts are often observed for undersequenced and rare taxa. Note that the root in [@eq:likelihood_poisson] corresponds to the geometric mean over samples, which keeps likelihood values comparable for different numbers of samples.
The maximum likelihood estimate (MLE) for $\bm{\theta}$ is the length-weighted mean of the contig coverage vectors:
$$ \bm{\hat \theta} = \dfrac{ \sum\limits_{i=1}^{N} w_i \, \bm{a_i} }{ \sum\limits_{i=1}^{N} w_i } $$ {#eq:mle_poisson}
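To make this concrete, the following minimal numpy/scipy sketch (an illustration, not the MGLEX implementation; the function names are ours) estimates the parameters according to [@eq:mle_poisson] and evaluates [@eq:likelihood_poisson] in log space.

```python
# Minimal sketch of the Poisson abundance submodel (illustrative, not the MGLEX code).
import numpy as np
from scipy.special import gammaln

def poisson_mle(coverage, weights):
    """Length-weighted mean coverage per sample (MLE above).
    coverage: (N, S) mean contig coverages; weights: (N,) contig lengths."""
    return (weights[:, None] * coverage).sum(axis=0) / weights.sum()

def poisson_loglik(coverage, theta):
    """Per-contig log-likelihood: log of the geometric mean of the per-sample Poisson
    terms a*log(theta) - theta - log(a!), with log(a!) computed as gammaln(a + 1)."""
    theta = np.maximum(theta, 1e-10)          # guard against zero-coverage samples
    ll = coverage * np.log(theta) - theta - gammaln(coverage + 1.0)
    return ll.mean(axis=1)

# toy usage: three 1 kb contigs of one genome observed in two samples
cov = np.array([[10.0, 4.0], [12.0, 5.0], [9.0, 3.0]])
lengths = np.array([1000.0, 1000.0, 1000.0])
theta = poisson_mle(cov, lengths)
print(theta, poisson_loglik(cov, theta))
```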
In particular for shorter contigs, the absolute read coverage is often overestimated. Basically, the Lander-Waterman assumptions [@LanderGenomic1988] are violated if reads do not map to their original locations due to sequencing errors or if they "stack" on certain genome regions because they are ambiguous (i.e., for repeats or conserved genes), rendering the Poisson model less appropriate. The Poisson, when conditioned on the total sum of coverages over all samples, leads to a binomial distribution, as shown by @PrzyborowskiHomogeneity1940. Therefore, we model differential abundance over different samples using a binomial in which the parameters represent the relative distribution of a genome's reads over the samples. For instance, if a particular genome had the same copy number in a total of two samples, the genome's parameter vector would be $\bm{\theta} = (0.5, 0.5)$.
The contig features $\bm{r_i}$ contain the per-sample read counts of contig $i$, and $R_i = \sum_{j} r_{i,j}$ denotes the total count over all samples. The corresponding likelihood is again a normalized product over the samples, here of binomial probabilities:
$$ \mathcal{L}(\bm{\theta} \mid \bm{r_i}) = \sqrt[len(\bm{r_i})]{\prod_{j=1}^{len(\bm{r_i})} B_{R_i,\theta_j}(r_{i,j})} = \sqrt[len(\bm{r_i})]{\prod_{j=1}^{len(\bm{r_i})} \binom{R_i}{r_{i,j}} \theta_j^{r_{i,j}} \left( 1 - \theta_j \right)^{\left( R_i - r_{i,j} \right)}} $$ {#eq:likelihood_binomial}
Since the count features derived from coverage values need not be integers, the binomial coefficient is expressed with the gamma function:
$$ {n \choose k} = \frac{\Gamma(n+1)}{\Gamma(k+1) \, \Gamma(n-k+1)} $$ {#eq:binomial_coefficient}
Because the binomial coefficient is a constant factor that is independent of $\bm{\theta}$, it cancels when comparing genome models and can be omitted in the computation.
The MLE $\bm{\hat \theta}$ is the length-weighted fraction of a genome's reads that fall into each sample:
$$ \bm{\hat \theta} = \dfrac{ \sum\limits_{i=1}^N w_i \, \bm{r_i} }{ \sum\limits_{i=1}^N w_i \, R_i } $$ {#eq:mle_binomial}
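A corresponding sketch for the relative abundance submodel (again illustrative, with hypothetical function names) implements [@eq:mle_binomial] and the log form of [@eq:likelihood_binomial], dropping the constant binomial coefficient as argued above.

```python
# Minimal sketch of the binomial relative-abundance submodel (not the MGLEX code).
import numpy as np

def binomial_mle(reads, weights):
    """Length-weighted fraction of a genome's reads in each sample.
    reads: (N, S) per-contig, per-sample read counts; weights: (N,) contig lengths."""
    num = (weights[:, None] * reads).sum(axis=0)
    return num / num.sum()

def binomial_loglik(reads, theta, eps=1e-10):
    """Per-contig log-likelihood (geometric mean over samples), without the
    theta-independent binomial coefficient."""
    R = reads.sum(axis=1, keepdims=True)            # total reads per contig over samples
    p = np.clip(theta, eps, 1.0 - eps)              # avoid log(0) for unobserved samples
    ll = reads * np.log(p) + (R - reads) * np.log1p(-p)
    return ll.mean(axis=1)
```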
It is obvious that the absolute and relative abundance models are not independent when the identical input vectors (here, the per-sample read coverages from which both $\bm{a_i}$ and $\bm{r_i}$ derive) are used. We nevertheless treat them as separate submodels, because the absolute model is sensitive to the overall coverage level whereas the relative model only considers the distribution over samples, and the aggregate model can weight these two signals individually.
Microbial genomes have a distinct "genomic fingerprint" [@KarlinCompositional1997] which is typically determined by means of $k$-mer frequencies. We use the $k$-mer frequencies of a contig as its compositional feature vector $\bm{c_i}$ (5-mers in our experiments).
For its simplicity and effectiveness, we chose a likelihood model which assumes statistical independence of the features, so that the likelihood function in [@eq:likelihood_nbayes] becomes a simple product over observation probabilities (or a linear model when transformed into a log-likelihood). Though neighboring $k$-mers overlap on the contig and are therefore not truly independent, this naive assumption works well in practice.
$$ \mathcal{L}(\bm{\theta} \mid \bm{c_i}) = \prod_{j=1}^{len(\bm{c_i})} \theta_j^{c_{i,j}} $$ {#eq:likelihood_nbayes}
The genome parameter vector $\bm{\theta}$ holds the genome-wide frequencies of the compositional features. Its MLE is the length-weighted average of the contig feature vectors:
$$ \bm{\hat \theta} = \dfrac{ \sum\limits_{i=1}^{N} w_i \, \bm{c_i} }{ \sum\limits_{i=1}^{N} w_i } $$ {#eq:mle_nbayes}
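Because the log-likelihood of this submodel is linear in the features, scoring all contigs against all genome bins reduces to one matrix product. The sketch below is illustrative only; the pseudocount that avoids zero frequencies is our own addition and not part of the equations above.

```python
# Minimal sketch of the naive Bayes composition submodel (not the MGLEX code).
import numpy as np

def composition_mle(features, weights, pseudo=1.0):
    """Length-weighted, normalized genome-wide feature frequencies.
    features: (N, K) compositional features (e.g. k-mer counts or frequencies)."""
    s = (weights[:, None] * features).sum(axis=0) + pseudo   # pseudocount: our assumption
    return s / s.sum()

def composition_loglik(features, thetas):
    """(N, G) log-likelihood matrix for N contigs and G genome bins with
    frequency vectors thetas of shape (G, K): a single matrix product."""
    return features @ np.log(thetas).T
```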
We can compare contigs to reference sequences, for instance by local alignment. Two contigs that align to closely related taxa are more likely to derive from the same genome than sequences which align to distant clades. We convert this indirect relationship into explicit taxonomic features which we can compare without direct consideration of reference sequences. A taxon corresponds to a path in a hierarchy of nested classes; for example, the species E. coli could be written as [Bacteria, Gammaproteobacteria, Enterobacteriaceae, E. coli].
We assume that distinct regions of a contig, such as genes, can be annotated with different taxa. Each taxon has a corresponding weight, which in our examples is a positive alignment score. The weighted taxa define a spectrum over the taxonomy for every contig and genome. It is not necessary that the alignment reference be complete or include the respective species genome, but all spectra must be equally biased. Since each contig is represented by a hierarchy of weighted taxa rather than a flat feature vector, we split the annotation into one feature vector per taxonomic level, accumulating the score of each taxon and all its descendants ([@tbl:taxonomic_features]).
Node | Taxon | Level $l$ | Index $j$ | Score | $t_{i,l,j}$ |
---|---|---|---|---|---|
a | Bacteria | 1 | 1 | 0 | 7 |
b | Gammaproteobacteria | 2 | 1 | 0 | 6 |
c | Betaproteobacteria | 2 | 2 | 1 | 1 |
d | Enterobacteriaceae | 3 | 1 | 0 | 5 |
e | Yersiniaceae | 3 | 2 | 1 | 1 |
f | E. vulneris | 4 | 1 | 1 | 1 |
g | E. coli | 4 | 2 | 3 | 3 |
h | Yersinia sp. | 4 | 3 | 1 | 1 |

Table: Calculating the contig features $\bm{t_i}$ from taxonomic annotation. Each annotated region contributes its alignment score to the assigned taxon (Score). The feature value $t_{i,l,j}$ of taxon $j$ at level $l$ is the sum of its own score and the scores of all its descendant taxa. {#tbl:taxonomic_features}
Each vector $\bm{t_{i,l}}$ contains one accumulated score per taxon at taxonomic level $l$, of which there are $T_l$. We model each level analogously to the compositional features, so that the contig likelihood becomes a product over all $L$ levels:
$$ \mathcal{L}(\bm{\theta} \mid \bm{t_i}) = \prod_{l=1}^{L} \prod_{j=1}^{T_l} \theta_{l,j}^{t_{i,l,j}} $$ {#eq:likelihood_hnbayes}
For simplicity, we assume that the level likelihoods are independent, which is not strictly true but effective in practice. The MLE for each level parameter $\theta_{l,j}$ is the normalized sum of the corresponding contig features:
$$ \hat \theta_{l,j} = \frac{\sum\limits_{i=1}^N t_{i,l,j}}{\sum\limits_{j'=1}^{T_l} \sum\limits_{i=1}^N t_{i,l,j'}} $$ {#eq:mle_hnbayes}
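The sketch below (an illustration with our own data structures, not the MGLEX implementation) shows how such per-level features can be accumulated from weighted root-to-leaf taxon paths, analogous to [@tbl:taxonomic_features].

```python
# Minimal sketch: accumulate weighted taxon paths into per-level features t_i.
from collections import defaultdict

def taxonomic_features(annotations):
    """annotations: list of (path, score) pairs for one contig, where path is a tuple
    of taxon names from the root to the most specific assigned taxon. Every taxon on
    the path receives the score, so scores accumulate towards the root.
    Returns {level: {taxon: accumulated score}}."""
    features = defaultdict(lambda: defaultdict(float))
    for path, score in annotations:
        for level, taxon in enumerate(path, start=1):
            features[level][taxon] += score
    return features

# toy contig with three annotated regions of different specificity
regions = [
    (("Bacteria", "Gammaproteobacteria", "Enterobacteriaceae", "E. coli"), 3.0),
    (("Bacteria", "Gammaproteobacteria", "Enterobacteriaceae", "E. vulneris"), 1.0),
    (("Bacteria", "Betaproteobacteria"), 1.0),
]
print(dict(taxonomic_features(regions)[1]))   # {'Bacteria': 5.0}
```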
The aggregate likelihood for a contig in [@eq:likelihood_aggregate] is a weighted product of submodel likelihoods. The weights in vector $\bm{\alpha}$ balance the contributions of the individual submodels, and $\beta$ scales the overall sharpness of the resulting distribution. Taking the logarithm, the aggregate likelihood becomes a weighted sum of submodel log-likelihoods:
$$ l(\mathbf{\Theta} \mid \mathbf{F_i}) = \beta \sum_{k=1}^{M} \alpha_k \, l(\bm{\mathit{\Theta_k}} \mid \bm{F_{i,k}}) $$ {#eq:loglikelihood_aggregate}
For any modeled genome, each of the $M$ submodels yields likelihoods of a different magnitude, depending on the number of features and their distributions, so that the submodels cannot simply be combined with equal weights.
Because the appropriate weighting depends on the data, we infer the weights $\bm{\alpha}$ from the training sequences.
Parameter $\beta$ acts as a smoothing parameter for the bin posterior distribution; its selection is described in the evaluation.
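As a minimal illustration of [@eq:loglikelihood_aggregate] (not the MGLEX API; function names are ours), the sketch below combines submodel log-likelihood matrices using $\bm{\alpha}$ and $\beta$ and converts the result into bin posteriors, assuming a uniform prior over bins.

```python
# Minimal sketch: weighted aggregation of submodel log-likelihoods and bin posteriors.
import numpy as np

def aggregate_loglik(submodel_logliks, alpha, beta=1.0):
    """submodel_logliks: list of (N, G) arrays (contigs x genome bins), one per submodel;
    alpha: one weight per submodel; beta: posterior smoothing parameter."""
    return beta * sum(a * ll for a, ll in zip(alpha, submodel_logliks))

def bin_posterior(loglik):
    """Row-wise softmax over bins, assuming a uniform prior."""
    z = loglik - loglik.max(axis=1, keepdims=True)   # subtract row maximum for stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# ML classification then simply picks the bin with the highest aggregate score:
# ml_bins = aggregate_loglik(logliks, alpha).argmax(axis=1)
```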
We simulated reads of a complex microbial community from 400 publicly available genomes (Supplementary Methods and Supplementary Table 1). These comprised 295 species represented by a single genome and 44 species with two or three strain genomes each, to mimic strain heterogeneity. Our aim was to create a difficult benchmark dataset under controlled settings, minimizing potential biases introduced by specific software. We sampled abundances from a lognormal distribution because it has been described as a realistic model [@SchlossCensus2006]. We then simulated a primary community which was subject to environmental changes resulting in exponential growth of 25% of the community members, at growth rates chosen uniformly at random between one and ten, whereas the other genome abundances remained unchanged. We applied this procedure three times to the primary community, which resulted in one primary and three secondary artificial community abundance profiles. With these, we generated 150 bp Illumina HiSeq reads using the ART simulator [@HuangArt2012] and chose a yield of 15 Gb per sample. The exact amount of read data for all four samples after simulation was 59.47 Gb. To avoid any bias caused by specific metagenome assembly software and to ensure a constant contig length, we divided the original genome sequences into non-overlapping artificial contigs of 1 kb length and selected a random 500 kb of each genome, to which we mapped the simulated reads using Bowtie2 [@LangmeadFast2012]. By excluding part of the genome reference, we imitated incomplete genome assemblies when mapping reads, which affects the coverage values. Finally, we subsampled 300 kb of contigs per genome with non-zero read coverage in at least one of the samples to form the demonstration dataset (120 Mb), which has 400 genomes (including related strains), four samples and contigs of 1 kb. Due to the short contigs and few samples, this is a challenging dataset for complete genome recovery [@NielsenIdentification2014] but suitable to demonstrate the functioning of our model with limited data. For each contig we derived 5-mer frequencies, taxonomic annotation (removing species-level genomes from the reference sequence data) and average read coverage per sample, as described in the Supplementary Methods.
We evaluated the performance of the model when classifying contigs to the genome with the highest likelihood, a procedure called maximum likelihood (ML) classification. We applied a form of three-fold cross-validation, dividing the simulated data set into three equally-sized parts with 100 kb from every genome. We used only 100 kb (training data) of every genome to infer the model parameters and the other 200 kb (test data) to measure the classification error; 100 kb was used for training because it is often difficult to identify sufficient training data in metagenome analysis. For each combination of submodels, we calculated the mean squared error (MSE) and mean pairwise coclustering (MPC) probability for the predicted (ML) probability matrices (Supplementary Methods), averaged over the three test data partitions. We included the MPC because it can easily be interpreted: for instance, a value of 0.5 indicates that on average 50% of all contig pairs of a genome end up in the same bin after classification. [@tbl:classification_consistency] shows that the model integrates information from each data source such that the inclusion of additional submodels resulted in a better MPC and also MSE, with a single exception: combining the absolute and relative abundance models resulted in a marginal increase of the MSE. We also found that taxonomic annotation represents the most powerful information type in our simulation. For comparison, we added scores for NBC [@RosenNbc2011], a classifier based on nucleotide composition with in-sample training using 5-mers and 15-mers, and Centrifuge [@KimCentrifuge2016], a similarity-based classifier run both with in-sample and reference data. These programs were given the same information as the corresponding submodels and rank close to them. In a further step, we investigated how the presence of very similar genomes impacted the performance of the model. We first collapsed strains from the same species by merging the corresponding columns in the classification likelihood matrix, retaining the entry with the highest likelihood, and then computed the resulting coclustering performance increase $\Delta$MPC. Considering assignment on the species instead of the strain level showed a larger $\Delta$MPC for nucleotide composition and taxonomic annotation than for absolute and relative abundance. This is expected, because the former two do not distinguish among strains, whereas genome abundance does in some, but not all, cases.
Submodels | MPC | $\Delta$MPC | MSE |
---|---|---|---|
Centrifuge (in-sample) | 0.01 | +0.01 | 0.51 |
NBC ($15$-mers) | 0.02 | +0.00 | 0.66 |
AbAb | 0.03 | +0.00 | 0.58 |
ReAb | 0.08 | +0.02 | 0.61 |
Centrifuge (reference) | 0.13 | +0.03 | 0.45 |
AbAb + ReAb | 0.21 | +0.04 | 0.59 |
NuCo | 0.30 | +0.06 | 0.52 |
NBC ($5$-mers) | 0.34 | +0.06 | 0.48 |
ReAb + NuCo | 0.41 | +0.07 | 0.48 |
AbAb + NuCo | 0.43 | +0.08 | 0.50 |
TaAn | 0.46 | +0.09 | 0.41 |
AbAb + ReAb + NuCo | 0.52 | +0.09 | 0.44 |
NuCo + TaAn | 0.52 | +0.09 | 0.40 |
AbAb + TaAn | 0.54 | +0.09 | 0.39 |
AbAb + NuCo + TaAn | 0.60 | +0.10 | 0.37 |
ReAb + TaAn | 0.60 | +0.10 | 0.36 |
ReAb + NuCo + TaAn | 0.64 | +0.11 | 0.34 |
AbAb + ReAb + TaAn | 0.65 | +0.10 | 0.35 |
AbAb + ReAb + NuCo + TaAn | 0.68 | +0.11 | 0.33 |
Table: Cross-validation performance of ML classification for all possible combinations of submodels. We report the mean pairwise coclustering probability (MPC), the strain-to-species MPC improvement ($\Delta$MPC) and the mean squared error (MSE). AbAb = absolute total abundance; ReAb = relative abundance; NuCo = nucleotide composition; TaAn = taxonomic annotation. NBC (v1.1) and Centrifuge (v1.0.3b) are external classifiers added for comparison. Rows are ordered by increasing MPC. {#tbl:classification_consistency}
The contig length of 1 kb in our simulation is considerably shorter, and therefore harder to classify, than sequences which can be produced by current assembly methods or by some cutting-edge sequencing platforms [@GoodwinComing2016]. In practice, longer contigs can be classified with higher accuracy than short ones, as more information is provided as a basis for assignment: for instance, a more robust mean coverage, a smoother $k$-mer spectrum and more regions with taxonomic annotation.
The free model parameter $\beta$, which scales the sharpness of the bin posterior distribution, was selected on the training data.
Enrichment is commonly known as an experimental technique to increase the concentration of a target substance relative to others in a probe. Thus, an enriched metagenome still contains a mixture of different genomes, but the target genome will be present at a much higher frequency than before. This allows a more focused analysis of the contigs or the application of methods which seem prohibitive for the full data for runtime or memory reasons. In the following, we demonstrate how to filter metagenome contigs by p-value to enrich in silico for specific genomes. Classifiers often assume an exhaustive list of alternative genomes, but in practice it is difficult to identify appropriate training data for all species or strains in a metagenome. Looking only at individual likelihoods, for instance the maximum among the genomes, can be misleading if the contig comes from a genome that is not modeled. For better judgment, a p-value tells us how extreme the observed likelihood is under each genome model. Many if not all binning methods lack explicit significance calculations. We can take advantage of the fact that the classification model compresses all features into a genome likelihood and generate a null (log-)likelihood distribution on training data for each genome. Therefore, we can associate empirical p-values with each newly classified contig and can, for sufficiently small p-values, reject the null hypothesis that the contig belongs to the respective genome. Since this is a form of binary classification, there is a risk of rejecting contigs that do belong to the genome, which we quantify as sensitivity.
We enriched a metagenome by first training a genome model and then calculating the p-values of the remaining contigs under this model. Contigs with p-values below the chosen critical value were discarded. The higher this cutoff is, the smaller the enriched sample becomes, but also the less complete the target genome will be. We calculated the reduced sample size as a function of the p-value cutoff for our simulation ([@fig:genome_enrichment]). Selecting a p-value threshold of 2.5% shrinks the test data on average down to 5% of the original size. Instead of an empirical p-value, we could also use a parametrized distribution or select a critical log-likelihood value by manual inspection of the log-likelihood distribution (see [@fig:alpha_inference] for an example of such a distribution). This example shows that generally a large part of a metagenome dataset can be discarded while retaining most of the target genome sequence data.
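A minimal sketch of this procedure (illustrative only, not the MGLEX command-line interface) builds the null distribution from the training log-likelihoods and keeps only contigs for which the null hypothesis is not rejected:

```python
# Minimal sketch: in-silico genome enrichment with empirical p-values.
import numpy as np

def empirical_pvalues(null_loglik, loglik):
    """Left-tail empirical p-value: the fraction of null (training) log-likelihoods
    that are <= each observed log-likelihood. Small p-values indicate a poor fit."""
    null_sorted = np.sort(null_loglik)
    ranks = np.searchsorted(null_sorted, loglik, side="right")
    return ranks / null_sorted.size

def enrich_mask(loglik, null_loglik, alpha=0.025):
    """Boolean mask of contigs retained in the enriched sample at critical value alpha."""
    return empirical_pvalues(null_loglik, loglik) > alpha
```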
The model can be used to analyze bins of metagenome contigs, regardless of the method that was used to infer these bins. Specifically, one can measure the similarity of two bins in terms of the contig likelihoods instead of, for instance, an average Euclidean distance between contig or genome feature vectors.
To compare two specific bins, we select the corresponding pair of columns in the classification likelihood matrix and calculate two mixture likelihoods for each contig (rows): an optimal mixture $\hat L$, which weights each bin by its relative likelihood for the contig, and a swapped mixture $L_{swap}$, in which the two weights are exchanged:
$$ \hat L = \hat \pi_A \, L_A + \hat \pi_B \, L_B = \left(\tfrac{L_A}{L_A + L_B}\right) L_A + \left(\tfrac{L_B}{L_A + L_B}\right) L_B = \frac{L_A^2 + L_B^2}{L_A + L_B} $$ {#eq:mixture_likelihood_opt}
$$ L_{swap} = \hat \pi_A \, L_B + \hat \pi_B \, L_A = \left(\tfrac{L_A}{L_A + L_B}\right) L_B + \left(\tfrac{L_B}{L_A + L_B}\right) L_A = \frac{2 L_A L_B}{L_A + L_B} $$ {#eq:mixture_likelihood_swap}
For example, if $L_A = L_B$, the ratio $\tfrac{L_{swap}}{\hat L} = \tfrac{2 L_A L_B}{L_A^2 + L_B^2}$ equals one, whereas it approaches zero when one of the two likelihoods dominates. We define the similarity of bins $A$ and $B$ as the weighted geometric mean of this ratio over all $N$ contigs,
$$ \text{S}(A,B) = \sqrt[Z]{\prod\limits_{i=1}^N \left( \frac{2 \, L_i(\theta_A) \, L_i(\theta_B)}{L_i^2(\theta_A) + L_i^2(\theta_B)} \right)^{\tfrac{L_i^2(\theta_A) + L_i^2(\theta_B)}{L_i(\theta_A) + L_i(\theta_B)} }} $$ {#eq:mixture_likelihood_similarity}
normalized by the total joint mixture likelihood
$$ Z = \sum_{i=1}^N \frac{L_i^2(\theta_A) + L_i^2(\theta_B)}{L_i(\theta_A) + L_i(\theta_B)} $$ {#eq:mixture_likelihood_similarity_constant}
The quantity in [@eq:mixture_likelihood_similarity] ranges from zero to one, reaching one when the two bin models produce identical likelihood values. We can therefore interpret the ratio as a percentage similarity between any two bins. A connection to the Kullback-Leibler divergence can be constructed (Supplementary Methods).
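For illustration (not the MGLEX API), the similarity in [@eq:mixture_likelihood_similarity] can be computed directly from the two per-contig likelihood columns; in practice one would work with log-likelihoods to avoid numerical underflow.

```python
# Minimal sketch: likelihood-based similarity S(A, B) of two genome bins.
import numpy as np

def bin_similarity(L_A, L_B):
    """L_A, L_B: per-contig likelihoods of the same contigs under bin models A and B.
    Returns S(A, B) in (0, 1]; 1 means both models explain the contigs equally well."""
    ratio = 2.0 * L_A * L_B / (L_A**2 + L_B**2)   # L_swap / L_hat for each contig
    weight = (L_A**2 + L_B**2) / (L_A + L_B)      # per-contig mixture likelihood
    return float(np.exp((weight * np.log(ratio)).sum() / weight.sum()))
```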
To demonstrate the application, we trained the model on our simulated genomes, assuming they were bins, and created trees ([@fig:tree_bin_comparison]) for a randomly drawn subset of 50 of the 400 genomes, using the probabilistic bin distances derived from this similarity measure.
We applied the model to show one of its current use cases on more realistic data. We downloaded the medium complexity dataset from www.cami-challenge.org, which comprises 232 genomes and two sample replicates. We also retrieved the results of two of the highest-performing automatic binning programs in the CAMI challenge evaluation, MaxBin and Metawatt [@SczyrbaCritical2017]. We took the simplest possible approach: we trained MGLEX on the genome bins derived by these methods and classified the contigs to the bins with the highest likelihood, thus ignoring any contig splitting or other post-processing applied by the original methods ([@tbl:genome_refinement]).
Binner | Variant | Bin count | Recall (bp) | ARI |
---|---|---|---|---|
Metawatt | unmodified | 285 | 0.94 | 0.75 |
Metawatt | MGLEX swapped contigs | 285 | 0.94 | 0.82 |
Metawatt | MGLEX all contigs | 285 | 1.00 | 0.77 |
MaxBin | unmodified | 125 | 0.82 | 0.90 |
MaxBin | MGLEX swapped contigs | 125 | 0.82 | 0.92 |
MaxBin | MGLEX all contigs | 125 | 1.00 | 0.76 |
Table: Genome bin refinement for the CAMI medium complexity dataset with 232 genomes and two samples. Recall is the fraction of all contigs (bp) that were assigned. The Adjusted Rand Index (ARI) is a measure of binning precision. The unmodified genome bins are the submissions to the CAMI challenge using the corresponding unsupervised binning methods Metawatt and MaxBin. MGLEX swapped contigs: contigs in the original genome bins reassigned to the bin with the highest MGLEX likelihood. MGLEX all contigs: all contigs, including those not contained in the original bins, assigned to the bin with the highest MGLEX likelihood. {#tbl:genome_refinement}
We provide a Python package called MGLEX, which includes the described model. Simple text input facilitates the integration of external programs for feature extraction, such as $k$-mer counting, read mapping and taxonomic annotation tools.
We describe an aggregate likelihood model for the reconstruction of genome bins from metagenome data sets and show its value for several applications. The model can learn from and classify nucleotide sequences from metagenomes. It provides likelihoods and posterior bin probabilities for existing genome bins, as well as p-values, which can be used to enrich a metagenome dataset with a target genome. The model can also be used to quantify bin similarity. It builds on four different submodels that make use of different information sources in metagenomics, namely absolute and relative contig coverage, nucleotide composition and previous taxonomic assignments. By its modular design, the model can easily be extended to include additional information sources. This modularity also helps in interpretation and computation: the former, because different features can be analyzed separately, and the latter, because submodels can be trained independently and in parallel.
In comparison to previously described parametric binning methods, our model incorporates two new types of features. The first is relative differential coverage which, to our knowledge, has not previously been modeled with binomials to account for systematic bias in the read mapping of different genome regions. As such, the binomial submodel represents the parametric equivalent of covariance distance clustering. The second new type is taxonomic annotation, which substantially improved the classification results in our simulation. Taxonomic annotations, as used in the model and in our simulation, were not correct up to the species level and need not be, as seen in the classification results. We only require that the same annotation method be applied to all sequences. In comparison to previous methods, our aggregate model has weight parameters to combine the different feature types and allows tuning of the bin posterior distribution by selection of an optimal smoothing parameter $\beta$.
We showed that probabilistic models represent a good choice for handling metagenomes with short contigs or few sample replicates, because they make soft, not hard, decisions and because they can be applied in numerous ways. When the individual submodels are trained, genome bin properties are compressed into few model parameters, such as mean values, which are mostly robust to outliers and therefore tolerate a certain fraction of bin pollution. This property allows contigs to be reassigned to bins, which we demonstrated in the "Genome bin refinement" section. Measuring the performance of the individual submodels and their corresponding features on short simulated contigs ([@tbl:classification_consistency]), we find that they discriminate genomes or species pan-genomes to varying degrees. Genome abundance represents, in our simulation with four samples, the weakest single feature type, and will likely become more powerful with increasing sample numbers. Notably, genomes of individual strains are more difficult to distinguish than species-level pangenomes using any of the features. In practice, when not using idealized assemblies as in our current evaluation, strain resolution poses a problem to metagenome assembly, which is currently not resolved in a satisfactory manner [@SczyrbaCritical2017].
The current MGLEX model is somewhat crude, as it makes many simplifying assumptions in the submodel definitions. For instance, the multi-layer model for taxonomic annotation assumes that the probabilities in different layers are independent, the series of binomials for relative abundance could be replaced by a multinomial to account for the parameter dependencies, and the absolute abundance Poisson model could incorporate overdispersion to model the data more appropriately. Addressing these simplifications can further improve performance while the overall framework and usage of MGLEX stay unchanged. When we devised our model, we had an embedding into more complex routines in mind. In the future, the model can be used in inference procedures such as EM or MCMC to infer or improve an existing genome binning. Thus, MGLEX provides a software package for use in other programs. However, it also represents a powerful stand-alone tool for the adept user in its current form.
Currently, MGLEX does not yet support multiple processors and only provides the basic functionality presented here. However, training and classification can easily be parallelized because they are expressed as matrix multiplications. The model requires sufficient training data to robustly estimate the submodel weights $\bm{\alpha}$ and the genome model parameters.
Our open-source Python package MGLEX provides a flexible framework for metagenome analysis and binning which we intend to develop further together with the metagenomics research community. It can be used as a library to write new binning applications or to implement custom workflows, for example to supplement existing binning strategies. It can build upon a present metagenome binning by taking assignments to bins as input and deriving likelihoods and p-values that allow for critical inspection of the contig assignments. Based on the likelihood, MGLEX can calculate bin similarities to provide insight into the structure of the data and community. Finally, genome enrichment of metagenomes can improve the recovery of particular genomes in large datasets.
We thank S. Reimering, A. Weimann and A. Bremges for proofreading and constructive feedback.