nQuire implements a set of commands to estimate ploidy level of individuals from species, where recent polyploidization occurred and intraspecific ploidy variation is observed. Specifically, nQuire uses next-generation sequencing data to distinguish between diploids, triploids and tetraploids, on the basis of frequency distributions at variant sites where only two bases are segregating.
For more background see also the publication at BMC Bioinformatics.
To use nQuire, you will need a Linux machine where the gcc compiler, as well as the libz, libm and libpthread system libraries are installed. To acquire nQuire, first clone the repository recursively:
git clone --recursive https://github.com/clwgg/nQuire
This will clone both the nQuire code, as well as the htslib module, which is its only external dependency. After cloning, first compile the submodule, and then the nQuire code:
cd nQuire
make submodules
make
This will create the nQuire binary, which you can copy or move anywhere for subsequent use.
When updating to the current version, please make sure to also update the submodules:
git pull origin master
git submodule update
make submodules
make
nQuire requires as input only a .bam
alignment file, which contains
NGS reads mapped to a suitable reference genome. The required
information is extracted from the .bam
alignments into a binary file
(suffix .bin
), which is generated using the create
subcommand. By
default, only sites where two bases are segregating at a minimum
frequency of 0.2
are reported. These settings are recommended to use
for the subsequent analyses of ploidy. The minimum frequency may be
adjusted using the -f
flag. It is also possible to establish filters
for mapping quality and minimum coverage through the -q
and -c
options, respectively. By default, the minimum mapping quality is set to 1
, and
the minimum coverage to 10
. It is important to establish a reasonable minimum
coverage cutoff, as base frequencies at biallelic sites have to be assessed.
nQuire create -b input.bam -o base
The .bin
file can also store reference IDs and positional information
(genomic coordinates) when the create
subcommand is used with the
-x
flag.
nQuire create -b input.bam -o base -x
It is now possible for nQuire to interact with BED
files in two
ways. In both cases the BED
file has to include the following columns:
- Sequence/Chromosome name
- Start coordinate (0-based)
- End coordinate (1-based)
- Region name
Per default when a BED
file is supplied, nQuire will create a separate
.bin
file for each region. The output file will be named [base]-[region_name].bin
.
This allows running the model by
region, chromosome or window, for example to detect aneuploidies.
nQuire create -b input.bam -o base -r region.bed
Alternatively, all BED
regions can be concatenated into one file. This is useful
for example to query specific annotations or regions targeted with capture
experiments. The output file will have the name [base]-bedcc.bin
.
nQuire create -b input.bam -o base -r region.bed -y
The file produced by the create
subcommand can be visually inspected
using either the view
or histo
subcommands. The histo
subcommand
will produce a ASCII histogram on the command line based on the base
frequencies stored in the .bin
file.
nQuire histo base.bin
The view
subcommand allows to inspect the coverage and base counts
at all positions in the file, as well as filter them to produce a new
.bin
file.
nQuire view base.bin
The columns in the output of the view
subcommand represent the
following:
- Coverage
- Base count 1
- Base count 2
…
If an extended .bin
was created using the -x
flag of the create
subcommand, the view
subcommand can interact with this file too.
Since it is much easier to store reference IDs instead of reference
names, the view
subcommand allows to re-annotate these reference IDs
with their corresponding names if the .bam
file that the .bin
file was
created from is handed to it via the -a
flag.
nQuire view base.bin -a input.bam
The columns in the output of the view
subcommand used on an extended
.bin
file represent the following:
- Reference sequence (ID)
- Reference position (0-based)
- Coverage
- Base count 1
- Base count 2
…
Using the -f
flag of the view
subcommand one can query the type of
the .bin
, which so far is 0
for the default format, and 1
for the
extended format.
nQuire view -f base.bin
In many cases, the base frequency histogram contains a high baseline
of noise, which results mostly from mismappings and is elevated in
highly repetitive genomes. This can to some extend be handled using a
stringent mapping quality cutoff in the creation of the .bin
(e.g. -q 30
).
To tackle this problem more efficiently, nQuire also contains the
subcommand denoise
. It uses a Gaussian Mixture Model with Uniform
noise component (GMMU, for more information please refer to the next
section “Model” or the publication referenced above) to assess the
extent of this uniform noise, and scales it down allowing to easily
detect peaks in the histogram of base frequencies.
nQuire denoise base.bin -o base_denoised
The denoise
subcommand also returns the percentage of information
kept after the denoising procedure. If this value is suspiciously low,
there might not be enough data left for subsequent testing. Please
inspect the histogram also with the histo
command before and after
denoising to visually assess the shape of the distribution of base
frequencies.
The main testing framework of nQuire utilizes a Gaussian Mixture Model (GMM, please refer to the next section “Model” as well as the publication referenced above), which describes the histogram as a mixture of Gaussians with varying means and mixture proportions. The likelihood of certain assumptions based on this model given the empirical data is maximized using an Expectation-Maximization (EM) algorithm.
The most important subcommand using the GMM is
lrdmodel
. This is a mixture of the three fixed models from
modeltest
and the free model in estmodel
, as all four of those
models are used. Subsequently, the maximized log-likelihood of the
three fixed models are subtracted from the maximized log-likelihood of
the free model to get three delta log-likelihoods. As the
log-likelihood of the free model can basically be seen as the
“optimum” for the empirical data under the assumptions of this model,
the higher the delta log-likelihood of a fixed model, the further it
is from the optimum and the lower is the support for the corresponding
ploidy level.
nQuire lrdmodel base.bin
Since this is the major analysis step of the tool, it allows for multithreading
over multiple input files. These may be different samples, or different regions
of the same bam file split by BED
regions (see section on the create
subcommand).
nQuire lrdmodel -t n_threads file1.bin [file2.bin ...]
The output from lrdmodel
contains 8 tab-separated columns:
- Filename
- Free model maximized log-likelihood
- Diploid fixed model maximized log-likelihood
- Triploid fixed model maximized log-likelihood
- Tetraploid fixed model maximized log-likelihood
- Diploid delta log-likelihood
- Triploid delta log-likelihood
- Tetraploid delta log-likelihood
The modeltest
subcommand maximizes
the likelihood under the assumption of either di-, tri- or tetraploidy
where mean and mixture proportions are fixed, and only the standard
deviation of the Gaussians is varied.
nQuire modeltest base.bin
It returns the log-likelihood for each of the assumed ploidy levels, together with the standard deviation of the Gaussians included in that model.
When running the estmodel
subcommand no assumptions
are made and the EM-algorithm maximizes the likelihood of a mixture of
three Gaussians given the empirical data freely.
nQuire estmodel base.bin
The result is the maximized log-likelihood when parameters can be varied freely, as well as all parameter estimates for the three Gaussians (mixture proportion, mean and standard deviation).
The simpler framework just uses ideal histograms under the
assumption of each of the ploidy levels (diploid: N(0.5,0.05);
triploid: N(0.33,0.04) + N(0.67,0.04); tetraploid: N(0.25,0.04) +
N(0.5,0.05) + N(0.75,0.04)) and does linear regression on the y-values
of the empirical and the ideal histograms. The subcommand for that is
histotest
.
nQuire histotest base.bin
histotest
reports for each ploidy level the sum of squared residuals
(SSR) of empirical vs. ideal histograms, as well as the slope, its
standard error and the R2 of the regression of y-values. A good fit
between ideal and empirical histograms is characterized by low SSR,
positive slope with low standard error, as well as a high R2.
At the heart of nQuire is a Gaussian Mixture Model (GMM) which is used
in the modeltest
, estmodel
and lrdmodel
subcommands. For the
denoise
subcommand it is extended to a Gaussian Mixed Model with
Uniform noise component (GMMU).
The GMM aims to model the read frequency histogram as a mixture of up to three Gaussian distributions between 0 and 1, that are scaled relatively to each other by some mixture proportion. This model can be used for parameter estimation through maximum likelihood estimation using an Expectation-Maximization (EM) algorithm, as well as model comparison when we have specific expectations about our data. We use up to three Gaussians, because the expected distributions of read frequencies at biallelic sites for each of our ploidy levels of interest are one Gaussian with mean 0.5 for diploid, two Gaussians with means 0.33 and 0.67 for triploid, and three Gaussians with means 0.25, 0.5 and 0.75 for tetraploid. We can fix these values in the GMM to assess the maximal log-likelihood under each of the three assumptions (three fixed models). Additionally we can estimate the parameters without constraints to get the maximal log-likelihood under complete freedom (one free model). The comparison of maximized log-likelihoods under the fixed models to the free model then allows us to assess how close each of these three ploidy assumptions are to the optimum under the GMM model.
For the denoise
command there is a fourth component added to the
three Gaussians, which has uniform probability density and only its
mixture proportion can be varied. Together with a free model for the
three Gaussians, the model under maximized likelihood allows us to
assess the proportion of uniform noise in the histogram.