taxMaps is an ultra-efficient, customizable and fully scalable taxonomic classification tool for short-read data, designed to deal with large DNA/RNA metagenomics samples. Its performance and comprehensiveness make it highly suitable for unbiased contamination detection in large-scale sequencing operations, microbiome studies comprising a large number of samples, and applications where analysis delivery time is a critical factor, such as pathogen identification from clinical or environmental samples.
- Version: 0.2
- Author: Andre Corvelo, New York Genome Center
taxMaps is freely available for academic and non-commercial research purposes (see `LICENSE.txt`).
- Flexible input - `.bam`, `.fastq` (compressed/uncompressed, interleaved or not) or any custom command that streams `.fastq` to `STDOUT`.
- Thorough preprocessing - quality trim, multiple adapter removal, hard trim (5' and 3'), low complexity filter.
- Mapping prioritization - through the use of multiple custom indexes (useful for spike-in and host subtraction).
- Performance - taxMaps takes full advantage of the outstanding features and performance of GEM. When using pre-compressed indexes, this state-of-the-art aligner can map several million reads per hour in a very sensitive manner (up to 20% edit distance) against a single-file nucleotide database encompassing more than 370 Gbases, including NCBI's nt database, microbial RefSeq and thousands of microbial and viral genomes, on a single 16-CPU machine.
- Detailed mapping reports - including coverage and edit-distance histograms per taxa, which can be visualized interactively.
- SGE support.
- Installation
- Databases
- Running taxMaps
- Quick reference
- python 2.7 www.python.org
- numpy 1.7.0 or higher www.numpy.org
- samtools 0.1.19 or higher http://sourceforge.net/projects/samtools
- cutadapt 1.4.1 or higher code.google.com/p/cutadapt
- PRINSEQ-lite 0.20.3 or higher prinseq.sourceforge.net
- GEM pre-release 3 sourceforge.net/projects/gemlibrary
- Krona 2.4 http://sourceforge.net/projects/krona
All of the executables from the programs above should be in your `$PATH`.
For Krona, you will need to set `PERL5LIB` with the following command:
$ export PERL5LIB=path_to_krona/lib/
After setting all the dependencies, confirm that you have the right version of `python` and `numpy`:
$ python --version ; python -c "import numpy; print 'NumPy ' + numpy.__version__"
and that you can run the following commands: `samtools`, `cutadapt`, `prinseq-lite.pl`, `gem-indexer`, `gem-mapper` and `ktImportText`.
taxMaps is available through GitHub and can be obtained using `git` with the following command:
$ git clone git://github.com/nygenome/taxmaps.git
This will download the taxMaps repository locally into a `taxmaps/` directory.
Alternatively, you can go to the "Releases" tab at https://github.com/nygenome/taxmaps and download the latest version (`tar.gz`). In this case, you need to extract taxMaps with the following command:
$ tar -zxvf taxmaps.{latest_version}.tar.gz
taxMaps doesn't require installation. You can copy `taxmaps/` anywhere you want and, to run `taxMaps` from anywhere, you just need to make sure that `{path_to_taxMaps}/` gets added to your `PATH`:
$ export PATH=$PATH:{path_to_taxMaps}
Adding the line above to your `.bashrc` or `.bash_profile` configuration file will make it permanent.
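For example, assuming taxMaps was extracted to `/opt/taxmaps` (a hypothetical location), you could make the change permanent with:
$ echo 'export PATH=$PATH:/opt/taxmaps' >> ~/.bashrc
$ source ~/.bashrc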
taxMaps uses pre-built indexes that are available on the NYGC public FTP site. These indexes have been compressed using kmer LCA pre-assignation for several different kmer sizes, which greatly improves mapping efficiency against large databases. To ensure maximum mapping fidelity, you should pick a kmer size just above your read length or hard-trim your reads down to match the kmer size used in compression. For the latter, you can use the corresponding taxMaps option `-L`. You should also use the `taxonomy.tbl` file provided.
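As a sketch, assuming you downloaded an index compressed with a kmer size of 100 (the `nt_k100.gem` index name below is hypothetical), you could hard-trim your reads down to the kmer size like this:
$ taxMaps -f reads.fq -L 100 -d nt_k100.gem -t taxonomy.tbl ...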
We strongly recommend the use of the pre-built compressed indexes. While we provide scripts to generate taxMaps-compatible indexes, we do not offer support for them.
taxMaps requires a taxonomic table file for several of its operations. To generate this table you need to download taxdump.tar.gz from the NCBI FTP site and extract its contents with the following command:
$ tar -zxvf taxdump.tar.gz
Among other files, you should find `names.dmp` and `nodes.dmp`. The following command:
$ taxMaps-taxtbl -n names.dmp -t nodes.dmp > taxonomy.tbl
will generate a tab-delimited file named `taxonomy.tbl`, which contains one taxonomic entity per line with the following field order: `TaxId`, `Rank`, `Name` and `PathFromRoot`.
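For instance, since the file is tab-delimited with `Name` in the third field, you could look up the entry for a given taxon name with a one-liner like this (the taxon name below is just an example):
$ awk -F'\t' '$3 == "Escherichia coli"' taxonomy.tbl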
To generate your indexed database you need to download gi_taxid_nucl.dmp.gz from the NCBI FTP site and extract it using the following command:
$ gunzip gi_taxid_nucl.dmp.gz
This will produce `gi_taxid_nucl.dmp`, which contains the correspondence between `GI` and `TaxID`.
Ensure that your sequence database is in FASTA format and that the FASTA headers are in the following NCBI format: `>gi|1234|...`. The `GI` (1234 in this case) must be in the second field of the `|`-delimited header. Also, your FASTA file should not contain any duplicated `GI`s.
At this point you should have all the required files to run `taxMaps-index` with the following command:
$ taxMaps-index -i {your_db}.fa -c gi_taxid_nucl.dmp -t taxonomy.tbl -p {prefix}
At the end of this process, you should have, among others, 4 files:
- A formatted `{prefix}.gitax.fa` file where the headers have been changed to `>GI_TaxID`.
- A `{prefix}.discarded.fa` file with sequences for which there was no associated `TaxID` (due to NCBI's frequent updates, sometimes not all files are in sync). These sequences will not show up in your index.
- A `{prefix}.len` tab-delimited file specifying the length of each sequence in your database, containing the following fields: `SeqID` and `Length`.
- A `{prefix}.gem` index file for `gem-mapper`.
Generating an index can take from a few minutes to several hours, depending on the size of your database. You can speed it up by running `taxMaps-index` on multiple threads using the option `-T`. You can also submit your job to an SGE cluster by specifying a partition/queue with the option `-Q`. The number of slots can be specified using the option `-S`.
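For example, to build the index on 8 threads (the database and prefix names below are placeholders):
$ taxMaps-index -i my_db.fa -c gi_taxid_nucl.dmp -t taxonomy.tbl -p my_db -T 8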
To run taxMaps, first make sure you have all the required files:
- Taxonomic table (`taxonomy.tbl`)
- Database index (`{db}.gem`)
- Length file (`{db}.len`)
- Short read sequences (`{reads}.fq`, `{reads}.bam` or any other format that can be converted into `.fq` with a single command to `STDOUT`)
A simple taxMaps command should look something like this:
$ taxMaps -f reads.fq -d refseq_viral.gem -e 0.16 -c 8 -t taxonomy.tbl -p samp1 -o txM/sample1
The command above will use `8` CPUs to map all reads in `reads.fq` to `refseq_viral.gem`, allowing up to an edit distance of 16% (`0.16`). All the results will be written to `$PWD/txM/sample1` with the prefix `samp1`.
By default, taxMaps runs in the current working directory. It is highly recommended to use the option `-o` to specify an output directory that will contain all the results and log files. Doing so will keep your analysis results organized. It is also recommended to provide a sample prefix with `-p`.
The most basic type of input for taxMaps is FASTQ (`.fq` or `.fastq`). You can specify a single file with the `-f` option.
In case you have chosen a 'paired-end' taxonomic classification mode (`-m p` or `-m P`), this file should be interleaved, with the sequence identifiers for read 1 and read 2 ending in `/1` and `/2`, respectively. taxMaps can also deal with Casava 1.8 sequence identifiers (see Illumina sequence identifiers).
If your read data is separated into two files (one for read 1 and another for read 2), you can specify both files with the `-1` and `-2` options. Again, the sequence identifiers must be as explained above and both files should be in perfect synchrony.
taxMaps accepts compressed FASTQ files (`.fq.gz` or `.fastq.gz`).
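For example, a paired-end run on two gzipped files (the file names are placeholders) could look like this:
$ taxMaps -1 reads_R1.fq.gz -2 reads_R2.fq.gz -m p -d refseq_viral.gem -t taxonomy.tbl ...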
If you are trying to map previously unmapped reads to a specific host and you have the alignments in `.bam` format, you can use the `-b` option. You need to make sure you have the corresponding `.bai` index file in the same folder. taxMaps will extract the unmapped reads from your `.bam` file. If you have chosen a 'paired-end' classification mode (`-m p|P`), taxMaps will only extract pairs for which both mates are unmapped.
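A sketch of such a run (the file name is a placeholder; its index file is assumed to sit in the same folder):
$ taxMaps -b sample.bam -m P -d ncbi.gem -t taxonomy.tbl ...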
Finally, taxMaps can take custom input commands, as long as they write FASTQ to `STDOUT`, by using the option `-i`. In this case, you must use absolute paths to any file in your command. Here's an example:
$ taxMaps -i "zcat /data/project_x/sample*.fq.gz" ...
If you wish to downsample your data, you can specify the downsampling probability as a float between 0 and 1 through the option `-D`. For instance, if you want taxMaps to classify 10% of your data, you should use:
$ taxMaps -f reads.fq -D 0.1 ...
With taxMaps you can process your data prior to mapping. Controlling the quality of the input data usually improves mapping sensitivity and specificity.
taxMaps expects quality scores to be encoded in Phred+33 format (see FASTQ encoding). If your quality data is encoded in Phred+64 format, you should add the `--phred64` flag to your command (usually not required for Illumina 1.8+, Sanger, Roche/454, Ion Torrent or PacBio data).
Using the option `-q`, you can set the threshold for 3' quality-based trimming of your read data. taxMaps will pass this option to `cutadapt`. If you wish to remove low-quality bases from both ends of the reads, just specify the quality threshold values for the 5' and 3' ends, separated by `,` (e.g. `-q 20,20`).
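For example, to trim bases below Q20 from both ends of every read before mapping:
$ taxMaps -f reads.fq -q 20,20 ...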
To remove adapter sequence, you can pass a string of `cutadapt` options using the option `-a`. For instance, if you want to remove the adapter sequence `AGATCGGAAGAGC` from both ends of your reads in two passes, you should specify `-a "-b AGATCGGAAGAGC -n 2"`. You can also trim multiple adapter sequences (see the cutadapt user manual). You should not pass any cutadapt options related to input and output, as these will likely cause taxMaps to crash (e.g., `-a "-b AGATCGGAAGAGC -o output.fq"`).
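As a sketch, multiple adapters can be removed by repeating the corresponding cutadapt flag inside the quoted string (the second sequence below is only illustrative):
$ taxMaps -f reads.fq -a "-b AGATCGGAAGAGC -b CTGTCTCTTATACACATCT -n 2" ...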
By default, and through `prinseq`, taxMaps filters out low-complexity reads, defined as reads with entropy values lower than `70` or with more than `4`% of ambiguous positions (Ns). Both cutoffs can be changed with the options `-C` and `-N`, respectively.
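For example, to relax the entropy cutoff and tighten the limit on ambiguous bases (the values are chosen only for illustration):
$ taxMaps -f reads.fq -C 60 -N 2 ...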
To set an upper bound for read length, you can use the option `-L`. taxMaps will trim your reads down to the specified length from the 3' end. To trim your read data from the 5' end, you should use the option `-w` and specify the number of bases to be clipped.
Finally, with the option `-l` you can control the minimum length a read should have to be eligible for mapping. It defaults to `50`. Very low values will lead to poor mapping specificity.
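Putting the three length options together (the values are illustrative), the command below clips 5 bases from the 5' end, trims reads down to at most 100 bp from the 3' end and requires at least 75 bp for mapping:
$ taxMaps -f reads.fq -w 5 -L 100 -l 75 ...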
In taxMaps, you must specify the database index file with the option `-d`. Just make sure that for every index `{db}.gem` you have the corresponding `{db}.len` in the same directory.
By default, taxMaps allows up to 20% edit distance. This threshold can be decreased in situations where greater precision is desired. With the option `-e`, you can set the maximum edit distance allowed. You must provide a number between 0 and 1, and the edit distance will be considered as a fraction of the read length. For instance, if your read length is 150bp, the following command:
$ taxMaps -f reads.fq -d ncbi.gem -e 0.2 ...
will allow a maximum absolute distance of 30 edits. It is not recommended to use values greater than 20% of the read length, as this can severely affect mapping performance and/or specificity.
Mapping performance is directly dependent on the number of threads used by the mapper (GEM). This is particularly true when mapping against very large databases, such as NCBI's nt. With the `-c` option, you can increase the number of threads used for mapping from the default of `1`. The following command:
$ taxMaps -f reads.fq -d ncbi.gem -c 16 ...
will use `16` CPUs.
If you want to map against more than one index in a sequential manner, you should provide a `,`-delimited list of index files:
$ taxMaps -f reads.fq -d phix.gem,human.gem,ncbi.gem ...
The command above will first map all reads against `phix.gem`. Then, unmapped reads will be mapped against `human.gem` and, finally, reads that are still unmapped will be mapped against the `ncbi.gem` index. When mapping against more than one index, you can still control the abovementioned parameters in an index-specific manner by providing `,`-delimited lists of values for `-e` and `-c`:
$ taxMaps -f reads.fq -d phix.gem,human.gem,ncbi.gem \
-e 0.1,0.2,0.2 -c 4,16,16 ...
To run taxMaps, you need to specify, through the `-t` option, the path to the taxonomic table downloaded from the NYGC public FTP.
For read classification you can choose between three LCA modes:
- 'single-end' (`-m s`) - reads are classified independently. This is the default option.
- 'paired-end' (`-m p`) - reads are classified as pairs. In case one of the mates has been discarded in the filtering step or simply didn't map, the pair will be classified solely based on the other mate. This mode is more specific than the 'single-end' mode.
- 'paired-end strict' (`-m P`) - reads are also classified as pairs, but only when both mates are mapped. This is the most specific classification mode.
For the paired-end modes, you need to provide either an interleaved FASTQ file, two 'in-sync' FASTQ files, an indexed BAM file where unmapped mates are contiguous, or a custom command that writes interleaved FASTQ to `STDOUT`.
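For instance, a 'paired-end strict' run on an interleaved file (the file name is a placeholder) could look like this:
$ taxMaps -f interleaved.fq -m P -d ncbi.gem -t taxonomy.tbl ...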
taxMaps can compute basepair coverage histograms for all the taxa represented in your sample. This option is turned OFF by default, but you can turn it ON by using the flag `--cov`. Bear in mind that setting this option ON will increase the memory requirements of taxMaps.
For the sake of computational efficiency, you can exclude specific taxa from the coverage analysis. For that, you need to provide a `,`-delimited list of TaxIds using the option `-x`. This option can be used, for instance, in situations in which you are not interested in coverage over certain sequences, such as spike-ins or hosts. The following command:
$ taxMaps -f reads.fq -d phix.gem,human.gem,ncbi.gem --cov -x 9606,374840 ...
will not compute any coverage histogram for phiX (`374840`) and human (`9606`) sequences.
In taxMaps you can also exclude from your report taxa with few reads assigned. These are usually non-specific hits or mapping artifacts. By default, taxMaps will not report taxa with fewer than 10 reads assigned per million reads in your sample (`-z 0.00001`). You can also specify a minimum number of reads regardless of the proportion. For that, you should use integer values:
$ taxMaps -f reads.fq -d ncbi.gem -z 15 ...
The command above will report taxa with at least `15` reads classified as such.
By default, taxMaps will run in the current working directory, but it is strongly recommended to specify a different output directory through the option `-o` for every run.
The output directory contains the following subdirectories:
- `txM.{prefix}.base/` - contains either links to your read data file(s) or a `.sh` script with the custom input command specified using the option `-i`, links to the specified indexes and corresponding length files, and the taxonomy table specified by `-t`.
- `txM.{prefix}.map/` - in this directory you will find all the `.map` files and a `.map.lca` file with the mapping results plus the taxonomic classification for every read (see below for format).
- `txM.{prefix}.out/` - this is where the summary results will be written. It also contains the graphical and tabular reports.
- `txM.{prefix}.log/` - where you can find the log files for `cutadapt`, `prinseq`, `GEM` and `samtools` (if your input data is in `bam` format).
In the output directory you will also find a `.sh` file with all the commands that taxMaps will run. Also, if you used the `-Q` option for SGE, you will find a `.sge` custom specification file, and an additional subdirectory will be created upon submission, containing several files related to the SGE run.
After mapping, taxMaps will merge all the `.map` files and perform the taxonomic classification of every read, based on all its best hits. These results can be found in `{prefix}.merged.map.lca`, located in `{out_dir}/txM.{prefix}.map/`. The format is an extended GEM alignment format:
- `Name` - Read name.
- `Sequence` - Read sequence.
- `Quality` - Read quality.
- `MatchSummary` - Histogram-like `:`-delimited string of the number of hits per edit distance (e.g., `0:1:0:2` means that there was `1` hit at distance 1 and `2` hits at distance 3).
- `Alignments` - `,`-separated list of alignments (see the GEM format for a complete description).
- `TaxId` - NCBI taxon identifier.
- `TaxRank` - NCBI taxon rank.
- `TaxName` - NCBI taxon name.
- `PairMappingStatus` - Two-character mapping status code (`F` filtered out, `U` unmapped, `M` mapped - one character per mate). *
- `PairedTaxId` - NCBI taxon identifier. *
- `PairedTaxRank` - NCBI taxon rank. *
- `PairedTaxName` - NCBI taxon name. *
* The last 4 fields are exclusive for paired-end classification modes.
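As a minimal post-processing sketch, assuming the file is tab-delimited with one column per field in the order listed above (so `TaxId` would be the 6th column), you could pull out the names of all reads assigned to a given taxon (the TaxId below is just an example):
$ awk -F'\t' '$6 == "562" {print $1}' samp1.merged.map.lca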
taxMaps then summarizes all the mapping information into `{prefix}.merged.map.lca.summary`, located in `{out_dir}/txM.{prefix}.out/`. This file consists of one-line records for every taxonomic entity identified in the sample or with at least one hit. Here are the fields:
- `TaxId` - NCBI taxon identifier.
- `TaxRank` - NCBI taxon rank.
- `TaxName` - NCBI taxon name.
- `TaxPath` - Path from the root (`1`) of the taxonomic tree, given as a `:`-separated list of `TaxId`s.
- `ReadsClassified` - Number of reads classified as `TaxName`.
- `ReadsMappedSpec` - Number of reads mapped to `TaxName` in a specific manner.
- `ReadsMappedNoSpec` - Number of reads mapped to `TaxName` in a non-specific manner.
- `PairsStrict` - Number of read pairs classified as `TaxName` where both mates have the same mapping status (`MM`, `UU` or `FF` as `PairMappingStatus`).
- `PairsNoStrict` - Number of read pairs classified as `TaxName` where the mates have different mapping statuses (`MF`, `FM`, `MU`, `UM`, `UF` or `FU` as `PairMappingStatus`).
- `DistanceHistogram` - `:`-delimited distance histogram string. For instance, if `TaxName` had two reads mapping at distance `0` and three at distance `2`, the edit histogram string would look like this: `2:0:3`.
- `NumberOfAlignments` - Total number of alignments to `TaxName`.
- `BasesAligned` - Total number of read bases aligned to `TaxName` sequences.
- `BasesCovered` - Number of `TaxName` bases covered.
- `TotalBases` - Total number of bases in `TaxName` sequences.
- `CoverageHistogram` - The coverage histogram string follows a log progression, with the first value representing the number of uncovered bases, the second the number of 1x-covered bases, the third the number of 2-3x-covered bases, the fourth the number of 4-7x-covered bases, and so on. The string `10:6432:456:0:120:0...` means that there were `10` uncovered bases, `6432` bases at 1x coverage, `456` at 2-3x, `0` between 4-7x and `120` between 8-15x coverage.
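As a sketch, assuming one tab-delimited column per field in the order above (so `ReadsClassified` would be the 5th column), you could list the most abundant taxa in the summary like this:
$ sort -t$'\t' -k5,5nr samp1.merged.map.lca.summary | cut -f1-5 | head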
In this tab-delimited file (`{out_dir}/txM.{prefix}.out/{prefix}.tbl`) you will find the read counts for the classic taxonomic ranks (`kingdom`, `phylum`, `class`, etc.). Read counts for a given taxon are not limited to reads classified as that taxon, but also include all reads mapping to the taxa under it. These are the fields in the report table:
- `TaxId` - NCBI taxon identifier.
- `TaxRank` - NCBI taxon rank.
- `TaxLevel` - Number of nodes in `TaxPath`.
- `TaxPath` - Path from the root (`1`) of the taxonomic tree, given as a `:`-separated list of `TaxId`s.
- `TaxName` - NCBI taxon name.
- `ReadCounts` - Number of reads classified under `TaxName`.
- `PercentFromTotal` - Corresponding percentage relative to the root of the report.
Finally, taxMaps generates an independent `.html` document that allows you to explore your results in an interactive manner (`{out_dir}/txM.{prefix}.out/{prefix}.krona.html`). For that, you only need a web browser and internet access. In the interactive HTML5 chart generated using Krona, taxonomic entities are displayed as nested sectors from the top level (center) to the bottom (outward) of the hierarchy.
For more information about Krona, see the project website.
If you work in an SGE environment, you can submit your taxMaps job directly to an SGE cluster by specifying the partition/queue with `-Q` and the number of slots with `-S`. For instance, the command:
$ taxMaps -f reads.fq -d ncbi.gem ... -Q research.q -S 20
will generate a custom file with SGE specifications and submit your taxMaps job to the `research.q` queue, requesting `20` slots.
You can do a dry run of taxMaps by using the flag `--dry`. taxMaps will create the output directory, link the input files and generate the master `.sh` script as usual, but it will not launch it. In case you used the option `-Q` for an SGE run, taxMaps will also generate the SGE specification file, but it will not submit the job. Dry runs are good for debugging purposes and for ensuring that the paths to the input files are correct. Moreover, you can hack a taxMaps run by altering the `.sh` script to your convenience and launching it independently.
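For example, to inspect the generated commands without launching anything:
$ taxMaps -f reads.fq -d ncbi.gem -t taxonomy.tbl -p samp1 -o txM/dry_test --dry
You can then review the `.sh` file written to the output directory before running it yourself.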
`taxMaps` is a pipeline-generating script. Given an input, a list of indexes and options, it generates a working directory and a `.sh` script with all the commands to be executed. If an SGE partition/queue is specified with the `-Q` option, it submits the job.
- `--version` - show program's version number and exit
- `-h`, `--help` - show this help message and exit
Input
- `-i` - Input command (absolute paths!). Quoted. Interleaved `fq` on `STDOUT`. [STR, Mandatory | -f | (-1 & -2) | -b]
- `-f` - Input `fq`, `fastq`, `fq.gz` or `fastq.gz`. Interleaved. [CSV_FILE_LIST, Mandatory | -i | (-1 & -2) | -b]
- `-b` - Input `.bam` file. Requires `.bam.bai` in the same folder. [FILE, Mandatory | -i | -f | (-1 & -2)]
- `-1` - Input `fq`, `fastq`, `fq.gz` or `fastq.gz` read 1. In sync with, and of the same sort as, `-2`. [CSV_FILE_LIST, Mandatory | -f | -i | -b]
- `-2` - Input `fq`, `fastq`, `fq.gz` or `fastq.gz` read 2. In sync with, and of the same sort as, `-1`. [CSV_FILE_LIST, Mandatory | -f | -i | -b]
- `-D` - Downsampling probability. [FLOAT, 1]
Preprocessing
- `-q` - Cutadapt quality cutoff value (default = None). [INT, ]
- `-a` - Adapters string. Quoted. e.g., "-a AGATCGGAAGAGC -n 2" (default = None). [STR, ]
- `-l` - Minimum read length for mapping. [INT, 50]
- `-L` - Maximum read length (hard trimming). [INT, ]
- `-w` - 5' trim length (hard trimming). [INT, ]
- `-C` - Entropy cutoff for low-complexity filtering. Use 0 for no filtering. [INT, 70]
- `-N` - Filter reads with more than N% of 'N' characters. Use 100 for no filtering. [INT, 4]
- `--phred64` - Quality scores in Phred+64 format. [FLAG, OFF]
Mapping
- `-d` - Index file(s). CSV. [FILE_LIST, Mandatory]
- `-e` - Edit distance(s). CSV. Value(s) between 0 and 1. [FLOAT_LIST, 0.2]
- `-c` - Number of CPUs. CSV. [INT_LIST, 1]
Taxonomy
- `-t` - Taxonomic table file. [FILE, Mandatory]
- `-m` - Taxa determination mode ('s' single-end; 'p' paired-end; 'P' paired-end strict). [STR, s]
Reporting
- `--cov` - Compute coverage histograms. [FLAG, OFF]
- `-x` - Excluded taxids. CSV (only if --cov flag ON). [STR_LIST, ]
- `-z` - Reporting cutoff (INT or FLOAT < 1). [FLOAT, 0.00001]
Miscellaneous
- `-p` - Sample prefix. [STR, sample]
- `-o` - Output directory. [DIR, .]
- `-Q` - Cluster queue/partition. [STR, ]
- `-S` - Cluster slots (default = 20). [INT, 20]
- `--dry` - Dry run. [FLAG, OFF]
`taxMaps-taxtbl` is a script that generates the taxonomic table required by `taxMaps`, containing all taxids, their corresponding descriptions and paths-from-root. It writes to `stdout`.
- `--version` - show program's version number and exit.
- `-h`, `--help` - show this help message and exit.
- `-n` - NCBI Taxonomy names.dmp file. [FILE, Mandatory]
- `-t` - NCBI Taxonomy nodes.dmp file. [FILE, Mandatory]
`taxMaps-index` is a script that formats the FASTA headers and generates the GEM indexes required by `taxMaps`.
- `--version` - show program's version number and exit.
- `-h`, `--help` - show this help message and exit.
- `-i` - Input fasta file. [FILE, Mandatory]
- `-c` - Correspondence (gi -> tax) file. [FILE, Mandatory]
- `-t` - Taxonomy table file. [FILE, Mandatory]
- `-p` - Prefix for output files. [STR, Mandatory]
- `-T` - Number of threads. [INT, 1]
- `-Q` - Cluster queue/partition. [STR, ]
- `-S` - Cluster slots. [INT, 1]
- `--dry` - Dry run. [FLAG, OFF]
- `txM_fastalen` - Computes the length of `FASTA` sequences.
- `txM_fq2gem` - Converts `FASTQ` into GEM mapper format.
- `txM_fqhtrim` - Trims reads from the 3' end down to the specified length.
- `txM_fqintlv` - Interleaves read 1 and read 2 FASTQ files.
- `txM_fqltrim` - Trims the specified number of bases from the 5' end of reads.
- `txM_fqmate` - Checks for the presence of mates when extracting from `BAM`.
- `txM_fqminlen` - Filters out reads that are shorter than the specified read length.
- `txM_fqsample` - Downsamples `FASTQ`.
- `txM_gem2fq` - Converts GEM mapper format into `FASTQ`.
- `txM_gitax` - Formats `FASTA` headers for taxMaps.
- `txM_lca` - Given GEM mapper output, computes the LCA for all reads, either in single-end or paired-end mode.
- `txM_mapout` - Filters mapped reads.
- `txM_mergintlv` - Merges and interleaves multiple GEM mapper files, assuming the pairing order is consistent across files.
- `txM_report` - Generates tables and plots with the number of reads mapping for all represented taxa.
- `txM_rescore` - Rescores alignments from edit to Levenshtein distance.
- `txM_sge` - Submits jobs specified in a custom format to an SGE cluster.
- `txM_summary` - Generates a summary of hits per taxon, including edit distance and coverage histograms.