A 5-minute explainer of the algorithm (redirects to YouTube)
Taxometer is a tool that improves the taxonomic annotations produced by any taxonomic classifier for a set of contigs. Taxometer does not use a database or map contigs to reference sequences. Instead, the underlying neural network uses tetranucleotide frequencies and contig abundances from a multi-sample metagenomic experiment to identify contigs of the same origin, and then uses this information to complete and refine the taxonomic annotations provided by the classifier.
For more explanation, motivation and benchmarks see the Taxometer article in Nature Communications https://www.nature.com/articles/s41467-024-52771-y.
Any questions, bug reports or feature requests are most welcome.
The code is part of the VAMB codebase and should be installed by cloning the Taxometer release branch:
git clone https://github.com/RasmussenLab/vamb -b taxometer_release
cd vamb
pip install -e .
Please note that this project requires Python version 3.9.0 or higher, but less than 3.12.
In the future, Taxometer should become part of the VAMB PyPI distribution. Updates will appear here and in the main VAMB README.
The basic use case of refining MMseqs2 annotations is the following:
vamb taxometer --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam --taxonomy mmseqs_taxonomy.tsv
Instructions for preprocessing the FASTA and BAM files can be found here: https://github.com/RasmussenLab/vamb#how-to-run-using-your-own-input-files
Some important properties of the input files:
- BAM files must be sorted by coordinate. You can sort BAM files with samtools sort.
- For the command above, the default taxonomy format follows MMseqs2 output: the file is tab-separated and has no header, the 1st column contains contig identifiers and the 9th column contains the taxonomy annotations. Taxonomy annotations are text labels (e.g. "s__Alteromonas hispanica" or "Gammaproteobacteria") on all levels, sorted from domain to species and concatenated with ";". In the paper we benchmarked annotations that use GTDB and NCBI identifiers.
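To make the default layout concrete, here is a minimal Python sketch that pulls the contig identifier (1st column) and the taxonomy string (9th column) out of a headerless, tab-separated file. The function name read_mmseqs_taxonomy is hypothetical and not part of VAMB or Taxometer, and the "-" values in the example row are placeholders for columns that are not described here.

```python
import csv
import io


def read_mmseqs_taxonomy(handle, contig_col=0, taxonomy_col=8):
    """Return {contig_id: [labels from domain downwards]} from a headerless TSV.

    Hypothetical helper for illustration; the column positions follow the
    description above (1st column = contig, 9th column = taxonomy).
    """
    lineages = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= taxonomy_col:
            continue  # skip empty or truncated rows
        labels = [label for label in row[taxonomy_col].split(";") if label]
        lineages[row[contig_col]] = labels
    return lineages


# Made-up row: only the 1st and 9th columns carry meaning in this sketch.
example = "S18C13\t-\t-\t-\t-\t-\t-\t-\tBacteria;Bacillota;Clostridia\n"
print(read_mmseqs_taxonomy(io.StringIO(example)))
```

For a real file, pass an open file handle (for example open(path, newline="")) instead of the in-memory example.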
The taxonomic labels do not have to correspond to a particular version of any database; the only requirement is that the taxonomy levels are consistent within the dataset. E.g. if one contig has the taxonomic annotation
Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Roseburia;Roseburia hominis
then other contig annotations can be missing lower taxonomic levels, like
Bacteria;Bacillota;Clostridia
Bacteria;Bacillota;Bacilli;Bacillales
Bacteria;Pseudomonadota;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter sp. TTH0-4
but no annotation should violate the order that starts with the domain. For example, the following is incorrect:
Clostridia;Eubacteriales;Lachnospiraceae;Roseburia;Roseburia hominis
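One way to catch such ordering problems before running Taxometer is to check whether any label appears at more than one depth across the dataset. The sketch below is a heuristic written for illustration, not a check that Taxometer itself performs; find_level_conflicts is a hypothetical name, and a name that is legitimately reused at two ranks would also be flagged.

```python
from collections import defaultdict


def find_level_conflicts(lineages):
    """Return {label: sorted depths} for labels seen at more than one depth.

    Depth 0 is the first label in a lineage (expected to be the domain).
    """
    depths = defaultdict(set)
    for lineage in lineages:
        for depth, label in enumerate(lineage.split(";")):
            depths[label].add(depth)
    return {label: sorted(seen) for label, seen in depths.items() if len(seen) > 1}


annotations = [
    "Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Roseburia;Roseburia hominis",
    "Bacteria;Bacillota;Clostridia",
    # Incorrect: starts at the class instead of the domain.
    "Clostridia;Eubacteriales;Lachnospiraceae;Roseburia;Roseburia hominis",
]

for label, seen in find_level_conflicts(annotations).items():
    print(f"{label!r} appears at depths {seen}")
```

With the third annotation removed, the function returns an empty dictionary, because truncated lineages that still start at the domain keep every label at a single depth.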
In the paper, besides MMseqs2, we benchmarked the Centrifuge, Kraken2 and Metabuli tools. To use Taxometer with these classifiers, you need to convert their taxonomic annotations to the format described above. One way to do this is to use our tool Taxconverter https://github.com/RasmussenLab/taxconverter. Taxconverter accepts Centrifuge, Kraken2 or Metabuli output files and returns a Taxometer-compatible, MMseqs2-like taxonomy file.
The taxonomy file can also be formatted differently. It must have a column with contig identifiers and a column with taxonomic labels in the format described above. For example, suppose your another_taxonomy.csv file looks like this:
contigs,taxonomy
S18C13,Bacteria;Bacillota;Clostridia;Eubacteriales
S18C25,Bacteria;Pseudomonadota
S18C67,Bacteria;Bacillota;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus
In this case, add the following arguments:
vamb taxometer --outdir path/to/outdir \
--fasta /path/to/catalogue.fna.gz \
--bamfiles /path/to/bam/*.bam \
--taxonomy another_taxonomy.csv \
--column_contigs contigs \
--column_taxonomy taxonomy \
--delimiter_taxonomy ,
For more information about the different arguments, run vamb taxometer -h
The --outdir folder will contain the result_taxometer.csv file with the following structure:
contigs,lengths,predictions,scores
S0C93,6527,d_Bacteria;p_Myxococcota;c_Polyangia;o_Polyangiales;f_Polyangiaceae;g_Sorangium;s_Sorangium cellulosum B,1.0;1.0;1.0;1.0;1.0;1.0;0.9999716
S0C112,5109,d_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Burkholderiales;f_Burkholderiaceae;g_Massilia;s_Massilia yuzhufengensis,0.9999998;0.9999998;0.9999998;0.9999998;0.9999998;0.9999957;0.9722711
S0C133,2128,d_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Burkholderiales;f_Burkholderiaceae;g_Variovorax;s_Variovorax paradoxus,1.0;1.0;1.0;0.9999999;0.9999999;0.9999999;0.9999908
Columns:
- contigs - contig identifiers
- lengths - contig lengths
- predictions - taxonomic annotations predicted by Taxometer
- scores - assigned scores on each taxonomic level, in the range [0.5, 1]
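Because predictions and scores are both ";"-separated and aligned level by level, the scores can be used to trim each prediction to the levels you trust. The following Python sketch keeps the levels from the domain downwards until the first score below a cutoff. The helper name confident_prefix and the 0.99 threshold are assumptions chosen for illustration, not recommendations from the Taxometer authors.

```python
import csv
import io


def confident_prefix(predictions, scores, threshold=0.99):
    """Keep labels from the domain downwards while each score >= threshold."""
    labels = predictions.split(";")
    values = [float(score) for score in scores.split(";")]
    kept = []
    for label, value in zip(labels, values):
        if value < threshold:
            break  # stop at the first uncertain level
        kept.append(label)
    return ";".join(kept)


# One row copied from the example output above.
example = io.StringIO(
    "contigs,lengths,predictions,scores\n"
    "S0C112,5109,d_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;"
    "o_Burkholderiales;f_Burkholderiaceae;g_Massilia;s_Massilia yuzhufengensis,"
    "0.9999998;0.9999998;0.9999998;0.9999998;0.9999998;0.9999957;0.9722711\n"
)

for row in csv.DictReader(example):
    # The species score (0.9722711) is below 0.99, so the result stops at the genus.
    print(row["contigs"], confident_prefix(row["predictions"], row["scores"]))
```

To process a real run, replace the in-memory example with the result_taxometer.csv file opened from your output folder.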
Taxometer can be tested on the pre-processed data (~4800 contigs from one sample of the CAMI2 toy Oral dataset) provided as part of this repository. After installing Taxometer, run from the current folder:
vamb taxometer --outdir taxometer_test --composition test/data/taxometer_data/composition.npz --rpkm test/data/taxometer_data/abundance.npz --taxonomy test/data/taxometer_data/taxonomy_oral_sample0.tsv
Running this command will take a few minutes on a personal computer. A neural network will be trained from scratch, then inference will be performed and the predictions stored. GPU access is not required to run this example. The taxometer_test folder will be created. The predicted taxonomy labels are at taxometer_test/result_taxometer.csv. The example output is also at test/data/taxometer_data/result_taxometer_oral_sample0.csv, but since the network is trained on the fly, the output can vary slightly between runs.
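Since results differ slightly from run to run, an exact file comparison against the example output is not informative. A rough alternative is to measure the fraction of shared contigs that receive an identical predicted lineage. The sketch below assumes only the column names shown in the output structure; the function names and the two tiny in-memory "runs" are made up for illustration.

```python
import csv
import io


def load_predictions(handle):
    """Map contig identifier to predicted lineage from a result_taxometer.csv handle."""
    return {row["contigs"]: row["predictions"] for row in csv.DictReader(handle)}


def agreement(run_a, run_b):
    """Fraction of contigs present in both runs that have identical predictions."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    same = sum(run_a[contig] == run_b[contig] for contig in shared)
    return same / len(shared)


# Made-up data standing in for two result files.
header = "contigs,lengths,predictions,scores\n"
run_a = load_predictions(io.StringIO(
    header + "c1,1000,d_Bacteria;p_X,1.0;0.9\n" + "c2,2000,d_Bacteria;p_Y,1.0;0.8\n"
))
run_b = load_predictions(io.StringIO(
    header + "c1,1000,d_Bacteria;p_X,1.0;0.9\n" + "c2,2000,d_Bacteria;p_Z,1.0;0.7\n"
))
print(agreement(run_a, run_b))
```

To compare your run with the shipped example, open both CSV files and pass the handles to load_predictions in place of the in-memory strings.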
To cite Taxometer, use the following reference:
Kutuzova, S., Nielsen, M., Piera, P. et al. Taxometer: Improving taxonomic classification of metagenomics contigs. Nat Commun 15, 8357 (2024). https://doi.org/10.1038/s41467-024-52771-y
Benchmarked taxonomic classifiers: