Species Probes Generator - scripts to generate probe sequences for species identification.
The idea is you provide a set of genomes and their species names, a
taxonomy for the species, and sprog can make a set of sequence probes that
are unique to each node in the tree.
sprog was developed to generate Mycobacterium probes for mykrobe. However,
the method is not specific to mykrobe and can be used more generally for
any species.
The method relies heavily on Bifrost.
Unitigs are generated from a graph of all the input genomes. Then the
presence of these unitigs in each species is determined (where there is
more than one genome for a species, a unitig must be found in all the
genomes to be assigned to that species). This presence information is
combined with the taxonomy to produce a set of sequences that are unique to
each node in the tree, in FASTA format.
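The logic described above can be illustrated with a small sketch. It is not sprog's actual code: the function names, the dictionary-based inputs, and the tree representation (a mapping from each internal node to its children) are all hypothetical. It also assumes one particular reading of "unique to a node", namely that a unitig is carried by exactly the species below that node and by no others.

```python
from typing import Dict, List, Set


def species_unitigs(
    genome_unitigs: Dict[str, Set[str]], genome_species: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Keep, for each species, only the unitigs found in ALL of its genomes."""
    per_species: Dict[str, Set[str]] = {}
    for genome, unitigs in genome_unitigs.items():
        species = genome_species[genome]
        if species in per_species:
            per_species[species] &= unitigs  # intersect with earlier genomes
        else:
            per_species[species] = set(unitigs)
    return per_species


def leaves_under(tree: Dict[str, List[str]], node: str) -> Set[str]:
    """Return the species (leaf names) below a node; a leaf returns itself."""
    children = tree.get(node, [])
    if not children:
        return {node}
    leaves: Set[str] = set()
    for child in children:
        leaves |= leaves_under(tree, child)
    return leaves


def node_probes(
    tree: Dict[str, List[str]], root: str, per_species: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Assumed rule: a unitig belongs to a node if the set of species carrying
    it is exactly the set of species below that node."""
    carriers: Dict[str, Set[str]] = {}
    for species, unitigs in per_species.items():
        for unitig in unitigs:
            carriers.setdefault(unitig, set()).add(species)

    probes: Dict[str, Set[str]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        leaves = leaves_under(tree, node)
        probes[node] = {u for u, sp in carriers.items() if sp == leaves}
        stack.extend(tree.get(node, []))
    return probes
```

In this sketch a unitig shared by two sister species ends up as a probe for their parent node rather than for either species, which is one way the taxonomy and the presence information could be combined.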
sprog is written in Python. It assumes Bifrost is installed and in your
$PATH.
Install sprog by taking a copy of this repo and running:
python3 -m pip install .
Optional dependencies:
- datasets - this is the command line tool from NCBI. You can either provide
  sprog with your own FASTA files, in which case datasets is not required.
  Alternatively, provide genbank or assembly accessions. datasets is used to
  download assemblies (genbank sequences are downloaded using wget)
- mash
- fastANI
If mash and/or fastANI are installed, then you can use the sprog commands
mash_all/fastani_all to run all genomes vs all genomes using these tools.
This is not required to build probes, but may be useful for sanity checking.
It can help identify species where a genome has a better match to a
different species.
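As an illustration of that sanity check, the sketch below flags genomes whose closest match belongs to a different species. It does not read the output of mash_all or fastani_all; the input here is a hypothetical dictionary of pairwise distances (smaller means more similar) that you would need to build yourself, and the function name is also made up for this example.

```python
from typing import Dict, Tuple


def mismatched_genomes(
    distances: Dict[Tuple[str, str], float], genome_species: Dict[str, str]
) -> Dict[str, str]:
    """Return {genome: closer_genome_from_another_species}.

    A genome is flagged only if it has at least one same-species partner and
    some genome from a different species is closer than its best partner.
    """
    best_same: Dict[str, Tuple[str, float]] = {}
    best_other: Dict[str, Tuple[str, float]] = {}
    for (a, b), dist in distances.items():
        # Each pair is symmetric, so record it from both points of view.
        for query, hit in ((a, b), (b, a)):
            if query == hit:
                continue
            same = genome_species[query] == genome_species[hit]
            best = best_same if same else best_other
            if query not in best or dist < best[query][1]:
                best[query] = (hit, dist)

    flagged: Dict[str, str] = {}
    for genome, (hit, dist) in best_other.items():
        if genome in best_same and dist < best_same[genome][1]:
            flagged[genome] = hit
    return flagged
```

Species represented by a single genome are never flagged by this sketch, because there is no same-species distance to compare against.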
This section is under construction.
For generating mykrobe probes, please see the section below.
The following commands can be used to make mykrobe species probes for
use with mykrobe predict and species tb. It is split over several
scripts because the memory and multi-cpu usage varies significantly
between stages.
Download data from GTDB (<100M RAM, a few minutes run time)
sprog mykrobe_from_gtdb GTDB_data
Download genomes (<100M RAM, approx 30 minutes to run depending on bandwidth)
sprog get_data Samples GTDB_data/manifest.tsv
Build graph of all samples and extract unitigs (17GB RAM, 1h wall clock)
sprog build_all_samples_unitigs --threads 30 Samples
Assign unitigs to each species (3GB RAM, 6h wall clock)
sprog per_sample_presence --threads 30 Samples
Make mykrobe probes (3GB RAM, 20m run time)
sprog make_probes --mykrobe_tb Samples GTDB_data/tree.tsv
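If you have a single machine with enough memory for the largest stage, the five commands above could be chained from a small Python driver. This is only a convenience sketch: the function names and the dry_run option are hypothetical and not part of sprog, while the command lines themselves are copied from the steps above.

```python
import subprocess
from typing import List


def pipeline_commands(
    gtdb_dir: str = "GTDB_data", samples_dir: str = "Samples", threads: int = 30
) -> List[List[str]]:
    """Build the five sprog command lines, in the order listed above."""
    return [
        ["sprog", "mykrobe_from_gtdb", gtdb_dir],
        ["sprog", "get_data", samples_dir, f"{gtdb_dir}/manifest.tsv"],
        ["sprog", "build_all_samples_unitigs", "--threads", str(threads), samples_dir],
        ["sprog", "per_sample_presence", "--threads", str(threads), samples_dir],
        ["sprog", "make_probes", "--mykrobe_tb", samples_dir, f"{gtdb_dir}/tree.tsv"],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; run it only when dry_run is False.

    check=True stops the pipeline at the first stage that fails.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run_pipeline(pipeline_commands()) prints the commands without running them; pass dry_run=False to execute. On a cluster it is probably better to submit each command as its own job, so that each stage can request the memory and CPUs noted above.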