focusDB is a package built for the construction of species-specific, high-resolution 16S rDNA databases. It does so with through the use of riboSeed, a pipeline for the use of ribosomal flanking regions to improve bacterial genome assembly.
focusDB sets up a directory in your home folder to store downloaded SRAs. This makes rerunning easier, as all the raw data is centralized for reuse. This keeps the data transfer load to a minimum. A manifest file keeps track of the runs already downloaded.
Given an organism, focusDB does the following:
- Identify all the whole-genome seqeuncing SRAs available for that species
- Download all potential complete reference genomes.
For each SRA:
- use plentyofbugs
to identify which reference genome would be the closest
- uses the mini assembly from of plentyofbugs
to call taxonomy via kraken2
- run QC on reads, downsampling if neccessary
- run riboSeed assembly
Extract all the 16S seqeunces from reassembly
focusDB available via pypi. We recommend installing within a python environment.
conda create --name focusDBenv python=3.5 seqtk sickle-trim sra-tools riboseed mash skesa barrnap iqtree mafft kraken2 fastp
conda activate focusDBenv
pip install focusDB
Optionally, to use the trimming alignment feature, TrimAl must be installed from github https://github.com/scapella/trimal. For re-generating test data, ART read simulator must also be installed.
This will go through the process of setting up the .focusDB
dir (defaults to your home directory), downloading up to 30 reference complete genomes, downloading up to 5 WGS SRAs, finding the closes referece for each of the 5 SRAs, assembling, and extracting the 16S sequences. Say we have a 4 core computer with 16 gb ram, we spilt it so the assemblies each use half the resources.
focusDB --output_dir ./focusdb_ecoli/ --n_SRAs 5 --n_references 30 --memory 8 --cores 2 --njobs 2 --organism_name "Escherichia coli"
# build E. coli specific DB from E colis in Silva and our new seqeunces
combine-focusdb-and-silva -d ~/Downloads/SILVA_132_SSUParc_tax_silva.fasta -o ecolidb.fasta -n "Escherichia coli" -S ./focusdb_ecoli/ribo16s.fasta
# Align sequences and trim the alignment
align-and-trim-focusdb -i ecolidb.fasta --out_prefix aligned_ecolidb
# Calculate the per-column shannon entropy of the trimmed alignment.
calculate-shannon-entropy calculate-shannon-entropy.py -i aligned_ecolidb.mafft.trimmed > ecoli_entropy
[--organism_name]: The species of interest, input within quotes.
[--n_references]: Maximum number of reference genomes the user wishes to download.
[--n_SRAs]: Maximum number of SRAs the user wishes to download.
[--sra_list]: Uses a user-given list of SRA accessions instead of obtaining SRA accessions from the pipeline.
[--version]: Returns focusDB version number.
[--approx_length]: Uses a user-given genome length as opposed to using reference genome length.
[--sraFind_path]: Path to pre-downloaded sraFind-All-biosample-with-SRA-hits.txt file.
[--prokaryotes]: Path to pre-downloaded prokaryotes.txt file.
[--get_all]: If one SRA has two runs, downloads both.
[--cores]: The number of cores the user would like to use for focusDB. Specifically, riboSeed and plentyofbugs can be optimized for thread usage.
[--memory]: As with [--cores], RAM can be optimized for focusDB.
[--maxcov]: The maximum read coverage for SRA assembly. Downsamples to this coverage if the coverage exceeds it.
[--example_reads]: Input of user-given reads.
[--subassembler]: Choice of mash or skesa for subassembly in riboSeed.
Python's multiprocessing does not play well with SGE: to run efficiently, use --sge
mode, provide the conda environment name with --sge_env
. FocusDB will write out a bash script for all the assemblies after processing all the reads. qsub
the script, and when it finishes, re-run focusDB with the same parameters, and it will finish processing the data
Use this script to combine silva and focusDB seqeunces for a given organism name.
usage: align-and-trim-focusdb [-h] -i INPUT -o OUT_PREFIX
Given a multiple sequence database (from combine-focusdb-and-silva, generate
an alignment with mafft and trim to median sequence. Requires mafft and TrimAl
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
multifasta input file
-o OUT_PREFIX, --out_prefix OUT_PREFIX
prefix for msa and trimmed msa
usage: calculate-shannon-entropy [-h] -i INPUT
Given a trimmed multiple sequence alignment (from align-and-trim-focusdb,
calculate shannon entropy
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
trimmed MSA
Reassembling hundreds or thousands of genomes can eat into a ton of resources, so we try to make this a feasible. For optimal running
- install Aspera connect. NCBI uses this to transfer data faster than the default connections with http or ftp.
- set up your cache somewhere with lots of space and fast I/O. run
vdb-config -i
, following the instructions here: https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration - Use the focusDB-prefetch command to get your data. This will download the data as
.sra
, and add them to your cache. This can be configured to run the requests in batches. - Run focusDB: the calls to
fasterq-dump
will now go to the local copy of the.sra
in the cache area.
Testing is done with the nose
package. Generate the test data with
nosetests pyfocusDB/generator.py
and run the unit tests with
nosetests pyfocusDB/ -v
Note that generator.py
requires ART to generate synthetic.
{https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm}
If you get a failure running riboSeed about dependencies not installed:["numpy"]
, try running python -c "import numpy as np"
. If you get an error about multiple versions being installed, repeat pip uninstall numpy
repeatedly until no more versions remain in your environment. Then, reinstall numpy and try again.
If you get an error about openblas, try upgrading the one chosen by conda with:
conda install openblas=0.2.19
The default behavior for identifying organisms of interest from NCBI's prokaryotes.txt
is to find lines starting with --organism_name
. This is intentional, as the names are poorly defined in the file, and this allows us to capture a whole genus if desired. We have never come across a case where this happens, but this could have the consequence of including undesired organisms, if they start with the same characters. For instance, an --organism_name
of dog
would also match dogfish
. If you notice undesired organisms included in the log file in our output directory, you will have to manually select the lines of interest from prokaryotes.txt
, save that as an alterative file, and use the --prokaryotes
argument to provide this edited version to focusDB.