Scripts to sketch KEGG and explore using FracMinHash as a way to functionally annotate a metagenome.
The requirements.txt
file contains the necessary packages required to run the code in this repo.
You can install it via:
conda create -y --name KEGG_env python=3.8
conda install -y --name KEGG_env -c conda-forge -c bioconda --file requirements.txt
conda activate KEGG_env
The data this repo uses is currently on the lab GPU server, under /data/shared_data/
.
This was extracted with this repo.
Sketch the AA and NT sequences, compare the following approaches:
- AA (vary the k-mer size and sketch size) with
sourmash gather
- NT (vary the k-mer size and sketch size) with
sourmash gather
- AA/NT prefetch, then alignment
- Cmash/mash screen approach: sketch the database, stream the query over it looking for all matches
Using the following query data (in increasing level of realism):
- Simulated subsequences of the AA/NT database
- Simulated metagenomes that have AA/NT database sequences spiked in
- Full simulated metagenomes (AA+prodigal/AA+6-frame-translation/NT)
- Mock metagenomes
- Real metagenomes that have been extensively analyzed previously
src
contains the module with various helper functions, wrappers to sourmash, etc.
scripts
contains the scripts that perform various tasks (sketching, simulation, etc.)
utils
contains third party tools (in this case, bbtools) for things like making simulations.
test_data
contains a small portion of real KEGG data for rapid testing and development.
make_sketches.py
This will take the FAA and FNA files and create sketches/signatures for each one.simulate_metagenome.py
simulates a metagenome when provided with a reference database. This will need some work since currently bbtools uses the whole database to simulate a metagenome (i.e. coverage is very low), so we'll need to do some down-sampling. This will also create a ground-truth functional profile (though that needs to be checked for accuracy).classify_and_report.py
is intended to run sourmash gather, parse the results, and then compare this to the ground truth.
Running the following will print out binary classification results:
./classify_and_report.py -r ../test_data/input/kegg_genes_KO.fna -m ../test_data/output/test_simulation.fq -o ../test_data/output/ -k 21 -t 100 -n 1000 --ref_scale_size 100 --query_scale_size 100
which will result in something like:
=======
{'TP': 97, 'FP': 0, 'FN': 3, 'precision': 1.0, 'recall': 0.97, 'F1': 0.9847715736040609}
DIAMOND is an alignment-based functional annotation software package that is widely used. Let's compare the sourmash/FracMinHash approach to DIAMOND.
Updated from here:
# downloading the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.0.15/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
# creating a diamond-formatted database file
./diamond makedb --in reference.fasta -d reference
# running a search in blastp mode
./diamond blastp -d reference -q queries.fasta -o matches.tsv
# running a search in blastx mode
./diamond blastx -d reference -q reads.fasta -o matches.tsv
# downloading and using a BLAST database
update_blastdb.pl --decompress --blastdb_version 5 swissprot
./diamond prepdb -d swissprot
./diamond blastp -d swissprot -q queries.fasta -o matches.tsv
After making the simulation, run via:
./classify_and_report_diamond.py -r ../test_data/input/kegg_genes_KO.faa -m ../test_data/output/test_simulation.fq -o ../test_data/output/
And you should get a result like
{'TP': 98, 'FP': 110, 'FN': 2, 'precision': 0.47115384615384615, 'recall': 0.98, 'F1': 0.6363636363636364, 'Percent correct alignments': 0.8630737190242755, 'Total number of alignments': 1156228, 'Total number of sequences': 1000000.0}
Following the guide here (thanks Shaopeng!) to set things up.
https://github.com/biobakery/shortbred
c3db5216001d872ba169b89ab0d16cccf774dd78