-
Notifications
You must be signed in to change notification settings - Fork 10
Tutorial: setting up the GTDB genome database to search against
The following tutorial shows how to search genomes with skani against a large database (> 85,000 genomes) in seconds.
You can download a preprocessed version of the GTDB-R214 database for searching: the .tar.gz file is available here.
wget https://storage.googleapis.com/skani_files/skani-gtdb-r214-sketch-v0.2.tar.gz
tar -zxvf skani-gtdb-r214-sketch-v0.2.tar.gz
skani search my_genome.fa -d skani-gtdb-r214-sketch-v0.2 -o results.tsv
The below tutorial will be faster if you already have the GTDB database available.
Note
GTDB-R220 is a new version of the GTDB database that supersedes GTDB-R214. This tutorial still uses GTDB-R214, but you can replace it with GTDB-R220 if desired.
Requirements
- skani is installed and in PATH (i.e. typing
skani -h
works). See skani's README for how to install skani. - A decent internet connection; we'll be downloading a database of > 70 GB.
- Around ~120 GB total disk space in total for the database and indexing.
What we will do
- set up the GTDB representative database, a large database with > 85,000 genomes, from scratch
- show how skani can search genomes/MAGs against the gtdb database in seconds
The database is > 70 GB compressed, so this may take a while to download and unzip.
# version R214
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
tar -xf gtdb_genomes_reps_r214.tar.gz
# new version if desired
# wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/genomic_files_reps/gtdb_genomes_reps_r220.tar.gz
# tar -xf gtdb_genomes_reps_r220.tar.gz
The gtdb database is formatted in a special way. In order to process the reference genome files inside the gtdb folder, we have
to do a bit of work. We can run the following to collect all genomes locations into a file called gtdb_file_names.txt
.
find gtdb_genomes_reps_r214/ | grep .fna > gtdb_file_names.txt
The above command puts all reference genomes into a file, with each line containing one reference genome file.
To construct the database called gtdb_skani_database_ani
, use
skani sketch -l gtdb_file_names.txt -o gtdb_skani_database_ani -t 20
where I am using -t 20
for 20 threads, but feel free to use less if your workstation is limited.
The database is now available in the folder gtdb_skani_database_ani
. Let's try searching against it.
First, we'll download an E. coli K12 genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
Now we can simply search against the database using
skani search GCF_000005845.2_ASM584v2_genomic.fna.gz -d gtdb_skani_database_ani
and you should see the following output (numbers differ slightly depending on the version of skani used or version of gtdb)
Ref_file Query_file ANI Align_fraction_query Align_fraction_reference Ref_name Query_name
gtdb_genomes_reps_r214/database/GCF/003/697/165/GCF_003697165.2_genomic.fna.gz GCF_000005845.2_ASM584v2_genomic.fna.gz 96.92 77.63 84.20 NZ_CP033092.2 Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 chromosome, complete genome NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
...
The output is sorted by the closest ANI reference to the query, so the closest E. coli genome in the database is given by the first line. Each column of the output gives some information. The name of the reference organism is given by the second last column, which is E. coli DSM 30083. The columns
ANI Align_fraction_query Align_fraction_reference
96.92 77.63 84.20
tell us that the ANI between these two genomes are 96.92%, the query E. coli had 77.63% of its genome covered by alignments, the reference genome had 84.20% of its genome covered by alignments. The exact numbers may differ depending on which version of skani is used.
This only took ~6 seconds for my system and < 6 GB of memory:
Command being timed: "skani search GCF_000005845.2_ASM584v2_genomic.fna.gz -d gtdb_skani_database_ani"
User time (seconds): 6.98
System time (seconds): 1.70
Percent of CPU this job got: 136%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.33
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5894392
and that's it! See Tutorial: classifying entire assemblies (MAGs or contigs) against 85,000 genomes in under 2 minutes for a second tutorial on how to use skani for classifying MAGs and contigs from assemblies.