skani basic usage guide

Jim Shaw edited this page May 9, 2023 · 12 revisions

skani sketch - storing sketches/indices on disk

# sketch genomes, output in sketch_folder, 20 threads
skani sketch genome1.fa genome2.fa ... -o sketch_folder -t 20

# use sketch file for computation
skani dist sketch_folder/genome1.fa.sketch sketch_folder/genome2.fa.sketch

sketch computes the sketch (a.k.a index) of a genome and stores it in a new folder. For each file genome.fa, the new file sketch_folder/genome.fa.sketch is created.
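The naming rule above can be sketched as a small bash helper, which is handy when a script needs to pass the matching .sketch file to a later command. The function name sketch_path is hypothetical, and the helper assumes that only the file name of the genome (not its directory) is kept inside the output folder.

```shell
# Hypothetical helper (not part of skani): print the sketch path expected
# for a genome file, following the rule genome.fa -> sketch_folder/genome.fa.sketch.
# Assumption: only the file name is kept, not the directory of the genome.
sketch_path() {
    local folder="$1" genome="$2"
    printf '%s/%s.sketch\n' "$folder" "$(basename "$genome")"
}

sketch_path sketch_folder genome1.fa   # prints sketch_folder/genome1.fa.sketch
```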

The .sketch files can be used as faster drop-in substitutes for fasta files.

A special file markers.bin is also constructed and used specifically for the search command. Modifying the output folder may invalidate the search command.

skani dist - simple ANI calculation

# query each individual record in a multi-fasta (--qi for query, --ri for reference)
skani dist --qi -q query1.fa -r ref1.fa

# use lists of fastas, one line per fasta
skani dist --rl ref_list.txt --ql query_list.txt

# estimate confidence interval
skani dist --ci sketch_folder/genome1.fa.sketch sketch_folder/genome2.fa.sketch

# turn off the ANI debiasing step
skani dist query1.fa ref1.fa --no-learned-ani

# only compare small contigs with >= 90% estimated ANI 
# use -m 300 for better filtering on small contigs
skani dist --qi query.fa --ri ref.fa -s 90 -m 300

dist computes ANI between all queries and all references. dist loads all reference and query genomes into memory. If you're searching against a database, search can use much less memory (see below). With default settings, the entire GTDB-R207 database (65000 bacterial genomes) takes about 95 GB of RAM with dist.
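The list files given to --rl and --ql hold one fasta path per line. A minimal sketch of building such a file with standard tools is shown here; the function name make_list and the directory names refs and queries are hypothetical, and the .fa extension is an assumption about how your files are named.

```shell
# Hypothetical helper (not part of skani): write every *.fa file under DIR
# to OUT, one path per line, in a stable sorted order.
make_list() {
    local dir="$1" out="$2"
    find "$dir" -type f -name '*.fa' | LC_ALL=C sort > "$out"
}

# Example usage, assuming directories refs/ and queries/ exist:
#   make_list refs ref_list.txt
#   make_list queries query_list.txt
#   skani dist --rl ref_list.txt --ql query_list.txt
```

Paths are written exactly as find reports them, so run the helper from the directory in which you will later run skani, or pass absolute directory paths.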

If you want to do all-to-all comparisons on a single set of genomes g1.fa g2.fa g3.fa, use skani triangle g1.fa g2.fa g3.fa -E instead of skani dist -q g1.fa g2.fa g3.fa -r g1.fa g2.fa g3.fa; triangle is 2x faster.

All skani ANI calculations, including dist, turn on a trained ANI debiasing step to make ANI more accurate. See the advanced usage guide for information on optimally using skani when

  • genomes are small (-m parameter becomes important)
  • genomes are very fragmented (-c becomes important)
  • and more.

skani search - memory-efficient ANI database queries

# any algorithm options used in "sketch" will also be used for searching. 
skani sketch genome1.fa genome2.fa ... -o database

# query query1.fa, query2.fa, ... against the sketches in database
skani search -d database query1.fa query2.fa ...  -o output.txt

search is a memory-efficient method of calculating ANI against a large reference database. Searching against the GTDB database (> 65000 genomes) takes only 6 GB of memory using search. This is achieved by fully loading into memory only the genomes that pass a filter, and discarding the index after each query is done.

search takes its algorithm parameters from those used with sketch: if you sketch with, say, -c 60, then search will also use -c 60.

If you're querying many sequences, the file I/O step will dominate the running time, so consider using dist instead if you have enough RAM.

skani triangle - all-to-all ANI computation

# all-to-all ANI comparison in lower-triangular matrix
skani triangle genome1.fa genome2.fa genome3.fa -o lower_triangle_matrix.txt

# output sparse matrix a.k.a an edge list of comparisons
skani triangle -l list_of_genomes.txt -o sparse_matrix.txt --sparse 

# output square matrix
skani triangle genome1.fa genome2.fa genome3.fa --full-matrix 

triangle outputs a lower-triangular matrix in PHYLIP format. The ANI is output to stdout or the file specified. The aligned fraction is also output in a separate file with the suffix .af attached.
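For downstream scripts it can be convenient to flatten the lower-triangular matrix into one line per pair. The sketch below assumes the usual lower-triangular PHYLIP layout: a first line with the number of genomes, then one row per genome consisting of its name followed by the values against all earlier rows. Check this against your own output before relying on it. The function name tri_to_edges is hypothetical, and genome names containing whitespace are not handled.

```shell
# Hypothetical helper (not part of skani): convert a lower-triangular
# PHYLIP-style matrix into "row_name <TAB> earlier_name <TAB> value" lines.
# Assumed layout: line 1 is the genome count; each later line is a name
# followed by the values against all previous rows.
tri_to_edges() {
    awk 'NR == 1 { next }               # skip the leading genome count
         NF == 0 { next }               # skip blank lines
         {
             names[++n] = $1            # remember the label of this row
             for (j = 2; j <= NF; j++)  # field j pairs with row j - 1
                 printf "%s\t%s\t%s\n", $1, names[j - 1], $j
         }' "$1"
}

# Example usage on the file written by -o:
#   tri_to_edges lower_triangle_matrix.txt > pairs.tsv
```

For large inputs, the --sparse output of triangle already provides an edge list directly, so this conversion is mainly useful for existing matrix files.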

Unlike dist, which does n^2 computations when the same n genomes are used as both queries and references, triangle only does n(n-1)/2 computations, so it is more efficient. It also sets some smarter default parameters for all-to-all search.
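To make the difference between n^2 and n(n-1)/2 concrete, the following bash snippet computes both counts for a given number of genomes. The function name pair_counts is hypothetical and the snippet is only an arithmetic illustration; it does not call skani.

```shell
# Hypothetical helper: print the number of comparisons for N genomes,
# counting all ordered pairs (n^2) and unique unordered pairs (n(n-1)/2).
pair_counts() {
    local n="$1"
    echo "all_pairs=$(( n * n )) unique_pairs=$(( n * (n - 1) / 2 ))"
}

pair_counts 3       # prints all_pairs=9 unique_pairs=3
pair_counts 65000   # prints all_pairs=4225000000 unique_pairs=2112467500
```

For large n the ratio between the two counts approaches 2, which is consistent with triangle being about twice as fast as the equivalent dist call.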

For very large data sets, use -E or --sparse instead; this outputs only the non-zero entries, as an edge list.

triangle loads all genome indices into memory. For doing comparisons on massive data sets, see the Advanced section for suggestions on reducing memory cost.