Skmer is a fast tool for estimating distances between genomes from low-coverage sequencing reads (genome-skims), without needing any assembly or alignment step. The paper where we have described the methods and tested Skmer on simulated short reads and SRA's from previous sequencing experiments is available online (open access):
And the paper where we have described the procedure for estimating branch support for phylogenies generated using Skmer is here:
Skmer is a command-line tool implemented in python. It runs Jellyfish and Mash internally to efficiently compute k-mer profile of genome-skims and their intersection, and estimates the genomic distances by correcting for the effect of low coverage and sequencing error. Skmer also depends on seqtk for some FASTQ/A processings. After following the installation steps and going over the usage guide, check out these awesome tutorials by Siavash Mirarab to learn more about using Skmer and other tools we have developed for genome skimming.
On 64-bit Linux and Mac OSX, you can install Skmer from bioconda channel using conda package manager.
- Install Miniconda (you can skip this if you already have either of Miniconda or Anaconda installed).
- Add the bioconda channel by running the following commands in your terminal (order matters):
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
- Run the following command to install Skmer (and all dependencies)
conda install skmer
Alternatively, and for all other OS and architectures, you can download the github repository and install Skmer using the setup script.
- You need to have python 2.7 or later installed
- Install Jellyfish (v2.2.6 or later), Mash (v2.3 or later), and seqtk (v1.3), and add the path to
their binary to the system path (so you can, e.g., run
jellyfish --version
,mash --version
, andseqtk
successfully in the terminal). - Clone the github repository by running (or you can download the repo)
git clone https://github.com/shahab-sarmashghi/Skmer.git
- Change to the Skmer directory and run
python setup.py install
Skmer has five sub-commands:
Gets the path to a directory of FASTQ/FASTA files (one uncompressed .fastq/.fq/.fa/.fna/.fasta file per each sample) and creates a reference library containing the estimates of sequencing parameters as well as the Mash sketch for each sample. If you have paired-end sequencing reads, we suggest using tools such as BBMerge to merge overlapping read pairs. If the input is an assembled sequence (determined by the length of sequences) the correction for low coverage and sequencing error is not applied to that sample. All corrected pairwise genomic distances are then estimated and written to a file. For a test run, change the directory to data
under your Skmer installation directory, and run
skmer reference ref_dir -p 4
The genome-skims and assemblies in ref_dir
directory are processed (using 4 cores in parallel), and a reference library
is created in the working directory. You can specify a custom name (and so its path) for your library using -l
option
skmer reference ref_dir -l custom_library_name
Default k-mer size is set to 31
which is the maximum length allowed by Mash, and can be changed using -k
option. We do not recommend using k-mers smaller than ~21
, as k-mers without any shared evolutionary history start to seem similar just out of random chance. The sketch size can also be changed using -s
option from its default value 10000000
. Decreasing the sketch size will reduce the size of library on disk, but also compromises the accuracy of distance estimation. The corrected pairwise distances are estimated and written to the file ref-dist-mat.txt
in the working directory by default. The output prefix can be changed using -o
option
skmer reference ref_dir -o output_prefix
If distances are going to be used to build phylogenies, it is recommended to use -t
flag. In this case, the estimated distances are transformed to the phylogenetic distances using the Jukes-Cantor model of substitution. Run skmer reference -h
for the complete list of arguments and options.
Computes all pairwise distances for a processed library. The main usage is to compute distances when combining already processed libraries, otherwise reference
command outputs distances as well when the input files are processed. For example, in data
directory, assuming that you have already run reference
and compiled the reference library
, try
skmer distance library -t -o jc-dist-mat
The distances between all samples in the library
are computed, and after applying Jukes-Cantor transformation, the mutation rates are written to jc-dist-mat.txt
, which can be later used to build a phylogeny based on distances. To see the help message, run skmer distance -h
.
Processes a query genome-skim or assembly, and outputs the sorted list of reference samples based on their distance to the query. Optionally, the query can be added to the reference library. To test its function, assuming that you have already run reference
and compiled the reference library
, in data
directory run
skmer query qry.fastq library
The sorted list of reference species and their distances from the query is written to dist-qry.txt
. You can change the output prefix from dist
to something else using -o
option
skmer query qry.fastq library -o output_prefix
If you want to add the processed query to the reference library and include it as a reference for future comparisons, use -a
flag. To see the complete list of inputs and options, run skmer query -h
.
Gets the path to a directory of FASTQ/FASTA files (one uncompressed .fastq/.fq/.fa/.fna/.fasta file per each sample) and performs subsampling procedure. Function will create subsample
directory containing replicate subfolders rep0
, rep1
etc.
skmer subsample ref_dir
A number of additional paramters can be specified. -S
allows to provide custom seed that will be used to generate a list of seeds for each subreplicate (default is 42). With option -sub
the user can define directory of output for subsample replicates (default is working_directory/subsample
). -b
option can be used to indicate subreplicate count (by default value is set to 100). -i
allows to specify index of the first replicate (default is 0).
skmer subsample -b 100 ref_dir -s 100000 -S 42 -p 24 -t -i 0
To see the complete list of inputs and options, run skmer subsample -h
.
- Job parallelization: Combinations of
-b
and-i
allows a flexible job parallelization. For example, to test large dataset user can run subsampling in chunks by specifying-b 10 -i 0 -S 14500
(generates 10 subreplicates starting at index 0 such as first repository is rep0 and others are rep1, rep2 ... rep9),-b 10 -i 10 -S 13800
(generates 10 replicates starting at index 10 such as subrepicates are rep10, rep11 ... rep19). Here we note that since internally Skmer uses default seed 42 when subsampling job is split variable seed is needed to be specified otherwise replicates will come out the same. - Sample uniformity: At the moment subsample only works for cases where all samples are either assemblies or sequencing reads. If there is a need to run a combination the user can simulate sequencing reads from assemblies. We recommnd tool such as ART for this purpose.
Performs correction of subsampled distance matrices obtained for reference genome-skims or assemblies. Since distance matrices are precomputed this step is fast. Output is this command is a set of corrected distance matrices for main estimate and subreplicates.
Input main distance matrix can have any filename. After correction new matrix file will be created with the name appended with the suffix _cor_
. We note that main distance matrix remains unchanged and correction only involves rounding of the values up to 12 significant digits to ensure that output is compatible with downstream tools like FastMe.
Correction algorithm looks for dimtrx_rep.txt
file in each subreplicate directory. For every subreplicate matrices with both types of correction are generated. Corrected distance matrices are appended with suffixes _cor
and _cor_cons
which corresponds to main and consensus correction respectively.
skmer correct -main jc-dist-mat -sub subsample_dir
-main
option takes as an input distance matrix file for main estimate before subsampling. This should be computed using standard reference
command. -sub
is used to specify location of subsample
directory. These options have no default settings.
We suggest the following workflow to obtain Skmer distance matrices for sequencing reads or assemblies.
- To obtain main estimate distance matrix before subsampling:
skmer reference ref_dir -s 100000 -S 42 -p 24 -t -o dimtrx_main
- To generate subreplicates:
skmer subsample -b 100 ref_dir -s 100000 -S 42 -p 24 -t -i 0
- To correct estimates:
skmer correct -main path_to_file/dimtrx_main.txt -sub path_to_directory/subsample
In order to be compatible with downstream software, standard square distance matrices should be converted into PHYLIP format.
Example of distance matrix in standard square form
sample Alpha Beta Gamma Delta Epsilon
Alpha 0.000 1.000 2.000 3.000 3.000
Beta 1.000 0.000 2.000 3.000 3.000
Gamma 2.000 2.000 0.000 3.000 3.000
Delta 3.000 3.000 3.000 0.000 1.000
Epsilon 3.000 3.000 3.000 1.000 0.000
Example of distance matrix in PHYLIP format
5
Alpha 0.000 1.000 2.000 3.000 3.000
Beta 1.000 0.000 2.000 3.000 3.000
Gamma 2.000 2.000 0.000 3.000 3.000
Delta 3.000 3.000 3.000 0.000 1.000
Epsilon 3.000 3.000 3.000 1.000 0.000
One way to perform the formatting is to run our custom script that is available in data folder:
bash tsv_to_phymat.sh dimtrx_original.txt dimtrx_reformatted.txt
FastME can be used to infer phylogenies from reformatted distance matrices. With default parameters the program can be launched:
fastme -i input_data_file -o output_tree_file
In our case input_data_file contains reformatted distance matrix and output_tree_file contains computed backbone tree.
To concatenate output phylogenies for subreplicates the user can run the following commands:
For main correction
cat subsample/rep*/dimtrx_rep_cor.txt.tre > bootstrap.All
For consensus
cat subsample/rep*/dimtrx_rep_cor_cons.txt.tre > bootstrap.All_consensus
To generate phylogeny with support values we suggest to use RAxML.
To compute phylogeny using main correction the user can draw bipartition information on a tree provided with -t (e.g., main Skmer estimate) based on multiple trees (e.g., from subsampled replicates) in a file specified by -z using command:
raxmlHPC -f b -m GTRCAT -z bootstrap.All -t dimtrx_main_cor_.txt.tre -n BS_TREE_MAIN
To compute extended majority rule consensus tree with "-J MRE" out of the subsampled replicates the user can run:
raxmlHPC -J MRE -z bootstrap.All_consensus -p 4424 -m GTRCAT -n BS_TREE_CONS
Note: some visualization tools, for instance FigTree, are not compatible with the consensus output. If there is a need to make it work, phylogeny file can be reformatted using:
sed -E 's/([:][0-9]+[.][0-9]+)[[]([0-9]+)[]]/\2\1/g' RAxML_MajorityRuleExtendedConsensusTree.BS_TREE_CONS > RAxML_MajorityRuleExtendedConsensusTree.BS_TREE_CONS_fixed
Check out the following tutorials for more on how to use Skmer and other tools we have developed for different genome skimming applications.