This GitHub repository contains Snakemake pipelines for replicating the analysis of my master's thesis, "A performance comparison of tools to detect introgressed fragments". The analysis consists of simulating data with different parameters and then analyzing it with various tools. The pipelines were tested on the LiSC (Life Science Compute Cluster of the University of Vienna) and on a Linux system (Ubuntu 20.04.2 LTS).
First, install Anaconda or Miniconda. The following commands then create and activate the conda environment:
conda config --set safety_checks disabled
conda config --set channel_priority strict
conda env create -f environment.yml
conda activate introgression-kruisz
To download the required tools that cannot be installed through conda, the following commands can be used:
mkdir ext && cd ext
# Download SPrime and pipeline
mkdir SPrime && cd SPrime
git clone
chmod a+x sprimepipeline/pub.pipeline.pbs/tools/map_arch_genome/map_arch
cd ..
# Download SkovHMM
mkdir SkovHMM && cd SkovHMM
git clone
cd ..
It is also necessary to download the file ms.tar.gz from Hudson Lab. Next, decompress it under the ext folder and compile it with the following commands:
cd msdir
${CONDA_PREFIX}/bin/gcc -o ms ms.c streec.c rand1.c -lm
cd ../..
The tool sstar needs no separate installation: it is installed as a package in the conda environment created above.
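Before moving on, it can help to confirm that the manually installed tools ended up where the commands above put them. The following is a minimal sketch and not part of the repository: the function name `check_tools` is hypothetical, and the two paths are inferred from the download and compile steps above. The SkovHMM checkout is not checked, because its directory name is not given here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which manually installed tools are missing.
# $1 = directory that contains the ext folder (defaults to the current one).
check_tools() {
    local root="${1:-.}"
    local missing=0
    local path
    # Paths inferred from the installation commands above (assumption).
    for path in \
        "ext/msdir/ms" \
        "ext/SPrime/sprimepipeline/pub.pipeline.pbs/tools/map_arch_genome/map_arch"
    do
        if [ -x "${root}/${path}" ]; then
            echo "ok       ${path}"
        else
            echo "MISSING  ${path}"
            missing=$((missing + 1))
        fi
    done
    # Return non-zero if anything is missing so callers can stop early.
    [ "$missing" -eq 0 ]
}

# Example (run from the repository root):
#   check_tools . || echo "install the missing tools first"
```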
After installing the tools, it is recommended to test the pipelines locally with a dry run first. This can be done, for example, with the following command:
snakemake -s workflows/1src/simulation_proportion.snake -np
It is also recommended to run the individual pipelines one after the other. For the first part, the data simulations with the different parameters, the following commands can be used (-c specifies the number of cores to use):
snakemake -s workflows/1src/simulation_proportion.snake -c 1
snakemake -s workflows/1src/simulation_introtime.snake -c 1
snakemake -s workflows/1src/simulation_divtime.snake -c 1
For the second part, running the individual tools SPrime, SkovHMM and sstar on the simulated data, the following commands can be used (in any order):
snakemake -s workflows/1src/sprime_proportion.snake -c 1
snakemake -s workflows/1src/sprime_introtime.snake -c 1
snakemake -s workflows/1src/sprime_divtime.snake -c 1
snakemake -s workflows/1src/sstar_proportion.snake -c 1
snakemake -s workflows/1src/sstar_introtime.snake -c 1
snakemake -s workflows/1src/sstar_divtime.snake -c 1
The SkovHMM pipeline needs its own environment because it requires Python 2. Therefore, --use-conda must be added to the command:
snakemake -s workflows/1src/skovhmm_proportion.snake --use-conda -c 1
snakemake -s workflows/1src/skovhmm_introtime.snake --use-conda -c 1
snakemake -s workflows/1src/skovhmm_divtime.snake --use-conda -c 1
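Because the workflow files follow a regular naming scheme, the commands above can also be generated rather than typed one by one. The sketch below only prints the commands, simulations first and then the three tools, so that the list can be reviewed before anything runs. The function names `workflow_cmd` and `all_cmds` and the `CORES` variable are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: print the Snakemake commands from this README in a safe order.
# CORES is a hypothetical variable introduced here; it defaults to 1.
CORES="${CORES:-1}"

# Print the command for one workflow.
# $1 = stage, $2 = scenario, remaining arguments = extra flags.
workflow_cmd() {
    local stage="$1" scenario="$2"
    shift 2
    local cmd="snakemake -s workflows/1src/${stage}_${scenario}.snake"
    if [ "$#" -gt 0 ]; then
        cmd="${cmd} $*"
    fi
    echo "${cmd} -c ${CORES}"
}

# Print all commands: the simulations first, then the three tools.
all_cmds() {
    local scenario
    for scenario in proportion introtime divtime; do
        workflow_cmd simulation "$scenario"
    done
    for scenario in proportion introtime divtime; do
        workflow_cmd sprime "$scenario"
        workflow_cmd sstar "$scenario"
        # SkovHMM needs its own Python 2 environment, hence --use-conda.
        workflow_cmd skovhmm "$scenario" --use-conda
    done
}

all_cmds
```

One way to use it is to redirect the output into a file, check the file, and then run it with `bash -e` so that execution stops at the first failing pipeline.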
To submit the pipelines as jobs to a cluster, the following steps are recommended:
1. Create a profile for your cluster. The profile used here, for the SLURM Workload Manager, is located in config/slurm/config.yaml.
2. Create a folder named logs_slurm and then submit the jobs with, for example, the following commands (-j specifies the maximum number of jobs submitted to the cluster at the same time):
mkdir logs_slurm
snakemake -s workflows/1src/simulation_proportion.snake --profile config/slurm -j 200
Since the pipelines run for a long time, it is also recommended to start them with the nohup command. It makes the process ignore the hangup (HUP) signal, so the program keeps running even after you log out of the system. For example:
nohup snakemake -s workflows/1src/simulation_proportion.snake --profile config/slurm -j 200 &
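When several pipelines are started this way, it can be useful to give each one its own log file and to keep the process ID so that the run can be checked or stopped later. The following is a minimal sketch under that assumption; the function name `run_detached` and the log file name in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command with nohup, send its output to a log
# file, and print the PID of the background process.
# $1 = log file, remaining arguments = command to run.
run_detached() {
    local log="$1"
    shift
    # Create the log directory if it does not exist yet.
    mkdir -p "$(dirname "$log")"
    # Redirect stdout and stderr so nothing is written to nohup.out.
    nohup "$@" > "$log" 2>&1 &
    echo "$!"
}

# Example:
#   run_detached logs_slurm/simulation_proportion.log \
#       snakemake -s workflows/1src/simulation_proportion.snake --profile config/slurm -j 200
```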
There are also small pipelines that create the individual plots. They can be run with the following commands:
snakemake -s workflows/plots/plots_proportion.snake
snakemake -s workflows/plots/plots_introtime.snake
snakemake -s workflows/plots/plots_divtime.snake
This work was run with a storage capacity of 10 TB and a file limit of two million files. To stay within these limits, intermediate results had to be deleted along the way. The results of the data simulations must be kept until the very end, because the individual analyses depend on them. The intermediate results of each tool were therefore deleted once the respective analysis had finished, and only the final accuracy tables were kept. SPrime and SkovHMM need a high file limit, whereas sstar needs a large storage capacity.
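To keep an eye on both limits while the pipelines run, the file count and the disk usage of an output directory can be checked from time to time. The sketch below uses only `find`, `wc`, `du` and `cut`; the function name `report_usage` is hypothetical, and `results` in the example is a placeholder for whatever directory the pipelines actually write to. It counts regular files only, so a quota that also counts directories will report a somewhat higher number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of regular files and the disk usage
# of a directory, to compare against the cluster's file and storage limits.
# $1 = directory to inspect.
report_usage() {
    local dir="$1"
    local files size
    # Count regular files below the directory.
    files="$(find "$dir" -type f | wc -l | tr -d ' ')"
    # Human-readable total size; du prints "<size><TAB><path>".
    size="$(du -sh "$dir" | cut -f1)"
    echo "${dir}: ${files} files, ${size}"
}

# Example (placeholder directory name):
#   report_usage results
```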