The pipeline below runs algorithms developed by the Hartwig Medical Foundation (HMF) on tumor-normal WGS data sets. It performs somatic SNV calling with SAGE, somatic SV calling with GRIDSS (with filtering by GRIPSS), somatic copy number aberration calling with PURPLE and somatic SV interpretation with LINX. The pipeline works by submitting multiple jobs with dependencies, starting from BAM files aligned using BWA-MEM. You can read more about HMFtools at https://github.com/hartwigmedical/hmftools
For an example of the results generated using this pipeline, see our study on osteosarcoma evolution:
Espejo Valle-Inclán, De Noon et al. Cell, 2025. Link to the open access version of the article: https://www.cell.com/cell/fulltext/S0092-8674(24)01418-1
Create a conda environment with:
conda env create -f hmf.yml
(this step might take a while).
When the environment is installed, there is a small hack needed to fix a library dependency (circos is picky about the version but conda does not seem to realize). To get around it you need to find your conda environment and make a symlink:
conda activate hmf
condaBin=$(dirname $(which PURPLE))
ln -sf ${condaBin}/../lib/libwebp.so.7 ${condaBin}/../lib/libwebp.so.6
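If you prefer a guarded version of the symlink step, here is a minimal sketch. The helper name link_libwebp is made up for illustration; it only creates the link when libwebp.so.7 is actually present in the directory you pass in.

```shell
# Sketch: create the libwebp.so.6 -> libwebp.so.7 link only when the target exists.
# link_libwebp is a hypothetical helper; $1 is the lib directory of the conda environment.
link_libwebp() {
    if [ ! -e "$1/libwebp.so.7" ]; then
        echo "libwebp.so.7 not found in $1; nothing to link" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link with the same name
    ln -sf "$1/libwebp.so.7" "$1/libwebp.so.6"
}

# Usage inside the activated environment (same lookup as in the commands above):
#   condaBin=$(dirname "$(which PURPLE)")
#   link_libwebp "${condaBin}/../lib"
```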
You should load the conda environment first. Then, you will need to provide the path to the tumor and normal BAM files and an output directory:
conda activate hmf
sh run_HMF.sh -t path/to/tumorBam -n path/to/normalBam -o path/to/outputDir
All appropriate sub-directories will be created in outputDir. If needed, you can change memory, threads and many other parameters in an ini file. I think it's good practice to create an ini file per project, so you can go back to the exact parameters used. Copy hmf.ini to your preferred directory, modify what you need and then provide the path with:
-i path/to/iniFile
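One way to keep an ini file per project is sketched below. The helper name new_project_ini and the dated file name are my own choices for illustration, not something the pipeline requires.

```shell
# Sketch: copy the default ini into a project directory under a dated name.
# new_project_ini and the naming scheme are illustrative, not part of the pipeline.
# $1: default ini (e.g. the hmf.ini shipped with the pipeline), $2: project directory
new_project_ini() {
    mkdir -p "$2" || return 1
    target="$2/hmf_$(date +%Y%m%d).ini"
    # Refuse to overwrite, so the parameters of an earlier run are never lost
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    cp "$1" "$target" && echo "$target"
}

# Usage:
#   ini=$(new_project_ini hmf.ini path/to/project)
#   (edit "$ini" as needed, then)
#   sh run_HMF.sh -t path/to/tumorBam -n path/to/normalBam -o path/to/outputDir -i "$ini"
```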
You can add your email address if you want to receive a message when the pipeline finishes, detailing whether the steps were successfully completed:
-m email@ebi.ac.uk
Full usage is:
Usage: run_HMF.sh [options] -t <tumor.bam> -n <normal.bam> -o <outputDir>
Required parameters:
-t/--tumorBam: path to tumor BAM file.
-n/--normalBam: path to normal BAM file.
-o/--outputDir: path to output directory (will be created).
Optional parameters:
-h/--help: show this usage help.
-i/--iniFile: path to ini file [/hps/research1/icortes/jespejo/hmf-pipeline/hmf.ini]
-m/--mail: Add an email to send a final report on the pipeline []
-r/--reference: reference genome to use [/hps/research1/icortes/DATA/hg38/Homo_sapiens_assembly38.fasta]
--id: Specific ID to append to job names [random string]
Please note that if a step fails partway through the pipeline, re-running it will pick up where it failed. Here is an example of how I ran a PCAWG tumor-normal pair (I used the default ini file):
conda activate hmf
sh /hps/research1/icortes/jespejo/hmf-pipeline/run_HMF.sh \
-t test/DO220842/bam/Tumor_SA557318.sorted.bam \
-n test/DO220842/bam/Normal_SA557554.sorted.bam \
-o test/DO220842/hmf_full/ \
-m jespejo@ebi.ac.uk
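The pick-up-where-it-failed behaviour mentioned above can be pictured as a done-file checkpoint. This is a minimal sketch of the general pattern, not the pipeline's own code: run_step and the "<step>.done" naming are assumptions.

```shell
# Sketch of a done-file checkpoint. run_step and the "<step>.done" naming are
# assumptions for illustration; the pipeline's own marker files may differ.
# $1: directory holding the markers, $2: step name, rest: command to run
run_step() {
    done_dir=$1
    step=$2
    shift 2
    if [ -e "${done_dir}/${step}.done" ]; then
        echo "skip ${step} (already done)"
        return 0
    fi
    # Run the step; write the marker only if it exits successfully
    "$@" && touch "${done_dir}/${step}.done"
}

# Example with placeholder commands:
#   run_step logs stepA true     # runs, writes logs/stepA.done
#   run_step logs stepB false    # fails, no marker is written
#   run_step logs stepA true     # prints "skip stepA (already done)"
```

Because a marker is only written after a successful exit, a failed step is simply run again on the next attempt.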
The HMF pipeline runs a number of specific tools and then everything comes together with PURPLE. Therefore, that's primarily where you need to look for the final output files.
The first step checks that all the tools needed are in the path. It also assigns a random 10-character string to each individual run, so that job dependencies from different runs don't collide. It creates the output directory with a log directory inside, and writes a version.log file with the package versions. It also gets the tumor and normal sample names from the corresponding BAM files.
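A rough sketch of what such a preflight step can look like is below. All function names are hypothetical, and reading the sample name from the SM field of the first @RG header line is my assumption about how the names are obtained.

```shell
# Sketch of a preflight step. All function names are hypothetical.

# Return non-zero and name the first tool that is not on PATH.
check_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done
}

# Print a random 10-character alphanumeric run ID.
random_id() {
    head -c 512 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | LC_ALL=C cut -c1-10
}

# Read a SAM header on stdin and print the SM value of the first @RG line
# (assumption: the sample name is stored in the read group SM field).
sample_from_header() {
    awk -F'\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) { print substr($i, 4); exit }
    }'
}

# Usage (samtools is assumed to be available in the environment):
#   check_tools samtools PURPLE || exit 1
#   run_id=$(random_id)
#   mkdir -p path/to/outputDir/logs
#   tumor_name=$(samtools view -H path/to/tumorBam | sample_from_header)
```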
SAGE is a somatic SNV, MNV and indel caller. Details are in: https://github.com/hartwigmedical/hmftools/blob/master/sage/README.md The SAGE output is also annotated with SnpEff and with the HMF-PON.
SAGE can also perform germline calling. Details are in: https://github.com/hartwigmedical/hmftools/blob/master/sage/GERMLINE.md The SAGE output is filtered as described in the link and annotated with SnpEff.
Amber checks the B-allele frequency (BAF) of the tumor/normal pair at likely heterozygous loci. Details in: https://github.com/hartwigmedical/hmftools/blob/master/amber/README.md
Cobalt checks the read depth of the tumor/normal pairs while taking into account GC content. Read more at https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md
GRIDSS is an SV caller. It will call SVs on the tumor and the normal jointly, which will then be filtered for somatic SVs downstream. The output is also annotated with repeat regions and viral integration evidence. You can read more about GRIDSS in: https://github.com/PapenfussLab/gridss
GRIPSS applies a set of filtering and post-processing steps to the GRIDSS paired tumor-normal output to produce a high-confidence set of somatic SVs for a tumor sample, written as a somatic VCF. You can read more at: https://github.com/hartwigmedical/hmftools/blob/master/gripss/README.md
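As a quick sanity check on a filtered VCF, you can count the records that pass all filters. This is a small sketch with a hypothetical helper name; it reads uncompressed VCF text on stdin, and the file name in the usage line is a placeholder.

```shell
# Sketch: count records whose FILTER column (7th field) is PASS.
# count_pass is a hypothetical helper; it reads uncompressed VCF text on stdin.
count_pass() {
    awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}

# Usage (the file name is a placeholder):
#   gzip -dc path/to/somatic.vcf.gz | count_pass
```

Note that VCFs in breakend notation usually list an SV as two mate records, so this counts records, not events.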
PURPLE is a purity-ploidy estimator, but also a CNA caller, and it integrates all the data from the tools upstream. It will generate annotated somatic SNV and SV VCF files with CNA information. It also generates sample QC (purity, ploidy, WGD, microsatellite status, contamination), detailed purity estimation files, segmented copy number estimation, copy number per gene and a driver catalog. It also generates circos plots with a lot of information, and informative model-fitting charts. Everything is well explained here: https://github.com/hartwigmedical/hmftools/blob/master/purple/README.md
LINX is an annotation, interpretation and visualisation tool for structural variants. The primary function of LINX is grouping individual SV calls into distinct events, then classifying and annotating each event to understand both its mechanism and genomic impact. Read more at: https://github.com/hartwigmedical/hmftools/blob/master/sv-linx/README.md
Mosdepth will create coverage distributions for the normal and the tumour samples. Coverage is computed in 10 kb windows genome-wide. Read more at: https://github.com/brentp/mosdepth
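To pull a single value out of one of the tab-separated PURPLE outputs, a header-based lookup is handy. The sketch below is generic; the helper name tsv_column is made up, and the file and column names in the usage line are assumptions that you should check against the PURPLE README for your version.

```shell
# Sketch: print one named column from a tab-separated file with a header row.
# tsv_column is a hypothetical helper; $1 is the header name to look up.
tsv_column() {
    awk -F'\t' -v col="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { print $c }'
}

# Usage (file name and column name are assumptions, check your PURPLE output):
#   tsv_column purity < path/to/outputDir/purple/SAMPLE.purple.purity.tsv
```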
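For a quick look at windowed coverage, the sketch below computes a length-weighted mean depth from mosdepth-style region output (4-column BED: chrom, start, end, mean depth). The helper name is hypothetical, and whether the pipeline calls mosdepth with exactly the options shown in the usage comment is an assumption.

```shell
# Sketch: length-weighted mean depth from 4-column BED text
# (chrom, start, end, mean depth) read on stdin. mean_window_depth is hypothetical.
mean_window_depth() {
    awk -F'\t' '{ len = $3 - $2; bases += len; sum += len * $4 }
        END { if (bases > 0) printf "%.2f\n", sum / bases; else print "NA" }'
}

# Usage (prefix and BAM path are placeholders; the exact options used by the
# pipeline are an assumption):
#   mosdepth --by 10000 --no-per-base path/to/prefix path/to/tumorBam
#   gzip -dc path/to/prefix.regions.bed.gz | mean_window_depth
```

Weighting by window length keeps the shorter windows at chromosome ends from distorting the average.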
The pipeline will generate gigabytes of intermediate files, mostly through GRIDSS. If you are sure your pipeline finished successfully, you can clean up the output directory by removing these files with:
sh clean_HMF.sh -o <outputDir>
Usage: clean_HMF.sh [options] -o <outputDir>
Cleans up the HMF run directory, leaving just the input files needed to regenerate PURPLE and LINX output.
Required parameters:
-o/--outputDir: path to output directory of the HMF run
Optional parameters:
-h/--help: show this usage help.
-f/--force: Ignore done files sanity check and force removal.
One library in the conda environment is newer than the version circos expects (conda installs libwebp.so.7, while circos looks for libwebp.so.6). To get around it you need to find your conda environment and make a symlink:
conda activate hmf
condaBin=$(dirname $(which PURPLE))
ln -sf ${condaBin}/../lib/libwebp.so.7 ${condaBin}/../lib/libwebp.so.6
You can always check the status of the dependencies for a particular job with:
bjdepinfo <job ID>
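For context, LSF expresses such dependencies as conditions passed to bsub -w, for example done("jobName"). Below is a minimal sketch of how a dependency expression can be built from job names. The helper dep_expr, the job names and purple_step.sh are placeholders, not the names the pipeline actually uses.

```shell
# Sketch: build an LSF dependency expression from job names.
# dep_expr is a hypothetical helper; the job names below are placeholders.
dep_expr() {
    expr_out=""
    for job in "$@"; do
        # done("name") is satisfied once that job has finished successfully
        if [ -z "$expr_out" ]; then
            expr_out="done(\"${job}\")"
        else
            expr_out="${expr_out} && done(\"${job}\")"
        fi
    done
    printf '%s\n' "$expr_out"
}

# Usage: submit a job that waits for two upstream jobs sharing a run ID
#   run_id=abcDEF1234
#   bsub -J "purple_${run_id}" -w "$(dep_expr "amber_${run_id}" "cobalt_${run_id}")" "sh purple_step.sh"
```

Appending a per-run ID to every job name is what keeps the done(...) conditions of two simultaneous runs from matching each other's jobs.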
If an upstream job has run out of memory or failed for any other reason, there are two options:
- Remove the job from the queue with
bkill <jobID>
and resubmit the pipeline giving it more memory or after solving the error.
- Run the job manually (commands are in the log folder) and remove the dependencies for the stuck job using:
bmodify -wn <jobID>
The second option is sometimes nicer if one of the last steps of the pipeline fails; if the failure is further upstream, it may be more complex.