This repository contains WDL workflows created by the Miga lab and collaborators to annotate centromeric satellites.
If you haven't run WDL before, you'll need a workflow engine such as miniwdl, Toil, or Cromwell to run our WDL workflows on the command line.
Cromwell requires Java version 11 or later.
Take a look at the Cromwell documentation before you get started.
Download the latest version of Cromwell and make it executable (replace XY with the newest version number):
chmod +x cromwell-XY.jar
Now you can run any WDL workflow on the command line with this template:
java -jar cromwell-XY.jar run workflow.wdl -i inputs.json > cromwellLog.txt
I recommend redirecting the output to a log file to make troubleshooting easier. If you would rather have everything printed to your terminal, remove the > cromwellLog.txt part of the command.
Each of our workflows has an example inputs.json in its directory. Here is a template for what the input JSON files should look like:
cat inputs.json
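The example inputs.json shipped with each workflow remains the authoritative template. As a rough illustration of the general shape Cromwell expects, input keys are written as workflowName.inputName. The sketch below builds such a file in Python. The workflow name exampleWorkflow and the input names assemblyFasta and sampleName are hypothetical placeholders, not the names used by the workflows in this repository.

```python
import json


def build_inputs(workflow_name, inputs):
    """Prefix every input name with the workflow name (workflowName.inputName)."""
    return {f"{workflow_name}.{key}": value for key, value in inputs.items()}


# Hypothetical workflow and input names, for illustration only.
example = build_inputs(
    "exampleWorkflow",
    {"assemblyFasta": "path/to/assembly.fa", "sampleName": "sample1"},
)

# Serialize to the JSON text that would be saved as inputs.json.
inputs_json = json.dumps(example, indent=2)
print(inputs_json)
```

Check the keys against the example inputs.json in the workflow's directory before running, since a key with the wrong workflow prefix will not match any declared input.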
Any required inputs for our workflows, other than the assembly to be annotated, can be found in utilities/
To create a completed CenSat annotation bed file, run the cenSatAnnotation workflow.
This workflow will execute all other required workflows in this repo.
Six main workflows are run. All except the finalization workflow can also be run individually.
- AlphaSat annotation - This script runs Fedor Ryabov's HumAS-HMMER and summarizes the alpha satellite annotations into the following bins: active HOR, HOR, diverged HOR, and monomeric.
- HSAT2/3 annotation - This is Nick Altemose's HSAT annotation script.
- RepeatMasker Annotation - This script runs RepeatMasker on each contig from the assembly and converts the output into a bed file.
- rDNA Annotation Script - This script uses an HMM built from the beginning and the end of the rDNA repeat unit and merges the hits to find the complete annotation.
- GAP Annotation Script - This script annotates the gaps that exist in the assembly using seqtk gap.
- CenSat Annotation finalization script - This script takes the file outputs of the scripts above and combines them into a single output file. It includes logic that joins the satellites annotated by RepeatMasker, annotates the active centromere and centromere transition regions, and adds colors for easier visualization. This workflow must be run as part of the cenSatAnnotation workflow and can't be run without the outputs of the scripts above.
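To make the GAP annotation step in the list above more concrete, here is a minimal Python sketch of the underlying idea: report each run of N bases in a contig as a BED-style interval. This is not the seqtk implementation and is not part of the workflow. The function name find_gaps and its min_length parameter are hypothetical.

```python
import re


def find_gaps(name, sequence, min_length=1):
    """Return (contig, start, end) tuples for each run of N bases.

    Coordinates are 0-based and half-open, as in BED files.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((name, match.start(), match.end()))
    return gaps


# Toy contig with two gaps of 4 and 2 bases.
gaps = find_gaps("contig1", "ACGTNNNNACGTNNAC")
for contig, start, end in gaps:
    print(f"{contig}\t{start}\t{end}")
```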
# clone the entire repo
git clone
# switch into correct directory
cd alphaAnnotation/cenSatAnnotation/
# run the workflow - running without changing inputs file will run on test data
# make sure to substitute the file path to the correct cromwell version
java -jar path/to/cromwell-XY.jar run centromereAnnotation.wdl -i inputs.json > cenSattest.txt
Alpha-Satellites - Annotated with Fedor Ryabov’s HumAS-HMMER and simplified into the following bins:
- Active alpha (active_hor)
- Diverged HORs (dhor)
- Monomeric HORs (mon)
- Mixed alpha (mixedAlpha) - alpha regions that can't be sorted into the above categories
Human Satellites 2 and 3 - Annotated with Nick Altemose’s HSAT2/3 script
Other Centromeric Satellite annotations - Annotated with RepeatMasker
- Gamma - includes all GSAT and TAR1 in Dfam
- Beta - includes BSR, LSAU, and BSAT in Dfam
- CenSat - other centromeric satellites: CER, SATR, SST1, ACRO, HSAT4, HSAT5, TAF11
Centromere Transition (ct)
Centromeres are defined by merging all of the above satellite annotations that lie within 2 Mb of each other (bedtools merge) and then identifying the merged region that contains the active array. Any stretch of sequence within this region that has no annotation is marked "ct".
For more information, please refer to this Google document.
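The centromere definition above can be sketched in Python as two steps: merge satellite intervals that are close together, then report the unannotated stretches inside the chosen region as "ct". This is an illustration only, not the workflow's code. The function names are hypothetical, 2 Mb is taken to mean 2,000,000 bases, and the sketch assumes the active array lies in the first merged region.

```python
def merge_within(intervals, max_gap):
    """Merge (chrom, start, end) intervals on the same chromosome that are
    at most max_gap bases apart, in the spirit of bedtools merge -d."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def unannotated(region, intervals):
    """Return the stretches of region not covered by any interval, labeled "ct"."""
    chrom, pos, region_end = region
    gaps = []
    for c, start, end in sorted(intervals):
        if c != chrom or end <= pos or start >= region_end:
            continue  # other chromosome, already covered, or outside the region
        if start > pos:
            gaps.append((chrom, pos, start, "ct"))
        pos = max(pos, end)
    if pos < region_end:
        gaps.append((chrom, pos, region_end, "ct"))
    return gaps


# Toy satellite annotations: two nearby arrays and one distant array.
sats = [
    ("chr1", 1_000_000, 2_000_000),
    ("chr1", 2_500_000, 3_000_000),
    ("chr1", 10_000_000, 10_500_000),
]
regions = merge_within(sats, 2_000_000)
# Assumption: the active array is in the first merged region.
ct = unannotated(regions[0], sats)
print(regions)
print(ct)
```

In this toy example the first two arrays merge into one region, and the 0.5 Mb stretch between them is the only part labeled "ct".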
A.F.A. Smit, R. Hubley & P. Green RepeatMasker at