a pangenome-scale aligner
wfmash
is an aligner for pangenomes based on sparse homology mapping and wavefront inception.
wfmash
uses a variant of MashMap to find large-scale sequence homologies.
It then obtains base-level alignments using WFA, via the wflign
hierarchical wavefront alignment algorithm.
wfmash
is designed to make whole genome alignment easy. On a modest compute node, whole genome alignments of gigabase-scale genomes should take minutes to hours, depending on sequence divergence.
It can handle high sequence divergence, with average nucleotide identity between input sequences as low as 70%.
wfmash
is the key algorithm in pggb
(the PanGenome Graph Builder), where it is applied to make an all-to-all alignment of input genomes that defines the base structure of the pangenome graph.
It can scale to support the all-to-all alignment of hundreds of human genomes.
Each query sequence is broken into non-overlapping pieces defined by -s[N], --segment-length=[N]
.
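As a rough illustration of the segmentation step, the sketch below cuts a toy query into non-overlapping pieces of a fixed length. The function name `split_into_segments` is hypothetical, and keeping the shorter final piece is an assumption; the text does not say how wfmash treats a trailing piece shorter than the segment length.

```python
from typing import List, Tuple


def split_into_segments(sequence: str, segment_length: int = 1000) -> List[Tuple[int, str]]:
    """Return (start_offset, piece) pairs that cover `sequence` without overlap."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        (start, sequence[start:start + segment_length])
        for start in range(0, len(sequence), segment_length)
    ]


# A 2,500 bp toy query split with a 1 kb segment length.
pieces = split_into_segments("ACGT" * 625, segment_length=1000)
piece_summary = [(start, len(piece)) for start, piece in pieces]
print(piece_summary)
```

Each piece keeps its start offset so that a mapping found for a segment can be placed back on the original query coordinates.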
These segments are then mapped using MashMap's mapping algorithm.
Unlike MashMap, wfmash
merges aggressively across large gaps, finding the best neighboring segment up to -c[N], --chain-gap=[N]
base-pairs away.
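The merging idea can be sketched with a simple greedy pass. This is not wfmash's chaining code: it only handles forward-strand mappings on a single target, it joins each mapping to the chain that precedes it instead of searching for the best neighbor, and the `chain_gap` value used here is purely illustrative.

```python
from typing import List, Tuple

# (query_start, query_end, target_start, target_end), forward strand only
Mapping = Tuple[int, int, int, int]


def chain_mappings(mappings: List[Mapping], chain_gap: int) -> List[Mapping]:
    """Greedily merge consecutive mappings whose gaps are below chain_gap."""
    chains: List[Mapping] = []
    for q_start, q_end, t_start, t_end in sorted(mappings):
        if chains:
            cq_start, cq_end, ct_start, ct_end = chains[-1]
            q_gap = q_start - cq_end
            t_gap = t_start - ct_end
            if 0 <= q_gap < chain_gap and 0 <= t_gap < chain_gap:
                # Close enough on both query and target: extend the chain.
                chains[-1] = (cq_start, q_end, ct_start, t_end)
                continue
        chains.append((q_start, q_end, t_start, t_end))
    return chains


segment_mappings = [
    (0, 1000, 5000, 6000),
    (1000, 2000, 6200, 7200),    # 200 bp gap on the target: merged
    (2000, 3000, 90000, 91000),  # far away on the target: starts a new chain
]
merged = chain_mappings(segment_mappings, chain_gap=2000)
print(merged)
```

A larger chain gap produces fewer, longer merged mappings, at the cost of spanning larger unaligned regions.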
Each mapping location is then used as a target for alignment using the wavefront inception algorithm in wflign
.
The resulting alignments always contain extended CIGARs in the cg:Z:*
tag.
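For downstream scripts, the extended CIGAR can be pulled out of a PAF record with a few lines of Python. The sketch below assumes the standard PAF layout (12 mandatory columns followed by optional tags); the record shown is a made-up example, not real wfmash output, and the helper names are hypothetical.

```python
import re
from typing import Dict, Optional

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def extract_cigar(paf_line: str) -> Optional[str]:
    """Return the value of the cg:Z: tag, or None if the record has no CIGAR."""
    for field in paf_line.rstrip("\n").split("\t")[12:]:
        if field.startswith("cg:Z:"):
            return field[len("cg:Z:"):]
    return None


def count_ops(cigar: str) -> Dict[str, int]:
    """Sum the lengths of each CIGAR operation."""
    counts: Dict[str, int] = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts


# Hypothetical record: 98 matches, 1 mismatch, 1 deletion, 1 insertion.
paf_line = "\t".join([
    "q1", "100", "0", "100", "+", "t1", "200", "50", "150",
    "98", "101", "60", "cg:Z:50=1X20=1D27=1I1=",
])
cigar = extract_cigar(paf_line)
ops = count_ops(cigar)
identity = ops.get("=", 0) / sum(ops.values())
print(cigar, ops, round(identity, 4))
```

Because the CIGAR is extended, matches (`=`) and mismatches (`X`) are distinguished, so an identity estimate can be computed without re-reading the sequences.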
Approximate mappings can be obtained with -m, --approx-map
.
Sketching, mapping, and alignment are all run in parallel using a configurable number of threads.
The number of threads must be set manually, using -t
, and defaults to 1.
wfmash
has been developed to accelerate the alignment step in variation graph induction (the first step in the seqwish
/ smoothxg
pipeline).
Suitable default settings are provided for this purpose.
Seven parameters shape the length, number, identity, and alignment divergence of the resulting mappings.
These parameters affect the structure of the mappings:
- -s[N], --segment-length=[N] is the length of the mapping seed (default: 1k).
- The best pairs of consecutive segment mappings are merged where separated by less than -c[N], --chain-gap=[N] bases.
- -l[N], --block-length-min=[N] requires seed mappings in a merged mapping to sum to more than the given length (default: 5kb).
- -p[%], --map-pct-id=[%] is the minimum percentage identity in the mapping step.
- -n[N], --n-secondary=[N] is the maximum number of mappings (and alignments) to report for each segment above --block-length-min. The number of mappings for sequences shorter than the segment length is defined by -S[N], --n-short-secondary=[N], and defaults to 1.
By default, we obtain base-level alignments by applying a high-order version of WFA to the mappings.
Various settings affect the behavior of the pairwise alignment, but in general the alignment parameters are adjusted based on expected divergence between the mapped subsequences.
Specifying -m, --approx-map
lets us stop before alignment and obtain the approximate mappings (akin to minimap2
without -c
).
Together, these settings allow us to precisely define an alignment space to consider.
During all-to-all mapping, -X
can additionally help us by removing self mappings from the reported set, and -Y
extends this capability to prevent mapping between sequences with the same name prefix.
When working with large sequence collections we frequently use the PanSN naming convention and -Y'#'
to specify that we want to group mappings by prefix, which in this context means genome or haplotype groupings.
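To make the grouping concrete, the sketch below shows one way a name prefix could be derived from PanSN-style names (sample#haplotype#contig). It assumes the prefix is everything up to and including the last occurrence of the delimiter; that rule and the helper names are assumptions for illustration, not a description of wfmash internals.

```python
def group_prefix(name: str, delimiter: str = "#") -> str:
    """Return `name` up to and including the last delimiter (whole name if absent)."""
    head, sep, _ = name.rpartition(delimiter)
    return head + sep if sep else name


def same_group(query_name: str, target_name: str, delimiter: str = "#") -> bool:
    """True when both sequences share a prefix, i.e. the pair would be skipped."""
    return group_prefix(query_name, delimiter) == group_prefix(target_name, delimiter)


print(group_prefix("HG002#1#chr1"))                # HG002#1#
print(same_group("HG002#1#chr1", "HG002#1#chr2"))  # same haplotype
print(same_group("HG002#1#chr1", "HG002#2#chr1"))  # different haplotypes
```

Under this assumption, contigs from the same sample and haplotype fall into one group, while the two haplotypes of a sample remain free to map to each other.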
wfmash
requires a FASTA index (.fai
) for its reference ("target"), and benefits if both reference and query are indexed.
We can build these indexes on BGZIP-indexed files, which we recommend due to their significantly smaller size.
To index your sequences, we suggest something like:
bgzip -@ 16 ref.fa
samtools faidx ref.fa.gz
Here, we apply bgzip
(from htslib
) to build a line-indexable gzip file, and then use samtools
to generate the FASTA index, which is held in 2 files (.gzi and .fai) next to the compressed FASTA:
ls ref.fa.gz*
ref.fa.gz
ref.fa.gz.gzi
ref.fa.gz.fai
Map a set of query sequences against a reference genome:
wfmash reference.fa query.fa >aln.paf
Setting a longer segment length forces the alignments to be more collinear:
wfmash -s 20k reference.fa query.fa >aln.paf
Self-mapping of sequences:
wfmash -X query.fa query.fa >aln.paf
Or just
wfmash query.fa >aln.paf
wfmash
provides a progress log that estimates time to completion.
This depends on determining the total query sequence length.
To prevent lags when starting a mapping process, users should apply samtools faidx
to index query and target FASTA sequences.
The .fai
indexes are then used to quickly compute the sum of query lengths.
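The reason an index makes this fast is that the second column of a .fai file already holds each sequence's length, so no sequence data has to be read. A minimal sketch, using a made-up two-record index and a hypothetical helper name:

```python
import io
from typing import TextIO


def total_sequence_length(fai: TextIO) -> int:
    """Sum the LENGTH column (second field) of a FASTA index."""
    total = 0
    for line in fai:
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.rstrip("\n").split("\t")
        total += int(fields[1])
    return total


# Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH (toy values).
example_fai = io.StringIO("seqA\t1000\t6\t60\t61\nseqB\t2500\t1029\t60\t61\n")
total = total_sequence_length(example_fai)
print(total)
```

With a real file, the same function can be called on `open("query.fa.gz.fai")`, where the file name is only an example.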
We provide static builds of wfmash releases targeted at the x86-64-v3
instruction set.
wfmash
recipes for Bioconda are available at https://anaconda.org/bioconda/wfmash.
To install the latest version using Conda
execute:
conda install -c bioconda wfmash
The build process for wfmash
is managed using CMake
, providing various options to customize the build.
Before building wfmash
, you need the following dependencies installed on your system:
- GCC (version 9.3.0 or higher) or a recent version of Clang/LLVM
- CMake
- Zlib
- GSL
- HTSlib
- LibLZMA
- BZip2
- Threads
- OpenMP
On Ubuntu >20.04, these dependencies can be installed with the following command:
sudo apt install build-essential cmake zlib1g-dev libgsl-dev libhts-dev liblzma-dev libbz2-dev
Clone the wfmash
repository:
git clone https://github.com/waveygang/wfmash.git
cd wfmash
wfmash
provides several CMake options to customize the build process:
- BUILD_STATIC (default: OFF): Build a static binary.
- BUILD_DEPS (default: OFF): Build external dependencies (htslib, gsl, libdeflate) from source. Use this if system libraries are not available or you want to use specific versions. HTSlib will be built without curl support, which removes a warning for static compilation related to dlopen.
- BUILD_RETARGETABLE (default: OFF): Build a retargetable binary. When this option is enabled, the binary will not include machine-specific optimizations (-march=native).
These can be mixed and matched.
To build wfmash
using system libraries:
cmake -H. -Bbuild && cmake --build build -- -j 8
This command will configure and build wfmash
in the build
directory, using as many cores as you specify with the -j
option.
If you need to build with external dependencies, use the BUILD_DEPS
option:
cmake -H. -Bbuild -DBUILD_DEPS=ON && cmake --build build -- -j 8
This will download and build the necessary external dependencies.
To build a static binary, use the BUILD_STATIC
option:
cmake -H. -Bbuild -DBUILD_STATIC=ON && cmake --build build -- -j 16
To build a retargetable binary, use the BUILD_RETARGETABLE
option:
cmake -H. -Bbuild -DBUILD_RETARGETABLE=ON && cmake --build build -- -j 8
This will configure the build without -march=native
, allowing the binary to be run on different types of machines.
After building, you can install wfmash
using:
cmake --install build
This will install the wfmash
binary and any required libraries to the default installation directory (typically /usr/local/bin
for binaries).
To build and run tests:
cmake --build build --target test
If you need to avoid machine-specific optimizations, use the CMAKE_BUILD_TYPE=Generic
build type:
cmake -H. -Bbuild -D CMAKE_BUILD_TYPE=Generic && cmake --build build -- -j 8
The resulting binary should be compatible with all x86 processors.
To enable the functionality of emitting wavefront plots (in PNG format), tables (in TSV format), and timing information, add the -DWFA_PNG_TSV_TIMING=ON
option:
cmake -H. -Bbuild -D CMAKE_BUILD_TYPE=Release -DWFA_PNG_TSV_TIMING=ON && cmake --build build -- -j 3
Note that this may make the tool a little bit slower.
If you have nix
, you can install directly from the repository via:
nix profile install github:waveygang/wfmash
For local development, from the wfmash repo directory:
nix build .#wfmash
And you can install into your profile from the source repo with:
nix profile install .#wfmash
If you have guix
:
guix build -f guix.scm
Nix is also able to build a Docker image, which can then be loaded by Docker and converted to a Singularity image:
nix build .#dockerImage
docker load < result
singularity build wfmash.sif docker-daemon://wfmash-docker:latest
This can be run with Singularity like this:
singularity run wfmash.sif $ARGS
Where $ARGS
are your typical command line arguments to wfmash
.
First, clone the guix-genomics repository:
git clone https://github.com/ekg/guix-genomics
And install the wfmash
package to your default GUIX environment:
GUIX_PACKAGE_PATH=. guix package -i wfmash
Now wfmash
is available as a global binary installation.
Add the following to your ~/.config/guix/channels.scm
:
(cons*
(channel
(name 'guix-genomics)
(url "https://github.com/ekg/guix-genomics.git")
(branch "master"))
%default-channels)
First, pull all the packages, then install wfmash
to your default GUIX environment:
guix pull
guix package -i wfmash
If you want to build an environment only consisting of the wfmash
binary, you can do:
guix environment --ad-hoc wfmash
For more details about how to handle Guix channels, go to https://git.genenetwork.org/guix-bioinformatics/guix-bioinformatics.git.
When aligning a large number of very large sequences, one wants to distribute the calculations across a whole cluster.
This can be achieved by dividing the approximate mappings .paf
into chunks of similarly difficult alignment problems using split_approx_mappings_in_chunks.py.
- We restrict
wfmash
to its approximate mapping phase.
wfmash -m reference.fa query.fa > approximate_mappings.paf
- We use the Python script to split the approximate mappings into chunks. A good approximation of the number of chunks is the number of nodes on your cluster. In the following, we assume a cluster with 5 nodes.
python3 split_approx_mappings_in_chunks.py approximate_mappings.paf 5
This gives us:
ls
approximate_mappings.paf.chunk_0.paf
approximate_mappings.paf.chunk_1.paf
approximate_mappings.paf.chunk_2.paf
approximate_mappings.paf.chunk_3.paf
approximate_mappings.paf.chunk_4.paf
- Depending on your cluster workload manager, create a command line to submit 5 jobs to your cluster.
One example without specifying a workload manager:
wfmash -i approximate_mappings.paf.chunk_0.paf reference.fa query.fa > approximate_mappings.paf.chunk_0.paf.aln.paf
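The idea of balancing chunks by difficulty can be sketched as follows. This is not the algorithm used by split_approx_mappings_in_chunks.py; it is a generic greedy balancer that uses the PAF alignment block length (column 11) as a stand-in for alignment cost, which is an assumption. All function names and the toy records are hypothetical.

```python
import heapq
from typing import List


def block_length(paf_line: str) -> int:
    """PAF column 11: alignment block length, used here as a cost proxy."""
    return int(paf_line.split("\t")[10])


def split_into_chunks(paf_lines: List[str], n_chunks: int) -> List[List[str]]:
    """Assign the heaviest remaining mapping to the currently lightest chunk."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    heap = [(0, index) for index in range(n_chunks)]  # (load, chunk index)
    heapq.heapify(heap)
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for line in sorted(paf_lines, key=block_length, reverse=True):
        load, index = heapq.heappop(heap)
        chunks[index].append(line)
        heapq.heappush(heap, (load + block_length(line), index))
    return chunks


def toy_mapping(name: str, block: int) -> str:
    """Build a made-up 12-column PAF record with the given block length."""
    return "\t".join([name, "10000", "0", str(block), "+", "ref", "100000",
                      "0", str(block), str(block), str(block), "60"])


lines = [toy_mapping("q1", 9000), toy_mapping("q2", 5000),
         toy_mapping("q3", 4000), toy_mapping("q4", 1000)]
chunks = split_into_chunks(lines, 2)
loads = [sum(block_length(line) for line in chunk) for chunk in chunks]
print(loads)
```

Balancing by an estimated cost instead of by record count keeps one node from receiving all the long mappings while the others finish early.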
The resulting .paf
can be directly plugged into seqwish.
# list all base-level alignment PAFs
PAFS=$(ls *.aln.paf | tr '\n' ',')
# trim off the trailing ','
PAFS=${PAFS::-1}
seqwish -s reference.fa -p $PAFS -g seqwish.gfa
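When this step is driven from a script, the comma-separated list can be assembled without trimming a trailing separator. The sketch below builds the same seqwish command as an argument list, using only the options shown above (-s, -p, -g); the function name and file names are illustrative.

```python
from typing import List


def seqwish_command(paf_paths: List[str], sequences: str, out_gfa: str) -> List[str]:
    """Build the seqwish argument list from a collection of alignment PAFs."""
    if not paf_paths:
        raise ValueError("no PAF files given")
    return ["seqwish", "-s", sequences,
            "-p", ",".join(sorted(paf_paths)),
            "-g", out_gfa]


cmd = seqwish_command(
    ["approximate_mappings.paf.chunk_1.paf.aln.paf",
     "approximate_mappings.paf.chunk_0.paf.aln.paf"],
    "reference.fa",
    "seqwish.gfa",
)
print(" ".join(cmd))
```

In practice the list of paths could come from `glob.glob("*.aln.paf")`, and the command could be executed with `subprocess.run(cmd, check=True)`.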
If you have Nextflow
and Docker
or Singularity
available on your cluster, the lines above can become a one-liner:
nextflow run nf-core/pangenome -r dev --input references.fa --wfmash_only --wfmash_chunks 5
This emits a results/wfmash
folder which stores all the wfmash
output.
- Santiago Marco-Sola, Jordan M. Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.
- Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. "Fast gap-affine pairwise alignment using the wavefront algorithm". Bioinformatics, 2020.
- Chirag Jain, Sergey Koren, Alexander Dilthey, Adam M. Phillippy, and Srinivas Aluru. "A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps". Bioinformatics (ECCB issue), 2018.
- Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. "A fast approximate algorithm for mapping long reads to large reference databases". In International Conference on Research in Computational Molecular Biology, Springer, Cham, 2017.