Flagtastrophe #290

Merged
merged 96 commits into from
Nov 8, 2024
25a06b0
fix: Update CLI interface for wfmash
ekg Nov 1, 2024
49ac285
fix: feat: Add missing variable declarations and remove duplicate opt…
ekg Nov 1, 2024
1948c6b
feat: remove duplicate no_merge flag declaration in parse_args.hpp
ekg Nov 1, 2024
de74a57
fix: remove duplicate approx_mapping flag and declare kmer_size
ekg Nov 1, 2024
8135f96
fix: Add approx_mapping flag to parse_args.hpp
ekg Nov 1, 2024
2b19ea5
fix: change short option for approx_mapping to -m
ekg Nov 1, 2024
b94feca
feat: Add short option -k for kmer-size and -w for sketch-size
ekg Nov 1, 2024
f4586b3
Here is the one-line commit message based on the provided diffs:
ekg Nov 1, 2024
b22faa7
feat: add num_mappings_for_short_seq, lower_triangular, and skip_self…
ekg Nov 1, 2024
4e48999
fix: Remove unused CLI arguments and set default values directly
ekg Nov 1, 2024
9a32fbd
fix: Fix SEARCH/REPLACE block to exactly match src/interface/parse_ar…
ekg Nov 1, 2024
86948d7
feat: change short option for --one-to-one from -4 to -o
ekg Nov 1, 2024
e43ac7d
fix: remove drop-low-id parameter
ekg Nov 1, 2024
2a1b69e
feat: add default parameter for keep_low_pct_id
ekg Nov 1, 2024
4942f70
fix: Update help text formatting
ekg Nov 1, 2024
20ae047
feat: Reorganize help text in parse_args.hpp
ekg Nov 1, 2024
34be435
fix: Add missing options_group to argument parser
ekg Nov 1, 2024
d2ae928
feat: Add improved argument parser formatting
ekg Nov 1, 2024
0cf4dc5
fix: remove invalid `usageindent` parameter from `parse_args.hpp`
ekg Nov 1, 2024
842fda1
feat: Add version and help flags to system options group
ekg Nov 1, 2024
69a1d24
feat: Remove unnecessary indentation from wfmash help output
ekg Nov 1, 2024
17cd72a
feat: Adjust help text formatting to match minimap2's style
ekg Nov 1, 2024
40bc949
fix: remove duplicate "Options:" header
ekg Nov 1, 2024
904cd04
feat: Improve help text conciseness
ekg Nov 1, 2024
9d314ec
fix: remove extra "wfmash" line from help output
ekg Nov 1, 2024
99b7574
feat: Add lower triangular option to mapping parameters
ekg Nov 1, 2024
cb189e4
feat: compress query prefix option text
ekg Nov 1, 2024
d697002
fix: Update default group prefix and help text
ekg Nov 1, 2024
944d984
feat: add default group prefix character to help text
ekg Nov 1, 2024
4a876b8
feat: use shorter param name for WFA scoring
ekg Nov 1, 2024
148091a
feat: update hypergeometric filter help text
ekg Nov 1, 2024
a382bac
feat: Update hypergeometric filter parameter names in help text
ekg Nov 1, 2024
07ba04d
feat: update parameter name to "ani-Δ" in help text
ekg Nov 1, 2024
af1b7d8
feat: update help text for group prefix option to be concise
ekg Nov 1, 2024
c37b8e9
feat: Add -W short option for --write-index
ekg Nov 1, 2024
9e86feb
feat: Improve help text formatting for sequence files
ekg Nov 1, 2024
1f9d9b2
clarify fasta input usage
ekg Nov 1, 2024
c4a20ad
feat: Replace --index with --read-index for symmetry with --write-index
ekg Nov 4, 2024
9232f64
feat: add input-mapping option to alignment section
ekg Nov 4, 2024
fad3c31
feat: Add validation for segment length, block length, and max mappin…
ekg Nov 4, 2024
34ce1c3
test: add input mapping functionality test
ekg Nov 5, 2024
94b1c7c
feat: restrict mapping to S288C and SK1 strains to speed up test
ekg Nov 5, 2024
db8e502
fix: Enforce required target.fa argument in wfmash
ekg Nov 5, 2024
842989c
feat: update mashmap version to 3.5.0
ekg Nov 5, 2024
ece6bcd
build: update mapping test command
ekg Nov 5, 2024
4c72d4f
feat: add compact parameter display format
ekg Nov 5, 2024
c777bee
more options cleanup
ekg Nov 5, 2024
59630f1
fix: replace all '[mashmap]' with '[wfmash]'
ekg Nov 5, 2024
3cea6c3
refactor: Separate mapping and alignment checks in test workflow
ekg Nov 5, 2024
ff18381
fix: Move version flag handling before argument validation
ekg Nov 5, 2024
d8efee0
feat: Add WFMASH_GIT_VERSION to SAM header
ekg Nov 5, 2024
1ae4ab6
feat: Add wflign_git_version.hpp and update generate_dataset.sh
ekg Nov 5, 2024
b8892ff
feat: Add wflign directory to git version generation script
ekg Nov 5, 2024
460dbf4
fix: add command line tag to SAM header version
ekg Nov 5, 2024
bccd311
fix: Update SAM header version string
ekg Nov 5, 2024
df16d0b
fix: replace WFMASH_GIT_VERSION with WFLIGN_GIT_VERSION in wflign_pat…
ekg Nov 5, 2024
9a2612b
feat: Update SAM header version in computeAlignments.hpp
ekg Nov 5, 2024
294ae37
fix: Update logging messages to consistently use '[wfmash::mashmap]'
ekg Nov 5, 2024
b3d0423
chore: update logging messages to use '[wfmash::mashmap]' prefix
ekg Nov 5, 2024
6e406c0
refactor: replace 'mashmap::skch' with 'wfmash::mashmap'
ekg Nov 5, 2024
25086a2
fix: replace search/replace blocks in src/map/include/computeMap.hpp
ekg Nov 5, 2024
0572f55
fix: replace instances of mashmap::mapQuery with wfmash::mashmap
ekg Nov 5, 2024
c106c09
fix: replace 'wfmash::map' with 'wfmash::mashmap'
ekg Nov 5, 2024
721d8fe
fix: compress logging messages in winsketch and main.cpp
ekg Nov 5, 2024
e453081
fix: correct SEARCH/REPLACE block in winSketch.hpp
ekg Nov 5, 2024
ae9ef3d
fix: replace remaining 'mashmap::skch' with 'wfmash::mashmap'
ekg Nov 5, 2024
f4b5730
fix: update SEARCH/REPLACE block to match existing lines in src/map/i…
ekg Nov 5, 2024
247583d
chore: remove initialization messages
ekg Nov 5, 2024
e5f8821
feat: combine sequence and hash/window stats, simplify index computat…
ekg Nov 5, 2024
2cb57a5
feat: Update README to reflect current MashMap 3.5 and WFA usage
ekg Nov 5, 2024
7a30e28
feat: Update README with concise and clear description
ekg Nov 5, 2024
89421e4
feat: Add explanation for mapping length limits in README
ekg Nov 5, 2024
de3eb79
feat: Add length of target subsets in base pairs to output
ekg Nov 5, 2024
3406ccd
fix: add debug output to track sequence size metrics
ekg Nov 5, 2024
8ced912
build: Fix compilation error in parse_args.hpp
ekg Nov 5, 2024
de71486
fix: Update reference size calculation in parse_args.hpp
ekg Nov 5, 2024
7f7c015
chore: comment out debugging output in parse_args.hpp
ekg Nov 5, 2024
4ee3d60
feat: Add target sequence information to log output
ekg Nov 5, 2024
db6ffac
fix: Move logging information before mapping indexing
ekg Nov 5, 2024
7609573
feat: Add logging of target sequence information before indexing
ekg Nov 5, 2024
057a033
unaider
ekg Nov 5, 2024
80f1d0d
feat: Compute and display query and target lengths separately
ekg Nov 5, 2024
49f5182
fix: initialize sequence names before calculating lengths
ekg Nov 5, 2024
3c3d9f8
feat: Move sequence length reporting to after sequence manager initia…
ekg Nov 5, 2024
6bd8b4c
build: Initialize sequence names and calculate lengths in constructor
ekg Nov 5, 2024
9b56f32
build: Improve output conciseness and remove duplicate line
ekg Nov 5, 2024
a95273c
fix: Replace "seqs" with "queries" in output messages
ekg Nov 5, 2024
5e2e245
chore: remove detailed statistics line
ekg Nov 5, 2024
e9f0b07
fix: Set default group prefix delimiter to '#'
ekg Nov 5, 2024
0ccae8d
fix: Remove warning about single file all-vs-all mapping
ekg Nov 5, 2024
804cbde
fix: Set default group prefix delimiter to '#' and enable prefix skip…
ekg Nov 5, 2024
c14658f
feat: add group information to query and target count output
ekg Nov 5, 2024
23bb4c0
refactor: split long log message into multiple lines
ekg Nov 5, 2024
dedb843
fix: Add check for .fai index files before parameter validation
ekg Nov 6, 2024
3653365
Revert "fix: Add check for .fai index files before parameter validation"
ekg Nov 7, 2024
0e123bd
feat: Add parallel index building to improve performance
ekg Nov 7, 2024
14 changes: 14 additions & 0 deletions .github/workflows/test_on_push.yml
@@ -46,6 +46,20 @@ jobs:
run: ASAN_OPTIONS=detect_leaks=1:symbolize=1 LSAN_OPTIONS=verbosity=0:log_threads=1 build/bin/wfmash data/reference.fa.gz data/reads.500bps.fa.gz -s 0.5k -N -a > reads.500bps.sam && samtools view reads.500bps.sam -bS | samtools sort > reads.500bps.bam && samtools index reads.500bps.bam && samtools view reads.500bps.bam | head
- name: Test mapping+alignment with short reads (255bps) (PAF output)
run: ASAN_OPTIONS=detect_leaks=1:symbolize=1 LSAN_OPTIONS=verbosity=0:log_threads=1 build/bin/wfmash data/reads.255bps.fa.gz -w 16 -s 100 -L > reads.255bps.paf && head reads.255bps.paf
- name: Test input mapping functionality
run: |
# First generate mappings
ASAN_OPTIONS=detect_leaks=1:symbolize=1 LSAN_OPTIONS=verbosity=0:log_threads=1 build/bin/wfmash data/scerevisiae8.fa.gz -p 95 -T S288C -Q SK1 -m >mappings.paf
# Then align using the mappings
ASAN_OPTIONS=detect_leaks=1:symbolize=1 LSAN_OPTIONS=verbosity=0:log_threads=1 build/bin/wfmash data/scerevisiae8.fa.gz -i mappings.paf > aligned.paf
# Count lines in alignment file
ALIGN_LINES=$(wc -l < aligned.paf)
if [ $ALIGN_LINES -eq 0 ]; then
echo "ERROR: Alignment file is empty"
exit 1
else
echo "Found $ALIGN_LINES alignments"
fi
- name: Install Rust and Cargo
uses: actions-rs/toolchain@v1
with:
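
The input-mapping test step added in this workflow passes as long as `aligned.paf` contains at least one line. A minimal sketch of the same check as a reusable shell function follows. The name `check_nonempty_paf` is hypothetical and not part of the PR; unlike the workflow step, the sketch also reports a missing file and runs on throwaway files rather than real wfmash output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): fail if a PAF file is missing or has no lines.
check_nonempty_paf() {
    local paf="$1"
    if [ ! -f "$paf" ]; then
        echo "ERROR: $paf not found" >&2
        return 1
    fi
    local n
    n=$(wc -l < "$paf")
    n=$((n))   # normalize whitespace padding emitted by some wc implementations
    if [ "$n" -eq 0 ]; then
        echo "ERROR: Alignment file is empty" >&2
        return 1
    fi
    echo "Found $n alignments"
}

# Demonstration on temporary files; no wfmash binary is needed.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf 'q1\t100\t0\t100\t+\tt1\t200\t0\t100\t95\t100\t60\n' > "$tmpdir/aligned.paf"
: > "$tmpdir/empty.paf"
check_nonempty_paf "$tmpdir/aligned.paf"
check_nonempty_paf "$tmpdir/empty.paf" || echo "empty file rejected"
```

A line count only shows that some output was produced; it does not validate the alignments themselves.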
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -189,4 +189,4 @@ install(TARGETS wfa2_static
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})

file(MAKE_DIRECTORY ${CMAKE_SOURCE_DIR}/include)
- execute_process(COMMAND bash ${CMAKE_SOURCE_DIR}/scripts/generate_git_version.sh ${CMAKE_SOURCE_DIR}/src)
+ execute_process(COMMAND bash ${CMAKE_SOURCE_DIR}/scripts/generate_git_version.sh ${CMAKE_SOURCE_DIR}/src ${CMAKE_SOURCE_DIR}/src/common/wflign/src)
28 changes: 8 additions & 20 deletions README.md
@@ -6,29 +6,17 @@ _**a pangenome-scale aligner**_
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](https://anaconda.org/bioconda/wfmash)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6949373.svg)](https://doi.org/10.5281/zenodo.6949373)

- `wfmash` is an aligner for pangenomes based on sparse homology mapping and wavefront inception.
+ `wfmash` is an aligner for pangenomes that combines efficient homology mapping with base-level alignment. It uses MashMap 3.5 to find approximate mappings between sequences, then applies WFA (Wave Front Alignment) to obtain base-level alignments.

- `wfmash` uses a variant of [MashMap](https://github.com/marbl/MashMap) to find large-scale sequence homologies.
- It then obtains base-level alignments using [WFA](https://github.com/smarco/WFA2-lib), via the [`wflign`](https://github.com/waveygang/wfmash/tree/master/src/common/wflign) hierarchical wavefront alignment algorithm.
+ `wfmash` is designed to make whole genome alignment easy. On a modest compute node, whole genome alignments of gigabase-scale genomes should take minutes to hours, depending on sequence divergence. It can handle high sequence divergence, with average nucleotide identity between input sequences as low as 70%.

- `wfmash` is designed to make whole genome alignment easy. On a modest compute node, whole genome alignments of gigabase-scale genomes should take minutes to hours, depending on sequence divergence.
- It can handle high sequence divergence, with average nucleotide identity between input sequences as low as 70%.
+ `wfmash` is the key algorithm in [`pggb`](https://github.com/pangenome/pggb) (the PanGenome Graph Builder), where it is applied to make an all-to-all alignment of input genomes that defines the base structure of the pangenome graph. It can scale to support the all-to-all alignment of hundreds of human genomes.

- `wfmash` is the key algorithm in [`pggb`](https://github.com/pangenome/pggb) (the PanGenome Graph Builder), where it is applied to make an all-to-all alignment of input genomes that defines the base structure of the pangenome graph.
- It can scale to support the all-to-all alignment of hundreds of human genomes.
+ ## Process

- ## process
+ By default, `wfmash` breaks query sequences into non-overlapping segments (default: 1kb) and maps them using MashMap. Consecutive mappings separated by less than the chain gap (default: 2kb) are merged. Mappings are limited to 50kb in length by default, which allows efficient base-level alignment using WFA. This length limit is important because WFA's computational complexity is quadratic in the number of differences between sequences, not their percent divergence - meaning longer sequences with the same divergence percentage require dramatically more compute time.

- Each query sequence is broken into non-overlapping pieces defined by `-s[N], --segment-length=[N]`.
- These segments are then mapped using MashMap's mapping algorithm.
- Unlike MashMap, `wfmash` merges aggressively across large gaps, finding the best neighboring segment up to `-c[N], --chain-gap=[N]` base-pairs away.
-
- Each mapping location is then used as a target for alignment using the wavefront inception algorithm in `wflign`.
- The resulting alignments always contain extended CIGARs in the `cg:Z:*` tag.
- Approximate mappings can be obtained with `-m, --approx-map`.
-
- Sketching, mapping, and alignment are all run in parallel using a configurable number of threads.
- The number of threads must be set manually, using `-t`, and defaults to 1.
+ For longer sequences, use `-m/--approx-mapping` to get approximate mappings only, which allows working with much larger segment and mapping lengths.

## usage

@@ -85,10 +73,10 @@ Map a set of query sequences against a reference genome:
wfmash reference.fa query.fa >aln.paf
```

- Setting a longer segment length forces the alignments to be more collinear:
+ For mapping longer sequences without alignment, use -m with larger segment and max length values:

```sh
- wfmash -s 20k reference.fa query.fa >aln.paf
+ wfmash -m -s 50k -P 500k reference.fa query.fa >mappings.paf
```

Self-mapping of sequences:
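
Regarding the 50kb default mapping length described in the README's Process section above: the text attributes the limit to WFA's cost growing with the square of the number of differences. A rough sketch of that arithmetic follows. It assumes cost is simply proportional to differences squared and that differences equal length × (100 − identity%) / 100. Both are simplifications for illustration, not wfmash's actual cost model, and the function name `wfa_relative_cost` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: relative WFA cost under the assumption cost ~ (number of differences)^2.
# Arguments: mapping length in bp, percent identity (integer).
wfa_relative_cost() {
    local len="$1" identity="$2"
    local diffs=$(( len * (100 - identity) / 100 ))
    echo $(( diffs * diffs ))
}

short=$(wfa_relative_cost 50000 95)    # 50kb mapping at 95% identity
long=$(wfa_relative_cost 500000 95)    # 500kb mapping at the same identity
echo "50kb:  $short"
echo "500kb: $long"
echo "ratio: $(( long / short ))x"
```

Under this model, ten times the length at equal identity gives ten times the differences and about a hundred times the cost, which is the README's argument for restricting long mappings to `-m` mode.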
5 changes: 5 additions & 0 deletions scripts/generate_git_version.sh
@@ -1,9 +1,14 @@
INC_DIR=$1
WFLIGN_DIR=$2

# Go to the directory where the script is
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"

GIT_VERSION=$(git describe --always --tags --long)

# Write main wfmash version header
echo "#define WFMASH_GIT_VERSION" \"$GIT_VERSION\" > "$INC_DIR"/wfmash_git_version.hpp

# Write wflign version header
echo "#define WFLIGN_GIT_VERSION" \"$GIT_VERSION\" > "$WFLIGN_DIR"/wflign_git_version.hpp
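
The script now writes the same `git describe` string into two headers, one per directory argument. A minimal sketch of that step factored into a single function follows. The name `write_version_header` is hypothetical, and the version string is a placeholder passed in directly so the example runs outside a git checkout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): write '#define <MACRO> "<version>"' into <dir>/<file>.
write_version_header() {
    local macro="$1" version="$2" dir="$3" file="$4"
    mkdir -p "$dir"
    printf '#define %s "%s"\n' "$macro" "$version" > "$dir/$file"
}

outdir=$(mktemp -d)
trap 'rm -rf "$outdir"' EXIT
# Placeholder value; the real script uses: git describe --always --tags --long
version="v0.0.0-0-gabcdef0"
write_version_header WFMASH_GIT_VERSION "$version" "$outdir/src" wfmash_git_version.hpp
write_version_header WFLIGN_GIT_VERSION "$version" "$outdir/wflign" wflign_git_version.hpp
cat "$outdir/src/wfmash_git_version.hpp" "$outdir/wflign/wflign_git_version.hpp"
```

Using `printf` with a format string keeps the quoting in one place instead of relying on escaped quotes passed to `echo`.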
2 changes: 1 addition & 1 deletion src/align/include/computeAlignments.hpp
@@ -453,7 +453,7 @@ void write_sam_header(std::ofstream& outstream) {
});
}
}
- outstream << "@PG\tID:wfmash\tPN:wfmash\tVN:0.1\tCL:wfmash\n";
+ outstream << "@PG\tID:wfmash\tPN:wfmash\tVN:" << WFMASH_GIT_VERSION << "\tCL:wfmash\n";
}

void writer_thread(const std::string& output_file,
4 changes: 3 additions & 1 deletion src/common/wflign/src/wflign_patch.cpp
@@ -5,6 +5,7 @@
#include <atomic_image.hpp>
#include "rkmh.hpp"
#include "wflign_patch.hpp"
+ #include "wflign_git_version.hpp"

namespace wflign {

@@ -1939,7 +1940,8 @@ query_start : query_end)
out << "\t" << "cg:Z:" << cigarv << "\n";
#endif
} else {
- out << query_name // Query template NAME
+ out << "@PG\tID:wfmash\tPN:wfmash\tVN:" << WFLIGN_GIT_VERSION << "\tCL:wfmash\n"
+ << query_name // Query template NAME
<< "\t" << (query_is_rev ? "16" : "0") // bitwise FLAG
<< "\t" << target_name // Reference sequence NAME
<< "\t"
8 changes: 3 additions & 5 deletions src/interface/main.cpp
@@ -52,19 +52,18 @@ int main(int argc, char** argv) {
auto t0 = skch::Time::now();

if (map_parameters.use_spaced_seeds) {
- std::cerr << "[wfmash::map] Generating spaced seeds" << std::endl;
+ std::cerr << "[wfmash::mashmap] Generating spaced seeds..." << std::endl;
uint32_t seed_weight = map_parameters.spaced_seed_params.weight;
uint32_t seed_count = map_parameters.spaced_seed_params.seed_count;
float similarity = map_parameters.spaced_seed_params.similarity;
uint32_t region_length = map_parameters.spaced_seed_params.region_length;

ales::spaced_seeds sps = ales::generate_spaced_seeds(seed_weight, seed_count, similarity, region_length);
std::chrono::duration<double> time_spaced_seeds = skch::Time::now() - t0;
- std::cerr << "[wfmash::map] Time spent generating spaced seeds " << time_spaced_seeds.count() << " seconds" << std::endl;
map_parameters.spaced_seed_sensitivity = sps.sensitivity;
map_parameters.spaced_seeds = sps.seeds;
ales::printSpacedSeeds(map_parameters.spaced_seeds);
- std::cerr << "[wfmash::map] Spaced seed sensitivity " << sps.sensitivity << std::endl;
+ std::cerr << "[wfmash::mashmap] Generated spaced seeds in " << time_spaced_seeds.count() << "s (sensitivity: " << sps.sensitivity << ")" << std::endl;
}

//Map the sequences in query file
@@ -73,8 +72,7 @@ int main(int argc, char** argv) {
skch::Map mapper = skch::Map(map_parameters);

std::chrono::duration<double> timeMapQuery = skch::Time::now() - t0;
- std::cerr << "[wfmash::map] time spent mapping the query: " << timeMapQuery.count() << " sec" << std::endl;
- std::cerr << "[wfmash::map] mapping results saved in: " << map_parameters.outFileName << std::endl;
+ std::cerr << "[wfmash::mashmap] Mapped query in " << timeMapQuery.count() << "s, results saved to: " << map_parameters.outFileName << std::endl;

if (yeet_parameters.approx_mapping) {
return 0;