A rust FFI library for minimap2. In development! Feedback appreciated!
minimap2-sys is the raw FFI bindings to minimap2. minimap2 is the more opinionated, rusty version.
minimap2 = "0.1.22+minimap2.2.28"
Also see Features
Tested with rustc 1.82.0 and nightly. So probably a good idea to upgrade before running. But let me know if you run into pain points with older versions and will try to fix.
minimap2-rs | minimap2 |
---|---|
0.1.22 | 2.28 |
0.1.21 | 2.28 |
0.1.20 | 2.28 |
0.1.19 | 2.28 |
0.1.18 | 2.28 |
0.1.17 | 2.27 |
0.1.16 | 2.26 |
Create an Aligner
let mut aligner = Aligner::builder()
.map_ont()
.with_threads(8)
.with_cigar()
.with_index("ReferenceFile.fasta", None)
.expect("Unable to build index");
Align a sequence:
let seq: Vec<u8> = b"ACTGACTCACATCGACTACGACTACTAGACACTAGACTATCGACTACTGACATCGA";
let alignment = aligner
.map(&seq, false, false, None, None, Some(b"My Sequence Name"))
.expect("Unable to align");
All minimap2 presets should be available (see functions section):
let aligner = map_ont();
let aligner = asm20();
Note Each preset overwrites different arguments. Using multiple at a time is not technically supported, but will work. Results unknown. So be careful! It's equivalent to running minimap2 -x map_ont -x short ...
MapOpts and IdxOpts can be customized with Rust's struct pattern, as well as applying mapping settings. Inspired by bevy.
let mut aligner: Aligner<PresetSet> = Aligner::builder().map_ont();
aligner.mapopt.seed = 42;
aligner.mapopt.best_n = 1;
aligner.idxopt.k = 21;
self.mapopt.flag |= MM_F_COPY_COMMENT as i64; // Setting a flag. If you do this frequently, open an [issue](https://github.com/jguhlin/minimap2-rs/issues/new) asking for an ergonomic function!
self.idxopt.flag |= MM_I_HPC as i32;
See full list of options below.
There are two working examples directly in this repo. In both instances below, 64 is the number of threads to allocate.
Channel-based multi-threading
cargo run --example channels -- reference_genome.fasta reads.fasta 64
Rayon-based multi-threading
cargo run --example rayon -- reference_genome.fasta reads.fasta 64
Depending on your needs you can probably do just fine with Rayon. But for very large implementations, interactivity, or limited memory, using channels may be the way to go.
There is a binary called "fakeminimap2" which demonstrates basic usage and multithreading using channels or rayon. You can find it in this repo for an example. It it much more fully featured example, with an output interface, some mouse support, and interaction.
Alignment functions return a Mapping struct. The Alignment struct is only returned when the Aligner is created using .with_cigar().
A very simple example would be:
let mut file = std::fs::File::open(query_file);
let mut reader = BufReader::new(reader);
let mut fasta = Fasta::from_buffer(&mut reader)
for seq in reader {
let seq = seq.unwrap();
let alignment: Vec<Mapping> = aligner
.map(&seq.sequence.unwrap(), false, false, None, None, None)
.expect("Unable to align");
println!("{:?}", alignment);
}
There is a map_file function that works on an entire file, but it is not-lazy and thus not suitable for large files. It may be removed in the future or moved to a separate lib.
let mappings: Result<Vec<Mapping>> = aligner.map_file("query.fa", false, false);
Multithreading is supported, for implementation example see fakeminimap2. Minimap2 also supports threading itself, and will use a minimum of 3 cores for building the index. Multithreading for mapping is left to the end-user.
Adjust the number of threads used to build the index:
let mut aligner = Aligner::builder()
.map_ont()
.with_index_threads(8);
This appears to work. See fakeminimap2 for full implementation.
use rayon::prelude::*;
let results = sequences.par_iter().map(|seq| {
aligner.map(seq.as_bytes(), false, false, None, None, None).unwrap()
}).collect::<Vec<_>>();
Also works. Otherwise directly cloning the aligner will Arc clone the internal index.
The following crate features are available:
- map-file - Enables the ability to map a file directly to a reference. Enabled by deafult
- htslib - Provides an interface to minimap2 that returns rust_htslib::Records
- simde - Enables SIMD Everywhere library in minimap2
- zlib-ng - Enables the use of zlib-ng for faster compression
- curl - Enables curl for htslib
- static - Builds minimap2 as a static library
- sse2only - Builds minimap2 with only SSE2 support
Map-file is a default feature and enabled unless otherwise specified.
Create an issue if you need any of the following:
- setting mismatch penalty for base transitions minimap 2.27 release notes
- Generate ds tags to indicate uncertainty in indels
Potentially others. Please create an issue!
Follow these instructions.
In brief, using bash shell:
docker pull messense/rust-musl-cross:x86_64-musl
alias rust-musl-builder='docker run --rm -it -v "$(pwd)":/home/rust/src messense/rust-musl-cross:x86_64-musl'
rust-musl-builder cargo build --release
Minimap2 is tested on x86_64 and aarch64 (arm64). Other platforms may work, please open an issue if minimap2 compiles but minimap2-rs does not.
htslib
- Successsimde
- Success
- Chopper - Long read trimming and filtering
- mappy-rs - Drop-in multi-threaded replacement for python's mappy
- HiFiHLA - HLA star-calling tool for PacBio HiFi data
- STRdust - Tandem repeat genotyper for long reads
- oarfish - transcript quantification from long-read RNA-seq data
- lrge - Long Read-based Genome size Estimation from overlaps
- Iterator interface for map_file
- -sys Possible to decouple from pthread?
Please cite the appropriate minimap2 papers if you use this in your work, as well as this library.
... coming soon ...
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]
and/or:
Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. [doi:10.1093/bioinformatics/btab705][doi2]
See customization for how to use these.
Field Name | Type | Description |
---|---|---|
flag |
i64 |
Flags to control mapping behavior (bitwise flags). |
seed |
c_int |
Random seed for mapping. |
sdust_thres |
c_int |
Threshold for masking low-complexity regions using SDUST. |
max_qlen |
c_int |
Maximum query length. |
bw |
c_int |
Bandwidth for alignment of short reads. |
bw_long |
c_int |
Bandwidth for alignment of long reads. |
max_gap |
c_int |
Maximum gap allowed in mapping. |
max_gap_ref |
c_int |
Maximum gap allowed on the reference. |
max_frag_len |
c_int |
Maximum fragment length for paired-end reads. |
max_chain_skip |
c_int |
Maximum number of seeds to skip in chaining. |
max_chain_iter |
c_int |
Maximum number of chaining iterations. |
min_cnt |
c_int |
Minimum number of seeds required for a chain. |
min_chain_score |
c_int |
Minimum score for a chain to be considered. |
chain_gap_scale |
f32 |
Scaling factor for chain gap penalty. |
chain_skip_scale |
f32 |
Scaling factor for chain skipping. |
rmq_size_cap |
c_int |
Size cap for RMQ (Range Minimum Query). |
rmq_inner_dist |
c_int |
Inner distance for RMQ rescue. |
rmq_rescue_size |
c_int |
Size threshold for RMQ rescue. |
rmq_rescue_ratio |
f32 |
Rescue ratio for RMQ. |
mask_level |
f32 |
Level at which to mask repetitive seeds. |
mask_len |
c_int |
Length of sequences to mask. |
pri_ratio |
f32 |
Ratio threshold for primary alignment selection. |
best_n |
c_int |
Maximum number of best alignments to retain. |
alt_drop |
f32 |
Score drop ratio for alternative mappings. |
a |
c_int |
Match score. |
b |
c_int |
Mismatch penalty. |
q |
c_int |
Gap open penalty. |
e |
c_int |
Gap extension penalty. |
q2 |
c_int |
Gap open penalty for long gaps. |
e2 |
c_int |
Gap extension penalty for long gaps. |
transition |
c_int |
Penalty for transitions in spliced alignment. |
sc_ambi |
c_int |
Score for ambiguous bases. |
noncan |
c_int |
Allow non-canonical splicing (boolean flag). |
junc_bonus |
c_int |
Bonus score for junctions. |
zdrop |
c_int |
Z-drop score for alignment extension stopping. |
zdrop_inv |
c_int |
Inverse Z-drop score. |
end_bonus |
c_int |
Bonus score for aligning to the ends of sequences. |
min_dp_max |
c_int |
Minimum score to consider a DP alignment valid. |
min_ksw_len |
c_int |
Minimum length for performing Smith-Waterman alignment. |
anchor_ext_len |
c_int |
Length for anchor extension. |
anchor_ext_shift |
c_int |
Shift for anchor extension. |
max_clip_ratio |
f32 |
Maximum allowed clipping ratio. |
rank_min_len |
c_int |
Minimum length for rank filtering. |
rank_frac |
f32 |
Fraction for rank filtering. |
pe_ori |
c_int |
Expected orientation of paired-end reads. |
pe_bonus |
c_int |
Bonus score for proper paired-end alignment. |
mid_occ_frac |
f32 |
Fraction for mid-occurrence filtering. |
q_occ_frac |
f32 |
Fraction for query occurrence filtering. |
min_mid_occ |
i32 |
Minimum mid-occurrence threshold. |
max_mid_occ |
i32 |
Maximum mid-occurrence threshold. |
mid_occ |
i32 |
Mid-occurrence cutoff value. |
max_occ |
i32 |
Maximum occurrence cutoff value. |
max_max_occ |
i32 |
Maximum allowed occurrence value. |
occ_dist |
i32 |
Distribution of occurrences for filtering. |
mini_batch_size |
i64 |
Size of mini-batches for processing. |
max_sw_mat |
i64 |
Maximum size of Smith-Waterman matrices. |
cap_kalloc |
i64 |
Memory allocation cap for kalloc. |
split_prefix |
*const c_char |
Prefix for splitting output files. |
Flag Constant | Value | Description |
---|---|---|
MM_F_NO_DIAG |
1 |
Skip seed pairs on the same diagonal. |
MM_F_NO_DUAL |
2 |
Do not compute reverse complement of seeds. |
MM_F_CIGAR |
4 |
Compute CIGAR string. |
MM_F_OUT_SAM |
8 |
Output alignments in SAM format. |
MM_F_NO_QUAL |
16 |
Do not output base quality in SAM. |
MM_F_OUT_CG |
32 |
Output CIGAR in CG format (Compact CIGAR). |
MM_F_OUT_CS |
64 |
Output cs tag (difference string) in SAM/PAF. |
MM_F_SPLICE |
128 |
Enable spliced alignment (for RNA-seq). |
MM_F_SPLICE_FOR |
256 |
Only consider the forward strand for spliced alignment. |
MM_F_SPLICE_REV |
512 |
Only consider the reverse strand for spliced alignment. |
MM_F_NO_LJOIN |
1024 |
Disable long join for gapped alignment. |
MM_F_OUT_CS_LONG |
2048 |
Output cs tag in long format. |
MM_F_SR |
4096 |
Perform split read alignment (for short reads). |
MM_F_FRAG_MODE |
8192 |
Fragment mode for paired-end reads. |
MM_F_NO_PRINT_2ND |
16384 |
Do not output secondary alignments. |
MM_F_2_IO_THREADS |
32768 |
Use two I/O threads during mapping. |
MM_F_LONG_CIGAR |
65536 |
Use long CIGAR (>65535 operations). |
MM_F_INDEPEND_SEG |
131072 |
Map segments independently in multiple mapping. |
MM_F_SPLICE_FLANK |
262144 |
Add flanking bases for spliced alignment. |
MM_F_SOFTCLIP |
524288 |
Perform soft clipping at ends. |
MM_F_FOR_ONLY |
1048576 |
Only map the forward strand of the query. |
MM_F_REV_ONLY |
2097152 |
Only map the reverse complement of the query. |
MM_F_HEAP_SORT |
4194304 |
Use heap sort for mapping. |
MM_F_ALL_CHAINS |
8388608 |
Output all chains (may include suboptimal chains). |
MM_F_OUT_MD |
16777216 |
Output MD tag in SAM. |
MM_F_COPY_COMMENT |
33554432 |
Copy comment from FASTA/Q to SAM output. |
MM_F_EQX |
67108864 |
Use =/X instead of M in CIGAR. |
MM_F_PAF_NO_HIT |
134217728 |
Output unmapped reads in PAF format. |
MM_F_NO_END_FLT |
268435456 |
Disable end flanking region filtering. |
MM_F_HARD_MLEVEL |
536870912 |
Hard mask low-complexity regions. |
MM_F_SAM_HIT_ONLY |
1073741824 |
Output only alignments in SAM (no headers). |
MM_F_RMQ |
2147483648 |
Use RMQ for read mapping quality estimation. |
MM_F_QSTRAND |
4294967296 |
Consider query strand in mapping. |
MM_F_NO_INV |
8589934592 |
Disable inversion in alignment. |
MM_F_NO_HASH_NAME |
17179869184 |
Do not hash read names (for reproducibility). |
MM_F_SPLICE_OLD |
34359738368 |
Use old splice alignment model. |
MM_F_SECONDARY_SEQ |
68719476736 |
Output sequence of secondary alignments. |
MM_F_OUT_DS |
137438953472 |
Output detailed alignment score (ds tag). |
Field Name | Type | Description |
---|---|---|
k |
c_short |
K-mer size (mer length). |
w |
c_short |
Minimizer window size. |
flag |
c_short |
Flags to control indexing behavior (bitwise flags). |
bucket_bits |
c_short |
Number of bits for the size of hash table buckets. |
mini_batch_size |
i64 |
Size of mini-batches for indexing (number of bases). |
batch_size |
u64 |
Total batch size for indexing (number of bases). |
Flag Constant | Value | Description |
---|---|---|
MM_I_HPC |
1 |
Use homopolymer-compressed k-mers for indexing. |
MM_I_NO_SEQ |
2 |
Do not store sequences in the index. |
MM_I_NO_NAME |
4 |
Do not store sequence names in the index. |
- Fixed a memory segfault when re-using a thread local buffer. Not sure why it occurs, but this fix seems to solve it.
Contributors to this release: @mbhall88 @rob-p @Sam-Sims @charlesgregory @PB-DB
- Map now returns Arc String's to reduce memory allocation for large and/or repetitive jobs
- map now takes an additional argument, query_name: Option<&[u8]>, possibly solves #75 (@rob-p @mbhall88 @jguhlin)
- Arc the Index, to prevent double-frees, solves #71
- Map file now passes in query name, which should help with #75
- Supplementary flag now better detected (@rob-p)
- FIX: Cigar string missing softclip operation (@Sam-Sims)
- Make all setting changes before calling a with_index, with_seq's function
- Change all map calls to include a query name, or None if not needed
- Add ergonomic functions n_seq and get_seq.
- Better docs on applying presets, solves #84
- Better detection of target arch c_char's and ptr's, solves #82
- Support for M1 Mac compilation and addition of github workflows to test it, solving #81
- Rayon test, so some support, closes #5
- Static str's and now static CStr's
- FIX: memory leak due to sequences allocated by minimap2 not being freed @charlesgregory
- Add Send + Sync to Aligner, along with unit test @PB-DB
- Experimental Android support (tested on aarch64 and x86_64), solves #66
- Added flag and option documents
- Added with_gap_open penalty ergonomic function
- Fix htslib errors. No update to -sys crate needed.
- Fix memory leak by @charlesgregory
- Update to minimap2 v2.28 @jguhlin
- Support for lrhqae preset @jguhlin
- Mark bam::Record objects as supplementary. #52 @PB-DB
- Only use rust-htslib/curl when curl feature is enabled. #53 @PB-DB
- Update to minimap2 v2.27 @jguhlin
- Switch to needletail for reading fast files (features map-file) @jguhlin
- Convert functions to take slices of vectors instead of refs to vecs
&[Vec<u8>]
instead of&Vec<Vec<u8>>
@jguhlin - breaking Curl is no longer a default option for htslib, please re-enable it as needed with cargo.toml features
- breaking Now using needletail for map-files, enabled by default. However, compression algorithms are disabled. Please enable with cargo.toml features
- Experimental rayon support
- aligner.with_cigar_clipping() to add soft clipping to the CIGAR vec (with_cigar() still adds to only the string, following the minimap2 outputs for PAF)
- breaking .with_threads() is now .with_index_threads() to make it more clear
- Much better cross compilation support thanks to @Adoni5
- Compilation on aarch64 thanks to @leiste375
- README corrections thanks to @wdecoster
- Better support for static builds / linking
- Update fffx to a version that uses bytelines without tokio. Drastically reduces compile times and dependency tree.
- Memory leak fixed by @Adoni5
- Updated deps
- Add with_seq to support indexing a single sequence (as per mappy: https://github.com/lh3/minimap2/blob/master/python/mappy.pyx#L115)
- minimap2-rs: update rust-htslib deps
- simdutf8 now optional dependency requiring map-file feature to be enabled
- Support soft-clipping string in CIGAR. WARNING: Does not support hard clipping. Please open an issue if you need this.
- Update minimap to 2.26
- Not convinced SSE41/SSE2 are working properly. Recommend simde.
- HTS lib: add support for optional quality scores by @eharr
- HTS lib support by @eharr
- HTS lib: Output sam/bam files by @eharr
- More tests by @eharr
- Display impl for Strand thanks to @ahcm
- Update minimap2-sys to latest version by @jguhlin
- -sys crate mm2fast added as additional backend by @jguhlin
- zlib dep changes by @jguhlin (hopefully now it is more portable and robust)
- -sys crate now supports SIMDe
- Thanks for @Adoni5 for switching to builder pattern, and @eharr for adding additional fields to alignment.
- Do not require libclang for normal compilation.
- Multithreading support (use less raw pointers, and treat more like rust Struct's)
- use libc instead of std:ffi::c_int as well
- Support slightly older versions of rustc by using libc:: rather than std::ffi for c_char (Thanks dwpeng!)
- Use fffx module for fasta/q parsing