Releases: marbl/MashMap
v3.1.3
Previously, the `-f one-to-one` plane-sweep filter was applied to all mappings at the same time. When users mapped multiple query genomes to one or more target sequences with the `--skipPrefix #` flag, the one-to-one filter treated all query sequences as part of the same genome/group.
This patch applies the one-to-one plane-sweep filter to each pair of query and reference groups independently, ensuring that `-n` mappings are retained for each pair. A "group" of sequences is the set of sequences that share the same prefix up to the last occurrence of the character `c`, where `--skipPrefix c` is specified.
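The grouping rule can be illustrated with a minimal Python sketch. This is not MashMap's code: the helper names `group_of` and `group_names` and the sample sequence names are hypothetical, and two details are assumptions, namely that the group key includes the delimiter itself and that a name without the delimiter forms its own group.

```python
from collections import defaultdict


def group_of(name: str, delim: str) -> str:
    """Return the prefix of `name` up to the last occurrence of `delim`.

    Assumptions (not stated in the release notes): the delimiter is kept
    as part of the key, and a name without the delimiter is its own group.
    """
    idx = name.rfind(delim)
    return name if idx == -1 else name[: idx + 1]


def group_names(names, delim):
    """Bucket sequence names by their prefix group, preserving input order."""
    groups = defaultdict(list)
    for name in names:
        groups[group_of(name, delim)].append(name)
    return dict(groups)


# Hypothetical sequence names in a "sample#haplotype#contig" style.
names = ["sampleA#1#chr1", "sampleA#1#chr2", "sampleB#1#chr1"]
groups = group_names(names, "#")
```

With these names, the two `sampleA#1#` contigs fall into one group and the `sampleB#1#` contig into another, so a per-group-pair filter would treat them as separate genomes.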
MashMap v3.1.2
- #64 - Fixes a bug where the `kc` tag was incorrect for chained mappings
MashMap v3.1.1
- In order to maintain a default build that requires the same libraries as previous versions, building with htslib is now optional.
- To build with htslib, call `cmake` with `-DUSE_HTSLIB=ON`
- htslib is useful for the `--targetList` and `--targetPrefix` options, which allow users to only index specific contigs from a reference fasta file.
MashMap v3.1.0
- When filtering matches, the "score" of a match no longer takes into account the length. Previously, the score for a mapping was `len*ANI`, meaning that a 1000bp mapping with 100% identity would be tossed out in favor of a 1112bp mapping with 90% identity.
- Fixes a rare bug that caused a crash when the very first minmer in the index is a hit.
- Fixes bug with `--kmerThreshold` CLI option which ignored users' argument in favor of `1`.
- Low complexity segments are tossed out before stage 1 mapping.
- Mappings use 32-bit integers to store positions now instead of 64-bit integers. If you need mashmap to work with contigs larger than 2^31, you can pass `-DLARGE_CONTIG=1` to CMake when building.
- Reads shorter than the block length are now split, instead of being aligned in one piece.
- Added `--targetPrefix` and `--targetList` CLI options, which allow the users to specify subsets of the reference file to be indexed. Requires htslib!
- Added `--lowerTriangular` CLI option which only computes mappings between sequence i and sequence j if i > j (meant to be used when reference and query files are identical).
- Limits the size of the DP filter so that large sketch sizes don't incur a huge setup time.
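The scoring change in the first item can be checked with a few lines of Python. The sketch below reproduces the `len*ANI` arithmetic from the example. The notes say only that length no longer enters the score, so ranking by ANI alone is an assumption made here for illustration, and the function and variable names are hypothetical.

```python
def legacy_score(length_bp: int, ani: float) -> float:
    """Old scoring rule from the release notes: length times ANI."""
    return length_bp * ani


def pick_best(mappings, score):
    """Return the mapping with the highest score under the given rule."""
    return max(mappings, key=score)


short_exact = {"name": "short_exact", "length": 1000, "ani": 1.00}
long_diverged = {"name": "long_diverged", "length": 1112, "ani": 0.90}
candidates = [short_exact, long_diverged]

# Old rule: 1112 * 0.90 is about 1000.8, which beats 1000 * 1.00 = 1000.
old_winner = pick_best(candidates, lambda m: legacy_score(m["length"], m["ani"]))

# Assumed stand-in for the new rule: rank by ANI only, ignoring length.
new_winner = pick_best(candidates, lambda m: m["ani"])
```

Under the old rule the longer 90% mapping wins by a margin of less than one base-pair-equivalent; once length is dropped, the exact match is preferred.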
MashMap v3.0.6
Changelog:
- Uses the chaining algorithm from `wfmash`
- `--splitPrefix` now performs the filtering on each prefix-group independently.
- Does not sketch or winnow k-mers w/ ambiguous nucleotides.
- Added `kc:f` tag for the estimated k-mer complexity, defined as the ratio of the estimated number of distinct k-mers in a segment to the total number of k-mers in a segment (this estimate can be greater than `1.0`).
- Added a flag `--kmerComplexity x` to filter out segments with estimated kmer complexity less than `x`.
- Mapping progress now updates for each segment mapped, as opposed to each contig mapped.
- Added `--reportPercentage` option to report ANI as a percentage instead of in `[0, 1]` range (necessary for use w/ wfmash)
- Fixes #54
- Does not split sequences smaller than the block length.
- Added address sanitizer for `Debug` build
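To make the k-mer complexity definition concrete, here is a minimal Python sketch that computes the ratio exactly by enumerating every k-mer in a segment. MashMap reports an estimate of this quantity, which is why the `kc` value can exceed 1.0; the exact version below never does. The function names and the threshold check are hypothetical and only mirror the behaviour described for `--kmerComplexity x`.

```python
def kmer_complexity(segment: str, k: int) -> float:
    """Exact ratio of distinct k-mers to total k-mers in `segment`."""
    total = len(segment) - k + 1
    if total <= 0:
        raise ValueError("segment must contain at least one k-mer")
    distinct = {segment[i : i + k] for i in range(total)}
    return len(distinct) / total


def passes_complexity_filter(segment: str, k: int, threshold: float) -> bool:
    """Keep a segment only if its complexity is at least `threshold`."""
    return kmer_complexity(segment, k) >= threshold


low = kmer_complexity("AAAAAAAA", 3)   # one distinct 3-mer out of six
mixed = kmer_complexity("ACGTACGT", 3)  # four distinct 3-mers out of six
```

A homopolymer run scores close to zero and would be dropped by any reasonable threshold, while a segment with no repeated k-mers scores exactly 1.0.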
MashMap v3.0.5
Changelog:
- Removed sanity-check filters that were actually dropping desired mappings
- Sort query minmers upon recruitment using the heap as opposed to sorting for every stage 1 hit
- Add -DPROFILE flag to compile w/ debug symbols and no inline (also removed inline keyword from some functions)
- Cast jaccard to float now that it is no longer multiplied by 100.0.
MashMap v3.0.4
- Add `--legacy` flag for MashMap2 style output
- Add `-v`/`--version` flag
- Output `id` and `jc` tags in [0,1] range instead of [0, 100]
- Improves stderr header output
Conda package (and MacOS binary) to be added once Bioconda CI is fixed
MashMap v3.0.2
- Clarified block-length help string
- Fixed bug for block-length filter
- Removed some optimization flags
MashMap v3.0.1
MashMap3 Changelog
- Instead of indexing the locations of minimizers, we index the windows in which a k-mer is one of the lowest `s` hashes in the window, where `s` is the sketch size. These k-mers are termed "minmers."
- The first-pass filtering stage computes the number of shared minmers for each candidate mapping in linear time. Regions with significantly high counts of shared minmers are passed on to stage 2.
- The second stage of filtering, where the minhash score of each mapping in the candidate region is calculated, uses a `std::vector` to keep track of the rolling minhash score as opposed to the `std::map` used in MashMap2. The details can be seen in `slidingMap.hpp`.
- While the mapping stage is faster, particularly for lower ANI cutoffs (90% and below), the indexing stage does require a bit more time than before. To avoid spending time recomputing the index, users can save the index via `--saveIndex PREFIX`, and then reuse it in a later run with `--loadIndex PREFIX`.
- The default parameter for the sketch size depends on the value of the minimum ANI threshold (`pi`) and the segment length (`L`). Decreasing the sketch size will decrease runtime in a linear fashion at the cost of increasing the variance in the ANI estimation error.
- Frequent seeds are filtered out based on how many minmer-intervals they have as opposed to how many times the kmer actually occurs in the reference. This adds some noise to frequent-kmer filtering, as it's possible for a less frequent kmer to have more intervals than a more frequent kmer.
- The binomial model is used to estimate ANI from Jaccard instead of the Poisson model.
- k-mer size is no longer limited to `<=16`, as the hash values are 64 bits instead of 32 bits. The default kmer size is now `19`.
- Numerous interface updates were copied over from `wfmash`, including a progress meter and usage of the samtools `.fai` index.
- The output of MashMap3 is now in PAF format, with `id` and `jc` tags which represent the estimated ANI and the estimated Jaccard similarity, respectively. The `jc` tag is only present for mappings where chaining is disabled.
- There is now an option for significantly denser sketching, `--dense`
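The minmer definition in the first item of this changelog can be illustrated with a brute-force Python sketch. It is not how MashMap computes or indexes minmers: the function names are hypothetical, BLAKE2b stands in for whatever 64-bit hash MashMap uses, and each window is simply sorted rather than maintained incrementally. For every window of `w` consecutive k-mers it keeps the `s` lowest distinct hash values.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is a stand-in choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_sketches(seq: str, k: int, w: int, s: int):
    """For each window of `w` consecutive k-mers, return the `s` lowest
    distinct k-mer hashes in ascending order (the window's minmers)."""
    hashes = [kmer_hash(seq[i : i + k]) for i in range(len(seq) - k + 1)]
    sketches = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start : start + w]
        sketches.append(sorted(set(window))[:s])
    return sketches


seq = "ACGTTGCATGCCATGA"
sketches = window_sketches(seq, k=3, w=5, s=2)
minimizers = window_sketches(seq, k=3, w=5, s=1)
```

Setting `s=1` recovers the ordinary minimizer of each window, which is the sense in which minmers generalize the minimizer scheme that earlier versions indexed.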
MashMap v2.0
Now generalized for computing approximate local alignments between long DNA sequences. This will be useful for fast genome to genome mapping or split-read mapping of long reads.