MMseqs2 Release 10-6d92c
martin-steinegger released this on 23 Aug, 12:05
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.
Known Issues
- High-sensitivity searches (higher than `-s 6`) with precomputed indices can fail. As a workaround, pass `--db-load-mode 3` to the MMseqs2 call.
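As a rough illustration of this workaround, the sketch below assembles a high-sensitivity search with `--db-load-mode 3`. The database names (`queryDB`, `targetDB`, `resultDB`), the `tmp` directory and the sensitivity value `7.5` are placeholders chosen for this example and do not come from the release notes. The snippet only prints the command, so it can be reviewed before it is run.

```shell
# Placeholder names (assumptions for this example); replace with your own databases.
QUERY_DB=queryDB
TARGET_DB=targetDB   # assumed to have been indexed beforehand with `mmseqs createindex`
RESULT_DB=resultDB
TMP_DIR=tmp

# A sensitivity above 6 is where the known issue appears;
# --db-load-mode 3 is the workaround named in the release notes.
SEARCH_CMD="mmseqs search $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR -s 7.5 --db-load-mode 3"

# Print the command for review; to execute it, run: eval "$SEARCH_CMD"
echo "$SEARCH_CMD"
```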
Breaking Changes
- The default taxonomy mode now assigns the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode can be used with `--lca-mode 3`, or the non-approximated 2bLCA with `--lca-mode 2`
- MMseqs2 will refuse to compile on compilers without OpenMP support (use `-DREQUIRE_OPENMP=0` to force a single-threaded build without OpenMP)
- The confusingly named (and probably non-functional) `--global-alignment` parameter is gone
- File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):
SIMD | Linux | macOS | Windows |
---|---|---|---|
SSE4.1 | mmseqs-linux-sse41.tar.gz | mmseqs-osx-sse41.tar.gz | mmseqs-win64.zip |
AVX2 | mmseqs-linux-avx2.tar.gz | mmseqs-osx-avx2.tar.gz | - |
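
For the taxonomy and OpenMP changes listed above, a minimal sketch of the two opt-outs might look as follows. The database names and the `tmp` directory are placeholders, and the positional argument order of `mmseqs taxonomy` is an assumption based on the usual query/target/result/tmp convention; check `mmseqs taxonomy -h` for the exact usage of your version. The `cmake` line assumes it is called from a build directory inside the source tree. Both commands are only printed here.

```shell
# Placeholder names (assumptions for this example); adjust to your setup.
QUERY_DB=queryDB
TARGET_DB=targetDB
TAXA_DB=taxaDB
TMP_DIR=tmp

# Restore the previous default, the approximate 2bLCA mode.
# Argument order is assumed, see the note above.
TAX_CMD="mmseqs taxonomy $QUERY_DB $TARGET_DB $TAXA_DB $TMP_DIR --lca-mode 3"

# Force a single-threaded build on a compiler without OpenMP support.
BUILD_CMD="cmake -DCMAKE_BUILD_TYPE=Release -DREQUIRE_OPENMP=0 .."

# Print both commands for review instead of executing them.
printf '%s\n' "$TAX_CMD" "$BUILD_CMD"
```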
Known Issues
- MMseqs2 on Windows does not seem to scale well across multiple threads
- MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)
Features
- `createindex` can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
- `easy-search`, `easy-taxonomy`, `easy-linclust` and `easy-cluster` workflows can take any number of query FASTA or FASTQ files
- MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
- `kmermatcher` reports the diagonal with the most k-mer matches
- `kmermatcher` scales the number of k-mers with sequence length (`--kmer-per-seq-scale`)
- `rescorediagonal` got two new rescore modes, one for global alignment scoring and one for scoring a quasi-global alignment fulfilling a local window criterion
- Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
- Parameters taking byte values support syntax with an SI suffix (e.g., `--split-memory-limit 64G`)
- Nucleotide substitution matrices are user-definable
- Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
- `cluster` workflow learned a reassignment mode, `--cluster-reassign`. This mode corrects errors that occurred because of cascaded clustering
- `extractorfs` can directly translate a nucleotide ORF to an amino acid sequence
- `result2stats` can write TSV files
- `createsubdb` supports softlinks instead of always hard copying the whole file to disk
- Reduced hard disk space usage for all cascaded clusterings
- `easy-taxonomy` reports the top hit alignment as a separate output file with the suffix `tophit_aln`
- The `createindex` checks for whether an index needs to be recomputed were improved
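
A few of the features above can be sketched as command lines. All file and database names (`sampleA.fasta`, `sampleB.fastq`, `target.fasta`, `result.m8`, `queryDB`, `targetDB`, `resultDB`, `seqDB`, `cluDB`, `tmp`) are placeholders invented for this example. Pairing `--split-memory-limit` with `search` is also an assumption of this sketch rather than something the release notes prescribe. The commands are only printed, so nothing runs until you execute them yourself.

```shell
# Placeholder temporary directory (assumption for this example).
TMP_DIR=tmp

# easy-search with two query files (one FASTA, one FASTQ), followed by
# the target, the output file and the temporary directory.
EASY_CMD="mmseqs easy-search sampleA.fasta sampleB.fastq target.fasta result.m8 $TMP_DIR"

# A byte-valued parameter written with an SI suffix.
SPLIT_CMD="mmseqs search queryDB targetDB resultDB $TMP_DIR --split-memory-limit 64G"

# Cascaded clustering with the new reassignment mode enabled.
CLUSTER_CMD="mmseqs cluster seqDB cluDB $TMP_DIR --cluster-reassign"

# Print the commands for review instead of executing them.
printf '%s\n' "$EASY_CMD" "$SPLIT_CMD" "$CLUSTER_CMD"
```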
Bug fixes
- MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
- `proteinaln2nucl` could return wrong coordinates
- `apply` would deadlock when running with multiple threads
- MPI searches are much more reliable. There were various issues around merging the separate results. The MPI split and merge logic is now also covered by the regression test suite
- `prefilter` splits nucleotide searches if not enough memory is available
- `kmermatcher` could corrupt memory
- `rescorediagonal` could produce wrong sequence identities when aligning mixed-case sequences
- macOS builds were not actually static (still dynamically link libsystem however)
- `lca` module could corrupt memory and crash
- `createdb` does not crash on systems with only 4GB of RAM anymore
- AVX2 and SSE4.1 builds could produce slightly different results
- `summarizeresults` does not crash on empty alignment results anymore
- Fixed wrong `tophit_report` in `easy-taxonomy`
- Precompiled Windows builds were broken
- Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer
Developers
- Tools using MMseqs2 as a framework no longer need to re-export MMseqs2 modules
- MMseqs2 uses Azure Pipelines for all platforms to run our regression test suite and provide precompiled binaries
- MMseqs2 runs under ASan without any issues. We fixed various small memory leaks
- The regression suite is directly linked through a submodule. It can be used by running:

      git submodule update --init
      ./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR