MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (`--prefilter-mode 1`). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
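As a rough illustration of how the new mode might be invoked, the sketch below assembles an `mmseqs easy-search` command line with `--prefilter-mode 1`. The helper name `single_query_search_cmd` and the file names (`query.fasta`, `targetDB`, `result.m8`, `tmp`) are hypothetical placeholders, and the command is only printed here, not executed.

```python
def single_query_search_cmd(query, target, result, tmp_dir):
    """Build an `mmseqs easy-search` call that selects the ungapped prefilter.

    All four paths are placeholders chosen for this example.
    """
    return [
        "mmseqs", "easy-search",
        query, target, result, tmp_dir,
        "--prefilter-mode", "1",  # ungapped prefilter, as named in the release note
    ]


cmd = single_query_search_cmd("query.fasta", "targetDB", "result.m8", "tmp")
print(" ".join(cmd))
# prints: mmseqs easy-search query.fasta targetDB result.m8 tmp --prefilter-mode 1
```

With MMseqs2 installed and real input files in place, the same list could be passed to `subprocess.run(cmd, check=True)`.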
Breaking
- Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink
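To make the idea of representatives that respect identity and coverage criteria concrete, here is a minimal sketch of a generic greedy incremental clustering. It is not the updated MMseqs2 algorithm from commit 2568829; the function name, the input layout, and the thresholds are assumptions made for illustration.

```python
def greedy_cluster(lengths, hits, min_seq_id=0.9, min_cov=0.8):
    """Generic greedy incremental clustering (illustrative only).

    lengths: dict mapping sequence id -> sequence length
    hits:    dict mapping (id_a, id_b) -> (sequence identity, coverage)
    Returns a dict mapping each representative to its members.
    """
    def linked(a, b):
        # A pair is linked only if BOTH criteria are satisfied.
        pair = hits.get((a, b)) or hits.get((b, a))
        return pair is not None and pair[0] >= min_seq_id and pair[1] >= min_cov

    clusters = {}
    # Visit longer sequences first so that they become representatives.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            if linked(rep, seq):
                clusters[rep].append(seq)
                break
        else:
            # No existing representative covers this sequence: start a new cluster.
            clusters[seq] = [seq]
    return clusters


lengths = {"A": 300, "B": 290, "C": 120, "D": 280}
hits = {
    ("A", "B"): (0.95, 0.90),  # passes both criteria
    ("A", "C"): (0.97, 0.40),  # fails the coverage criterion
    ("B", "D"): (0.92, 0.85),  # B is a member, not a representative
}
print(greedy_cluster(lengths, hits))
# prints: {'A': ['A', 'B'], 'D': ['D'], 'C': ['C']}
```

In this toy input, `D` is similar to `B` but not to the representative `A`, so it starts its own cluster. Every member therefore satisfies both criteria against its own representative.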
New Features and Enhancements
- Implement additional `prefilter` modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
- Added `createclusearchdb` and `mkrepseqdb` modules to build cluster-search databases. This was implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
- Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
- Rework `ungappedprefilter` to improve performance and expose additional parameters such as taxon filtering and db-load-mode (8a89305, 800eb09, eb01b5b, 20d3afc)
- Added `gappedprefilter` module for Smith-Waterman prefiltering, similar to `ungappedprefilter` (df77d9e)
- Reworked `pairaln` for the ColabFold greedy taxonomy pairing mode (1514015)
- Implemented experimental module for A3M filtering (167bbd1, 499bb73)
- Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
- Precomputed indices without k-mers can be created with `--index-subset` (314c1f0, 8fe3bf9)
- Add `result2neff` module to extract Neff scores (4148e09) Thanks @neftlon
- Add `ppos` format-output to `convertalis` for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
- Speed-up FASTA parsing in `kseq.h` with memchr (98406dd) Thanks @valentynbez @kloetzl
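The `ppos` column is described above as a count of positive substitution scores. The sketch below shows one plausible reading of that: counting aligned residue pairs whose substitution score is greater than zero. The score table holds toy values rather than a real matrix, the function name is hypothetical, and how MMseqs2 computes the value internally is not covered here.

```python
# Toy substitution scores for illustration only (hypothetical values,
# not a real matrix). Pairs are looked up in both orientations.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("K", "K"): 5,
    ("L", "I"): 2, ("K", "R"): 2,
    ("A", "W"): -3, ("K", "D"): -1,
}


def count_positive_pairs(query_aln, target_aln, scores=TOY_SCORES, gap="-"):
    """Count aligned residue pairs with a substitution score above zero.

    Assumptions: gap columns are skipped, and pairs missing from the
    table score 0 (so they are not counted).
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    count = 0
    for q, t in zip(query_aln, target_aln):
        if q == gap or t == gap:
            continue
        score = scores.get((q, t), scores.get((t, q), 0))
        if score > 0:
            count += 1
    return count


print(count_positive_pairs("ALK-A", "AIRWW"))
# prints: 3  (A/A, L/I and K/R are positive; the gap column and A/W are not)
```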
Bugfixes
- Add min and max modes for `result2stats` (19dce03, 61e7734) Thanks @ClovisG
- Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
- Fix crash when some input file sizes are an exact multiple of 4096 in `convertalis` and `gff2db` (712f288) Thanks @RuoshiZhang
- Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
- Fix source number being limited to 16-bit (65k) (1d62fa0)
- `kseq` now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
- Fixed `unpackdb` to work without a `.lookup` file and added support for writing compressed files (92d8cc3, 570e3ed)
- `createindex --check-compatible` now checks the k-mer threshold correctly (bb0a1b3)
- Fixed excessively long result lists in `prefilter` leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
- Corrected handling of multiline checks in `createdb` (6b93884)
- Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
- Fixed logic in reciprocal-best-hit by removing `resAB_sort` (3bcbdba) Thanks @StephanieSKim
- Corrected handling of differently ordered parts of sequence databases in `concatdbs` (ea17d30)
- Fix `--single-step-clustering` misspelled in cluster warning (fa6c093) Thanks @valentynbez
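Two of the fixes above mention numeric limits: 16-bit (65k) and 2^31 bytes. The short arithmetic sketch below spells out those bounds. It assumes the limits came from an unsigned 16-bit and a signed 32-bit integer, which the notes do not state explicitly; the helper `fits` is hypothetical.

```python
UINT16_MAX = 2**16 - 1  # 65535, the "65k" ceiling of an unsigned 16-bit value
INT32_MAX = 2**31 - 1   # 2147483647, the largest signed 32-bit value


def fits(value, max_value):
    """Return True if a non-negative value fits below the given ceiling."""
    return 0 <= value <= max_value


print(UINT16_MAX, fits(70_000, UINT16_MAX))  # prints: 65535 False
print(INT32_MAX, fits(2**31, INT32_MAX))     # prints: 2147483647 False
print(70_000 % 2**16)                        # prints: 4464 (16-bit wrap-around)
```

The last line shows why such a limit can go unnoticed: a count of 70,000 stored in 16 bits wraps around to 4,464 instead of producing an error.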
Build and Compatibility Updates
- Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
- Added Mac ARM64 support in GitHub Actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
- Updated regression testing to fix errors in MPI test (2113766)