MMseqs2 Release 15-6f452

@milot-mirdita released this 31 Oct 09:22

MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (--prefilter-mode 1). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
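
As a rough sketch of how the ungapped-prefiltering mode might be used for a single-query search: the file names below (query.fasta, targetDB, result.m8, tmp) are hypothetical placeholders, and passing --prefilter-mode 1 to the easy-search workflow is an assumption based on the flag named above. The script only prints the command unless mmseqs and the inputs are actually present.

```shell
#!/bin/sh
# Hypothetical inputs and outputs; replace with your own paths.
QUERY="query.fasta"      # single query sequence in FASTA format
TARGET_DB="targetDB"     # target database created beforehand with createdb
RESULT="result.m8"       # tabular search result
TMP_DIR="tmp"            # scratch directory for intermediate files

# 1 selects the ungapped prefilter, per the release description above.
PREFILTER_MODE=1

# Print the command first so it can be inspected before running.
PREVIEW="mmseqs easy-search $QUERY $TARGET_DB $RESULT $TMP_DIR --prefilter-mode $PREFILTER_MODE"
echo "$PREVIEW"

# Run only when mmseqs is installed and the inputs exist.
if command -v mmseqs >/dev/null 2>&1 && [ -f "$QUERY" ] && [ -e "$TARGET_DB" ]; then
    mmseqs easy-search "$QUERY" "$TARGET_DB" "$RESULT" "$TMP_DIR" \
        --prefilter-mode "$PREFILTER_MODE"
fi
```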

Breaking

  • Updated the greedy clustering algorithm. It now picks representatives that better respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink

New Features and Enhancements

  • Implement additional prefilter modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
  • Added createclusearchdb and mkrepseqdb modules to build cluster-search databases. These were implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
  • Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
  • Rework ungappedprefilter to improve performance and expose additional parameters such as taxon filtering and db-load-mode (8a89305, 800eb09, eb01b5b, 20d3afc)
  • Added gappedprefilter module for Smith-Waterman prefiltering, similar to ungappedprefilter (df77d9e)
  • Reworked pairaln for the ColabFold greedy taxonomy pairing mode (1514015)
  • Implemented experimental module for A3M filtering (167bbd1, 499bb73)
  • Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
  • Precomputed indices without k-mers can be created with --index-subset (314c1f0, 8fe3bf9)
  • Add result2neff module to extract Neff scores (4148e09) Thanks @neftlon
  • Add ppos format-output to convertalis for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
  • Speed-up FASTA parsing in kseq.h with memchr (98406dd) Thanks @valentynbez @kloetzl
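
The new ppos output column can be requested alongside existing columns through convertalis. The following is a minimal sketch: the database and file names are hypothetical, and the choice of the other columns (query, target, pident) is only for illustration. The command runs only if mmseqs and the input databases are available.

```shell
#!/bin/sh
# Hypothetical database and output names; replace with your own.
QUERY_DB="queryDB"
TARGET_DB="targetDB"
RESULT_DB="resultDB"     # alignment result database from a previous search
OUT="result.m8"

# Column list; ppos is the count of positive substitution scores.
FORMAT="query,target,pident,ppos"

# Print the command first so it can be inspected before running.
echo "mmseqs convertalis $QUERY_DB $TARGET_DB $RESULT_DB $OUT --format-output $FORMAT"

# Run only when mmseqs is installed and the input databases exist.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ] && [ -e "$TARGET_DB" ] && [ -e "$RESULT_DB" ]; then
    mmseqs convertalis "$QUERY_DB" "$TARGET_DB" "$RESULT_DB" "$OUT" \
        --format-output "$FORMAT"
fi
```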

Bugfixes

  • Add min and max modes for result2stats (19dce03, 61e7734) Thanks @ClovisG
  • Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
  • Fix crash when some input file sizes are an exact multiple of 4096 in convertalis and gff2db (712f288) Thanks @RuoshiZhang
  • Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
  • Fix source number being limited to 16-bit (65k) (1d62fa0)
  • kseq now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
  • Fixed unpackdb to work without a .lookup file and added support for writing compressed files (92d8cc3, 570e3ed)
  • createindex --check-compatible now checks the k-mer threshold correctly (bb0a1b3)
  • Fixed excessively long result lists in prefilter leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
  • Corrected handling of multiline checks in createdb (6b93884)
  • Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
  • Fixed logic in reciprocal-best-hit by removing resAB_sort (3bcbdba) Thanks @StephanieSKim
  • Corrected handling of differently ordered parts of sequence databases in concatdbs (ea17d30)
  • Fix --single-step-clustering misspelled in cluster warning (fa6c093) Thanks @valentynbez

Build and Compatibility Updates

  • Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
  • Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
  • Updated regression testing to fix errors in MPI test (2113766)

Developer

  • Introduced the base: prefix so that inheriting subprojects can find shadowed modules (e.g. Foldseek shadows createdb, but can use base:createdb to call the MMseqs2 version) (90aa913)
  • Exported build architecture in CMake so subprojects can use it (fce06b1)