MMseqs2 Release 8-fac81

@martin-steinegger martin-steinegger released this 01 Apr 01:18
fac81fa

At a glance: Faster searches and clustering through improved IO and better seeding. More search modes like tblastx, reciprocal best hit and linsearch. New output format SAM. Support for compressed databases to reduce hard disk and memory requirements.

Known Issues

  • Iterative search only works up to 2 iterations

Breaking Changes

  • MMseqs2 now saves a lot on IO by not merging result datafiles
    There is still a single .index file, but the corresponding data files are split into multiple parts (as many parts as threads were used to create them)
  • MMseqs2 now uses the VTML80 [1,2] substitution matrix to speed up the prefiltering (changeable with --seed-sub-mat); the final alignment is still computed with Blosum62 (still changeable with --sub-mat)
  • All databases now have a .dbtype file
  • MMseqs2 Docker image is now based on Debian instead of Alpine
  • Changed the Orf header format to be more space efficient. The new format is originIdentifier startPos(-/+)len flag
  • prefilter returns ungapped-alignment scores instead of e-values
  • createindex: the file extension is now .idx instead of the previous .[s]k[6,7] format
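
Scripts that read data files directly are affected by the split layout described in the first item of this list. The following is a minimal Python sketch of looking up one entry; it is not MMseqs2 code. It assumes, beyond what these notes state, that the parts are named <db>.0, <db>.1, and so on, that each .index line holds tab-separated key, offset and length, and that offsets count bytes across the parts concatenated in numeric order. The helper names are hypothetical.

```python
import glob
import os


def find_parts(db_path):
    """Return the data file parts in numeric order.

    Assumption: split parts are named <db>.0, <db>.1, ...
    """
    if os.path.exists(db_path):
        return [db_path]  # unsplit database: a single data file
    candidates = glob.glob(glob.escape(db_path) + ".*")
    parts = [p for p in candidates if p.rsplit(".", 1)[1].isdigit()]
    return sorted(parts, key=lambda p: int(p.rsplit(".", 1)[1]))


def read_index(index_path):
    """Parse the single .index file (assumed columns: key, offset, length)."""
    index = {}
    with open(index_path) as handle:
        for line in handle:
            key, offset, length = line.rstrip("\n").split("\t")[:3]
            index[key] = (int(offset), int(length))
    return index


def read_entry(db_path, key):
    """Read one entry, treating offsets as positions in the concatenated parts."""
    offset, length = read_index(db_path + ".index")[key]
    for part in find_parts(db_path):
        size = os.path.getsize(part)
        if offset < size:
            with open(part, "rb") as handle:
                handle.seek(offset)
                return handle.read(length)
        offset -= size  # the entry starts in a later part
    raise ValueError("offset lies beyond the last data file part")
```

If any of these assumptions does not hold for a given database, the MMseqs2 modules themselves remain the reliable way to read entries.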

Features

  • Support for tblastx-style nucl-nucl translated searches
    mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 2
  • Support for nucleotide searches
    mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 3
  • convertalis has learned to return SAM formatted output (preview)
  • Database can be compressed by applying zstd on each entry (--compressed 1)
    • Also added compress and decompress modules
  • rbh workflow for reciprocal best hit searches added
  • linclust can now cluster nucleotide sequences on both forward and reverse strand
  • Added linsearch, a lightning fast search for protein and nucleotide sequences (preview; easy workflow variant easy-linsearch also added)
  • createlinindex computes an index for linsearch
  • taxonomy uses --orf-start-mode 1 to annotate more sequences
  • Added approx. 2bLCA to speed up computation; this is now the default. The old mode can be turned on with --lca-mode 2
  • createdb recognizes sequences containing Uracil as DNA sequences
  • createdb is now faster through speeding up its shuffle operations
  • Added view module to view a single entry in an MMseqs2 database
  • align module has learned --min-aln-len parameter to filter by minimal alignment length
  • Alignment modules (rescorediagonal, align) can align longer sequences now (not limited to 2^15 length)
  • Input sequences can now be soft-masked (lowercase-letter masking) with --mask-lower-case instead of only hard-masked (replaced with X). The masking only applies to the prefilter stages kmermatcher or prefilter and can be combined with --mask
  • filterdb has learned --filter-expression parameter and mode that allows filtering by simple mathematical expressions
  • alignbykmer can be used for nucleotide searches
  • MMseqs2 did-you-mean functionality gives better suggestions
  • MMseqs2 does not repeat the whole parameter list for each submodule call anymore
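
The rbh workflow listed above reports reciprocal best hits. As a conceptual illustration of that definition only, the Python sketch below finds pairs in which each sequence is the other's top-scoring hit. It is not how MMseqs2 implements the workflow: the tuple layout, the function names and the tie rule (first hit seen wins) are assumptions made for this example.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    hits: iterable of (query, target, score) tuples; higher score is better.
    On ties, the hit seen first is kept.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy hit lists: a2's best hit is b2, but b2's best hit is a1, so only
# the pair (a1, b1) is reciprocal.
a_vs_b = [("a1", "b1", 250), ("a1", "b2", 90), ("a2", "b2", 120)]
b_vs_a = [("b1", "a1", 245), ("b2", "a1", 130), ("b2", "a2", 118)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```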

Bugs

  • Default parameters of map workflow are now set correctly
  • Some modules were using the wrong coverage parameter
  • Sliced profile search was losing high E-value hits
  • Sliced profile search is now stable
  • Profile-Sequence alignment E-values were slightly too high
  • result2msa was crashing with profiles on the target side
  • result2msa should not crash with --allow-deletion anymore
  • Some parameters were never visible (with or without -h)
  • Various issues with MPI were resolved

Developers

  • Continuous integration now enforces no compile warnings
  • Continuous integration now tries to produce AArch64 builds with Docker and Qemu
  • We added a first draft of our developer guide to the wiki

References

[1] Müller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7:761–76. doi: 10.1089/10665270050514918

[2] Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13. doi: 10.1093/oxfordjournals.molbev.a003985