MMseqs2 Release 11-e1a1c
martin-steinegger released this 11 Feb 22:31
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new `databases` module helps download and set up databases. Chat support is now available at chat.mmseqs.com.
Known Issues
- `rbh` crashes due to invalid sorting mode (#290)
- Homebrew's macOS version does not use multiple cores (#289)
- `prefilter` results can be unstable between different runs for extremely redundant databases (#277)
- `linclust`/`cluster` can crash for very small input sets (#274)
Breaking Changes
- `kmermatcher` `--skip-n-repeat-kmer` parameter was replaced with `--ignore-multi-kmer`
  It no longer discards whole sequences if a k-mer occurred too often; instead it skips the specific k-mers.
  Either mode is only used in Plass and not in Linclust
- `--lca-ranks` from `(easy-)taxonomy` and `lca` has to be delimited with semicolons (`;`) instead of colons (`:`)
- `--dont-shuffle` flag was renamed to `--shuffle true/false`
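
For scripts written against earlier releases, the two renames above that have an obvious mapping could be applied mechanically. The following is a minimal sketch and not part of MMseqs2: `migrate_args` is a hypothetical helper, and it assumes that the old `--dont-shuffle` flag took no value and corresponds to `--shuffle false`. `--skip-n-repeat-kmer` is deliberately left untouched, because these notes do not say how its value maps onto `--ignore-multi-kmer`.

```python
def migrate_args(args):
    """Rewrite old-style MMseqs2 arguments (hypothetical helper, see assumptions above)."""
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--dont-shuffle":
            # Assumption: the old flag took no value and meant "do not shuffle".
            out += ["--shuffle", "false"]
        elif arg == "--lca-ranks" and i + 1 < len(args):
            # Ranks are now separated by ';' instead of ':'.
            out += [arg, args[i + 1].replace(":", ";")]
            i += 1  # the rank list has been consumed as well
        else:
            out.append(arg)
        i += 1
    return out


old = ["mmseqs", "taxonomy", "queryDB", "targetDB", "result", "tmp",
       "--lca-ranks", "genus:family:order"]
print(migrate_args(old))
```

When the migrated arguments are typed into a shell rather than passed as a list, the rank list has to be quoted (`--lca-ranks "genus;family;order"`), since an unquoted `;` ends the command.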
Features
- new `databases` workflow to list and download common databases.
  Supported databases:

    Name                  Type       Taxonomy  Url
  - UniRef100             Aminoacid  yes       https://www.uniprot.org/help/uniref
  - UniRef90              Aminoacid  yes       https://www.uniprot.org/help/uniref
  - UniRef50              Aminoacid  yes       https://www.uniprot.org/help/uniref
  - UniProtKB             Aminoacid  yes       https://www.uniprot.org/help/uniprotkb
  - UniProtKB/TrEMBL      Aminoacid  yes       https://www.uniprot.org/help/uniprotkb
  - UniProtKB/Swiss-Prot  Aminoacid  yes       https://uniprot.org
  - NR                    Aminoacid  -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
  - NT                    Nucleotide -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
  - PDB                   Aminoacid  -         https://www.rcsb.org
  - PDB70                 Profile    -         https://github.com/soedinglab/hh-suite
  - Pfam-A.full           Profile    -         https://pfam.xfam.org
  - Pfam-A.seed           Profile    -         https://pfam.xfam.org
  - eggNOG                Profile    -         http://eggnog5.embl.de
  - Resfinder             Nucleotide -         https://cge.cbs.dtu.dk/services/ResFinder
  - Kalamari              Nucleotide yes       https://github.com/lskatz/Kalamari

- `(easy-)search --slice-search` is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by `--disk-space-limit`
- `createdb` and the various `easy-` workflows learned to read query input from `STDIN`
- `taxonomyreport` learned to display the summarized taxonomy result with Krona
- new `filtertaxseqdb` module for filtering sequence DBs with taxonomy information according to provided taxa
- `--taxon-list` parameter understands expressions. E.g. get all bacterial and human sequences: `--taxon-list "2||9606"`
- `easy-search` and `convertalis` can now output taxonomic information using `--format-output`:
  taxid       Taxonomic identifier
  taxname     Taxon Name
  taxlineage  Taxonomic lineage
- speed up in `(easy-)cluster/linclust` by improving k-mer extraction
- MMseqs2 consistently creates .source and .lookup files to record which input file each sequence came from
  E.g. after `mmseqs createdb input1.fa input2.fa seqDB`, each sequence in seqDB can tell if it came from `input1.fa` or `input2.fa`
- `createdb` learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
- `align` and `rescorediagonal` learned to align circular sequences
- `align` exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
- `reverseseq` learned to reverse profiles
- `filterdb` can filter rows with a value within a given percentage of the first row
- new `aggregatetax` module to assign a taxonomic label to contigs according to the fragments matched on the contig
- Adjusting `--max-seq-len` is not required anymore; MMseqs2 automatically increases the length now.
- MMseqs2 on Cygwin/Windows uses `nedmalloc` as its memory allocator now and does not massively slow down due to lock contention
- new `tar2db` module to efficiently transform the content of `tar` archives to MMseqs2 databases
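
The `.source` and `.lookup` files mentioned above can be joined to find the input file of every sequence. The sketch below is an illustration under assumptions that should be checked against files written by your MMseqs2 version: it assumes both files are tab-separated, that each `.lookup` line holds an internal key, the sequence identifier and a file number, and that each `.source` line holds a file number and a file name. The function names and the sample data are hypothetical.

```python
def parse_source(source_text):
    """Map file number -> input file name (assumed layout: number<TAB>name)."""
    files = {}
    for line in source_text.splitlines():
        if not line.strip():
            continue
        number, name = line.split("\t", 1)
        files[int(number)] = name
    return files


def source_of_each_sequence(lookup_text, source_text):
    """Map sequence identifier -> input file name.

    Assumed .lookup layout: key<TAB>identifier<TAB>file number.
    """
    files = parse_source(source_text)
    origin = {}
    for line in lookup_text.splitlines():
        if not line.strip():
            continue
        _key, identifier, file_number = line.split("\t")
        origin[identifier] = files[int(file_number)]
    return origin


# Hypothetical content, shaped like the createdb example above.
source_text = "0\tinput1.fa\n1\tinput2.fa\n"
lookup_text = "0\tseqA\t0\n1\tseqB\t0\n2\tseqC\t1\n"

for identifier, name in source_of_each_sequence(lookup_text, source_text).items():
    print(identifier, name)
```

For real databases, the two strings would be read from the `.lookup` and `.source` files that sit next to the sequence database.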
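
The `--taxon-list "2||9606"` example above can be read as "keep a sequence if its lineage contains taxon 2 (Bacteria) or taxon 9606 (Homo sapiens)". The sketch below illustrates that reading only. It is not MMseqs2's implementation, it handles only the `||` operator shown in these notes, and the `PARENT` map is a toy taxonomy with intermediate ranks collapsed for brevity.

```python
# Toy child -> parent map (assumption: heavily collapsed NCBI-style lineage).
PARENT = {
    562: 2,          # Escherichia coli -> Bacteria
    9606: 2759,      # Homo sapiens -> Eukaryota
    4932: 2759,      # Saccharomyces cerevisiae -> Eukaryota
    2: 131567,       # Bacteria -> cellular organisms
    2759: 131567,    # Eukaryota -> cellular organisms
    131567: 1,       # cellular organisms -> root
}


def lineage(taxid, parent):
    """Return the taxon itself plus all of its ancestors."""
    seen = {taxid}
    while taxid in parent:
        taxid = parent[taxid]
        if taxid in seen:  # guard against cycles in malformed input
            break
        seen.add(taxid)
    return seen


def matches(expression, taxid, parent):
    """True if any taxon listed in an '||'-separated expression is in the lineage."""
    wanted = {int(term) for term in expression.split("||")}
    return bool(wanted & lineage(taxid, parent))


for taxid in (562, 9606, 4932):
    print(taxid, matches("2||9606", taxid, PARENT))
```

With the toy map, the E. coli and human entries match the expression and the yeast entry does not, because its lineage contains neither taxon 2 nor taxon 9606.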
Bug fixes
- `createindex` would create corrupted indices for profile target databases
- `rbh` workflow would create its result DB at an unexpected (wrong) location
- `(easy)-taxonomy --lca-mode 3` (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
- `lca` (and `(easy)-taxonomy`) add empty columns for unclassified sequences to be valid TSVs
- `kmermatcher` uses xxhash for hashing now (faster)
- `kmermatcher` avoids a crash if the machine does not have enough memory to process the data at once (affects linclust/cluster)
- `kmermatcher` correctly deals with sequences longer than MAX_SHRT now
- `kmermatcher` fixed various edge cases (e.g. alignment of 1-char sequences)
- `kmermatcher` hash-shift would be ignored
- `offsetalignment` could produce wrong results in the minus-strand
- `clust` now correctly and consistently handles alignment DB input
- `clusthash` better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
- `(easy-)cluster` `--single-step-clustering` could cluster unrelated sequences due to hash collisions
- `prefilter --diag-score 0` respects `--min-ungapped-score`
- `createseqfiledb` could print empty sequence lines
- `taxonomyreport` could crash if no sequence was unclassified
- `result2flat` could crash with long sequence input
- `result2msa, result2profile, msa2profile` backport filtering fix from HHblits
- `align` could produce bad alignments if all sequence lengths in the query DB were a lot shorter than in the target DB
- `splitsequence` fix issues with splitsequence if combined with compressed
- `result2profile` fix Filter2 bug of HH-suite in MMseqs2
- `apply` would crash due to reading wrong entry lengths
- `filterdb --filter-expression` was not thread safe and could corrupt results
- `filterdb` `--extract-lines` and `--trim-to-one-column` are compatible with each other
Developers
- Internal representation of sequences changed from 4 bytes per character to 1 byte per character
- Compilation under AppleClang + libomp works now (see `util/build_osx.sh`)
- Tools inheriting from MMseqs2 can now add their own citations
- MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed `symlinkat` call; relevant for bioconda)