MMseqs2 Release 11-e1a1c · soedinglab/MMseqs2

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps download and set up databases. We now offer chat support at chat.mmseqs.com.

Known Issues

rbh crashes due to invalid sorting mode (#290)
Homebrew's macOS version does not use multiple cores (#289)
prefilter results can be unstable between different runs for extremely redundant databases (#277)
linclust/cluster can crash for very small input sets (#274)

Breaking Changes

kmermatcher --skip-n-repeat-kmer parameter was replaced with --ignore-multi-kmer
It no longer discards whole sequences if a k-mer occurred too often; instead it skips the specific k-mers.
Either mode is only used in Plass and not in Linclust
--lca-ranks in (easy-)taxonomy and lca must now be delimited with semicolons (;) instead of colons (:)
--dont-shuffle flag was renamed to --shuffle true/false
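
Scripts that still pass a colon-delimited --lca-ranks value need a one-time rewrite. The following is a minimal sketch of that rewrite in Python; the function name migrate_lca_ranks and the rank names in the example are illustrative assumptions and are not part of MMseqs2.

```python
def migrate_lca_ranks(old_value: str) -> str:
    """Convert a colon-delimited --lca-ranks value to the semicolon form.

    Values that already use semicolons contain no colon and pass through
    unchanged.
    """
    ranks = [rank.strip() for rank in old_value.split(":") if rank.strip()]
    return ";".join(ranks)


if __name__ == "__main__":
    # Hypothetical rank list, used only to show the delimiter change.
    print(migrate_lca_ranks("genus:family:order"))
```

Remember to quote the new value in a shell, since an unquoted semicolon would otherwise end the command.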

Features

new databases workflow to list and download common databases.
Supported databases:

  Name                	Type      	Taxonomy	Url
- UniRef100           	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef90            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef50            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniProtKB           	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL    	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot	Aminoacid 	     yes	https://uniprot.org
- NR                  	Aminoacid 	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                  	Nucleotide	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- PDB                 	Aminoacid 	       -	https://www.rcsb.org
- PDB70               	Profile   	       -	https://github.com/soedinglab/hh-suite
- Pfam-A.full         	Profile   	       -	https://pfam.xfam.org
- Pfam-A.seed         	Profile   	       -	https://pfam.xfam.org
- eggNOG              	Profile   	       -	http://eggnog5.embl.de
- Resfinder           	Nucleotide	       -	https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari            	Nucleotide	     yes	https://github.com/lskatz/Kalamari
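
As a rough illustration of how the table above might be used from a script, the sketch below validates a database name against it and assembles a command line. It assumes the invocation shape mmseqs databases <name> <outputDB> <tmpDir>, which these notes do not spell out, so check mmseqs databases -h before relying on it. The helper name databases_command and the output paths are hypothetical.

```python
import shlex
from typing import List

# Database names copied from the table in these release notes.
SUPPORTED_DATABASES = {
    "UniRef100", "UniRef90", "UniRef50", "UniProtKB", "UniProtKB/TrEMBL",
    "UniProtKB/Swiss-Prot", "NR", "NT", "PDB", "PDB70", "Pfam-A.full",
    "Pfam-A.seed", "eggNOG", "Resfinder", "Kalamari",
}


def databases_command(name: str, out_db: str, tmp_dir: str) -> List[str]:
    """Build (but do not run) an argv list for the databases workflow.

    Assumption: positional arguments are <name> <outputDB> <tmpDir>.
    """
    if name not in SUPPORTED_DATABASES:
        raise ValueError(f"unknown database name: {name!r}")
    return ["mmseqs", "databases", name, out_db, tmp_dir]


if __name__ == "__main__":
    # Hypothetical output database name and temporary directory.
    argv = databases_command("UniProtKB/Swiss-Prot", "swissprot", "tmp")
    print(shlex.join(argv))
```

Building an argv list rather than a shell string keeps database names containing a slash, such as UniProtKB/Swiss-Prot, safe to pass to subprocess.run without extra quoting.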

(easy-)search --slice-search is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by --disk-space-limit
createdb and the various easy- workflows learned to read query input from STDIN
taxonomyreport learned to display the summarized taxonomy result with Krona
new filtertaxseqdb module for filtering sequence DBs with taxonomy information according to provided taxa
--taxon-list parameter understands expressions. E.g. get all bacterial and human sequences --taxon-list "2||9606"
easy-search and convertalis can now output taxonomic information using --format-output

taxid      Taxonomic identifier
taxname    Taxon Name
taxlineage Taxonomic lineage
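
To make the --taxon-list expression "2||9606" mentioned above more concrete, here is a minimal sketch of how such an expression could be evaluated. It handles only the || (or) operator shown in these notes and assumes a sequence matches when any listed taxon appears in its lineage, that is, its own taxon or an ancestor. The function name and the abbreviated lineages are illustrative; this is not MMseqs2's actual implementation.

```python
from typing import Iterable


def matches_taxon_list(expression: str, lineage: Iterable[int]) -> bool:
    """Return True if any taxon ID in an '||'-joined expression is in lineage.

    `lineage` holds the taxon ID of a sequence plus the IDs of its ancestors.
    Only the '||' operator from the release notes is handled here.
    """
    wanted = {int(token) for token in expression.split("||") if token.strip()}
    return not wanted.isdisjoint(set(lineage))


if __name__ == "__main__":
    # Heavily abbreviated lineages, for illustration only.
    human = {1, 2759, 9606}   # root, Eukaryota, Homo sapiens
    e_coli = {1, 2, 562}      # root, Bacteria, Escherichia coli
    archaeon = {1, 2157}      # root, Archaea
    for label, lineage in [("human", human), ("E. coli", e_coli), ("archaeon", archaeon)]:
        print(label, matches_taxon_list("2||9606", lineage))
```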

speed up in (easy-)cluster/linclust by improving k-mer extraction
MMseqs2 consistently creates .source and .lookup files to track which input file each sequence came from
E.g.: after mmseqs createdb input1.fa input2.fa seqDB, each sequence in seqDB can be traced back to input1.fa or input2.fa
createdb learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
align and rescorediagonal learned to align circular sequences
align exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
reverseseq learned to reverse profiles
filterdb can filter rows with value within given percentage of first row
new aggregatetax module to assign a taxonomic label to contigs according to the fragments matched on the contig
Adjusting --max-seq-len is no longer required; MMseqs2 now increases the length automatically.
MMseqs2 on Cygwin/Windows now uses nedmalloc as its memory allocator and no longer slows down massively due to lock contention
new tar2db module to efficiently transform content of tar archives to MMseqs2 databases
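
The .source and .lookup files mentioned in the list above can be joined to find the input file of each sequence. The sketch below assumes a tab-separated layout that these notes do not specify: .lookup lines of the form key, accession, file number, and .source lines of the form file number, file name. Treat that layout, the function names, and the sample contents as assumptions, and verify them against files produced by your own createdb run.

```python
from typing import Dict


def parse_source(source_text: str) -> Dict[int, str]:
    """Map file number -> input file name (assumed layout: number<TAB>name)."""
    files = {}
    for line in source_text.splitlines():
        if not line.strip():
            continue
        file_no, file_name = line.split("\t")[:2]
        files[int(file_no)] = file_name
    return files


def map_sequences_to_files(lookup_text: str, source_text: str) -> Dict[str, str]:
    """Map accession -> input file name.

    Assumed .lookup layout: key<TAB>accession<TAB>file number.
    """
    files = parse_source(source_text)
    origin = {}
    for line in lookup_text.splitlines():
        if not line.strip():
            continue
        _key, accession, file_no = line.split("\t")[:3]
        origin[accession] = files[int(file_no)]
    return origin


if __name__ == "__main__":
    # Hypothetical file contents for a database built from two FASTA files.
    lookup = "0\tseqA\t0\n1\tseqB\t0\n2\tseqC\t1\n"
    source = "0\tinput1.fa\n1\tinput2.fa\n"
    for accession, file_name in map_sequences_to_files(lookup, source).items():
        print(accession, file_name)
```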

Bug fixes

createindex would create corrupted indices for profile target databases
rbh workflow would create its result DB at an unexpected (wrong) location
(easy-)taxonomy --lca-mode 3 (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
lca (and (easy-)taxonomy) add empty columns for unclassified sequences so that the output is valid TSV
kmermatcher uses xxhash for hashing now (faster)
kmermatcher avoids a crash when the machine does not have enough memory to process the data at once (affects linclust/cluster)
kmermatcher correctly deals with sequences longer than MAX_SHRT now
kmermatcher fixed various edge cases (e.g. alignment of 1-char sequences)
kmermatcher hash-shift would be ignored
offsetalignment could produce wrong results in the minus-strand
clust now correctly and consistently handles alignment DB input
clusthash better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
(easy-)cluster --single-step-clustering could cluster unrelated sequences due to hash collisions
prefilter --diag-score 0 respects --min-ungapped-score
createseqfiledb could print empty sequence lines
taxonomyreport could crash if no sequence was unclassified
result2flat could crash with long sequence input
result2msa, result2profile, msa2profile backport filtering fix from HHblits
align could produce bad alignments if all sequence lengths in the query DB were a lot shorter than in the target DB
splitsequence fixed issues when combined with compressed databases
result2profile fix Filter2 bug of HH-suite in MMseqs2
apply would crash due to reading wrong entry lengths
filterdb --filter-expression was not thread safe and could corrupt results
filterdb --extract-lines and --trim-to-one-column are compatible with each other

Developers

Internal representation of sequences changed from 4 bytes per character to 1 byte per character
Compilation under AppleClang + libomp works now (see util/build_osx.sh)
Tools inheriting from MMseqs2 can now add their own citations
MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed symlinkat call; relevant for bioconda)
