MMseqs2 Release 11-e1a1c

martin-steinegger released this 11 Feb 22:31

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps to download and set up databases. We now have chat support at chat.mmseqs.com.

Known Issues

  • rbh crashes due to invalid sorting mode (#290)
  • Homebrew's macOS version does not use multiple cores (#289)
  • prefilter results can be unstable between different runs for extremely redundant databases (#277)
  • linclust/cluster can crash for very small input sets (#274)

Breaking Changes

  • kmermatcher --skip-n-repeat-kmer parameter was replaced with --ignore-multi-kmer
    It no longer discards whole sequences if a k-mer occurred too often; instead, it skips the specific k-mers.
    Either mode is only used in Plass and not in Linclust.
  • --lca-ranks from (easy-)taxonomy and lca has to be delimited with semicolons (;) instead of colons (:)
  • --dont-shuffle flag was renamed to --shuffle true/false
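
Scripts that still pass the old options can be updated mechanically. The following is a minimal sketch of such a rewrite, not part of MMseqs2: migrate_args is a hypothetical helper, and it assumes that the old bare --dont-shuffle flag corresponds to --shuffle false. It leaves --skip-n-repeat-kmer untouched, because --ignore-multi-kmer behaves differently and needs a manual decision.

```python
def migrate_args(args):
    """Rewrite a pre-release-11 MMseqs2 argument list (hypothetical helper)."""
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--lca-ranks" and i + 1 < len(args):
            # Rank lists are now delimited with ';' instead of ':'.
            out += [arg, args[i + 1].replace(":", ";")]
            i += 2
        elif arg == "--dont-shuffle":
            # Assumption: the old bare flag maps to '--shuffle false'.
            out += ["--shuffle", "false"]
            i += 1
        else:
            out.append(arg)
            i += 1
    return out


old = ["taxonomy", "--lca-ranks", "genus:family", "--dont-shuffle"]
print(migrate_args(old))
```

When typing a rank list directly in a shell, quote it (for example --lca-ranks "genus;family"), since an unquoted semicolon ends the command.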

Features

  • new databases workflow to list and download common databases.
    Supported databases:
  Name                	Type      	Taxonomy	Url
- UniRef100           	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef90            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef50            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniProtKB           	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL    	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot	Aminoacid 	     yes	https://uniprot.org
- NR                  	Aminoacid 	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                  	Nucleotide	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- PDB                 	Aminoacid 	       -	https://www.rcsb.org
- PDB70               	Profile   	       -	https://github.com/soedinglab/hh-suite
- Pfam-A.full         	Profile   	       -	https://pfam.xfam.org
- Pfam-A.seed         	Profile   	       -	https://pfam.xfam.org
- eggNOG              	Profile   	       -	http://eggnog5.embl.de
- Resfinder           	Nucleotide	       -	https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari            	Nucleotide	     yes	https://github.com/lskatz/Kalamari
  • (easy-)search --slice-search is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by --disk-space-limit
  • createdb and the various easy- workflows learned to read query input from STDIN
  • taxonomyreport learned to display the summarized taxonomy result with Krona
  • new filtertaxseqdb module for filtering sequence DBs with taxonomy information according to provided taxa
  • --taxon-list parameter understands expressions. E.g. get all bacterial and human sequences --taxon-list "2||9606"
  • easy-search and convertalis can now output taxonomic information using --format-output
taxid      Taxonomic identifier
taxname    Taxon Name
taxlineage Taxonomic lineage
  • speed up in (easy-)cluster/linclust by improving k-mer extraction
  • MMseqs2 consistently creates .source and .lookup files to track which input file a sequence came from
    E.g. after mmseqs createdb input1.fa input2.fa seqDB, each sequence in seqDB can be traced back to input1.fa or input2.fa
  • createdb learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
  • align and rescorediagonal learned to align circular sequences
  • align exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
  • reverseseq learned to reverse profiles
  • filterdb can filter rows whose value is within a given percentage of the first row's value
  • new aggregatetax module to assign a taxonomic label to contigs according to the fragments matched on the contig
  • Adjusting --max-seq-len is no longer required; MMseqs2 now increases the length automatically.
  • MMseqs2 on Cygwin/Windows uses nedmalloc as its memory allocator now and does not massively slow down due to lock contention
  • new tar2db module to efficiently transform content of tar archives to MMseqs2 databases
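
To illustrate the --taxon-list expression mentioned above, here is a minimal sketch of how an "or" expression such as "2||9606" selects sequences. It is not MMseqs2's implementation: matches_taxon_list is a hypothetical function, only the || operator shown in these notes is handled, and the lineages are abbreviated for illustration.

```python
def matches_taxon_list(expression, lineage):
    """Return True if any taxon ID in an 'a||b||...' expression is in the lineage.

    `lineage` is the set of NCBI taxon IDs from a sequence's own taxon up to the root.
    """
    wanted = {int(token) for token in expression.split("||")}
    return not wanted.isdisjoint(lineage)


# Abbreviated, illustrative lineages (own taxon plus a few ancestors).
ecoli = {562, 2, 1}       # Escherichia coli, within Bacteria (2)
human = {9606, 2759, 1}   # Homo sapiens, within Eukaryota (2759)
yeast = {4932, 2759, 1}   # Saccharomyces cerevisiae, within Eukaryota (2759)

for name, lineage in [("ecoli", ecoli), ("human", human), ("yeast", yeast)]:
    print(name, matches_taxon_list("2||9606", lineage))
```

A sequence matches when its own taxon or any of its ancestors appears in the expression, which is why the E. coli entry is selected by the bacterial taxon 2 while the yeast entry is not.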

Bug fixes

  • createindex would create corrupted indices for profile target databases
  • rbh workflow would create its result DB at an unexpected (wrong) location
  • (easy-)taxonomy --lca-mode 3 (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
  • lca (and (easy-)taxonomy) now add empty columns for unclassified sequences so that the output is valid TSV
  • kmermatcher uses xxhash for hashing now (faster)
  • kmermatcher avoids a crash when the machine does not have enough memory to process the data at once (affects linclust/cluster)
  • kmermatcher correctly deals with sequences longer than MAX_SHRT now
  • kmermatcher fixed various edge cases (e.g. alignment of 1-char sequences)
  • kmermatcher hash-shift would be ignored
  • offsetalignment could produce wrong results in the minus-strand
  • clust now correctly and consistently handles alignment DB input
  • clusthash better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
  • (easy-)cluster --single-step-clustering could cluster unrelated sequences due to hash collisions
  • prefilter --diag-score 0 respects --min-ungapped-score
  • createseqfiledb could print empty sequence lines
  • taxonomyreport could crash if no sequence was unclassified
  • result2flat could crash with long sequence input
  • result2msa, result2profile, msa2profile backport filtering fix from HHblits
  • align could produce bad alignments if all sequence lengths in the query DB were a lot shorter than in the target DB
  • splitsequence fixed issues when combined with compressed databases
  • result2profile fixed the HH-suite Filter2 bug in MMseqs2
  • apply would crash due to reading wrong entry lengths
  • filterdb --filter-expression was not thread safe and could corrupt results
  • filterdb --extract-lines and --trim-to-one-column are compatible with each other

Developers

  • Internal representation of sequences changed from 4 bytes per character to 1 byte per character
  • Compilation under AppleClang + libomp works now (see util/build_osx.sh)
  • Tools inheriting from MMseqs2 can now add their own citations
  • MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed symlinkat call; relevant for bioconda)