Releases: soedinglab/MMseqs2
MMseqs2 Release 6-f5a1c
Changes since release 5-9375b
New features
- Support user-defined output format in `convertalis`.
- Add parameters for gap open and gap extension costs.
- Improve substitution matrix support. Letters of the alphabet can now be chosen freely.
- Add a few PAM matrices to the data folder. Choose them with the `--sub-mat` parameter.
- Support IUPAC codes in translated search.
- Add parameter to define a spaced k-mer pattern.
- Add a new module `ungappedprefilter`. It computes an optimal ungapped score using a vectorized algorithm.
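As a rough illustration of what a spaced k-mer pattern does, the sketch below slides a pattern over a sequence and keeps only the marked positions. The 0/1 pattern syntax, the function name `spaced_kmers`, and the example sequence are assumptions made for this sketch; the release notes do not specify how the parameter encodes the pattern.

```python
from __future__ import annotations


def spaced_kmers(sequence: str, pattern: str) -> list[str]:
    """Return the spaced k-mers of `sequence` for a 0/1 `pattern`.

    Assumption for this sketch: '1' marks a position that is kept,
    '0' marks a position that is skipped.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    # One k-mer per window start; sequences shorter than the span yield none.
    return [
        "".join(sequence[start + i] for i in keep)
        for start in range(len(sequence) - span + 1)
    ]


print(spaced_kmers("MKVLAAGI", "1101"))
# ['MKL', 'KVA', 'VLA', 'LAG', 'AAI']
```

Each window spans four residues but contributes only three letters, so a mismatch at the skipped position does not change the k-mer.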
Bug fixes
- Fix `easy-linclust` parameter parsing issue.
- Fix coverage filtering in `align` when the parameter `--realign` is set.
- Fix sequence identity computation in `rescorediagonal --rescore-mode 2`.
- Fix `apply` MPI support.
- Fix representative sequence output bug in `result2repseq`.
- Fix possible MPI issues in modules creating symlinks.
- Fix slightly wrong E-value computed in `alignall` module.
Known Issues
- `easy-search` output has only one column. Workaround: Add parameter `--format-output ""`.
MMseqs2 Release 5-9375b
Changes since release 4-0b8cc
Bug fixes
- bool flag parameters (e.g. `-a`) work again
- `swapresults` will deterministically rank results
- `shellcompletion` does not report run time anymore
MMseqs2 Release 4-0b8cc
Changes since release 3-be8f6
New features
- Alternative alignments in search (`--alt-ali`). Find alignments by masking out previously found regions in the target sequence.
- Added `map` workflow for fast near-exact mapping of reads
- Added `easy-linclust` workflow, which works on FASTA files
- Sequence lengths longer than 32k are now supported (default sequence length limit is now 65535)
- `createdb` shuffles the order of entries by default (`--dont-shuffle` to disable), useful for database splits, where one split could take much longer than others
- `linclust` now supports MPI
- `linclust` adds one hash for the whole sequence, to improve exact sequence matching
- New sequence identity computation modes, where the normalization happens on the query or target length instead of alignment length
- New `--cov-mode` that computes the coverage only based on sequence lengths (`--cov-mode 3`)
- `search`/`cluster`/`linclust` workflows have learned `--alignment-mode 4` for faster ungapped alignments
- Translated `search` now sorts results by E-value and aggregates all ORFs under the corresponding contig identifier
- `prefiltering` can now sort hits with score > 255 correctly
- `convertalis` now works with profiles
- Added generalized database transposition tool `swapdb` (`swapresults` only makes sense for prefiltering/alignment results)
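The effect of normalizing sequence identity by different lengths can be shown with a small worked example. This is only a sketch: the function name, the `normalize_by` argument, and the numbers are hypothetical, and the release notes do not say which flag selects each mode.

```python
def sequence_identity(identical: int, aln_len: int, query_len: int,
                      target_len: int, normalize_by: str = "alignment") -> float:
    """Fraction of identical positions under a chosen normalization.

    `normalize_by` is a hypothetical switch for this sketch:
    "alignment", "query" or "target".
    """
    denominators = {
        "alignment": aln_len,
        "query": query_len,
        "target": target_len,
    }
    if normalize_by not in denominators:
        raise ValueError(f"unknown normalization: {normalize_by}")
    return identical / denominators[normalize_by]


# 45 identical positions in a 60-column alignment,
# query of length 90, target of length 150.
for mode in ("alignment", "query", "target"):
    print(mode, sequence_identity(45, 60, 90, 150, mode))
```

The same alignment yields 0.75, 0.5 and 0.3 respectively, so a fixed identity threshold becomes stricter when the normalization uses a full sequence length.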
Performance
- Speed up `extractorf` with vectorization
- Many performance improvements to reduce overhead for web server mode
- `createtsv` writes output in parallel
- Avoid many unnecessary memory allocations in various modules
Bug fixes
- `convertmsa` now correctly parses STOCKHOLM files without accession keys
- In `search`, when using splits, fewer than `--max-seqs` sequences would be the limit; the limit is now computed correctly (max-seqs/Splits + 4*sqrt(max_seqs/Splits))
- Fix bug in MsaFilter where wrong sequences would be filtered
- `swapresults` will add an empty entry if a target entry has no corresponding query match, instead of no entry at all
- `createindex` now correctly creates a tmp directory if no directory exists already
- Fix query split runs for small input databases
- `result2stats` was reading the wrong first sequence (from query instead of target database)
- `result2repseq` now writes the correct `.dbtype` file
- `convertalis` now reads the correct `dbtype` for the target sequence
- Fix empty REG_EMPTY bug on macOS
- Fix possible memory corruption when searching against database indexed by `createindex`
- Report error if -DHAVE_MPI was set and MPI is not installed on the system
- Avoid race condition in `kmermatcher` (invalid parallel writing to vector)
- Fix `msa2profile` header output format
- `msa2profile` uses the FASTA read-in mode by default now
- Target profile databases and databases built with `--exact-kmer-matching` now correctly extract all k-mers
- Fix identical score computation of alignment if clustering using profiles
- Nucleotide backtranslation `translateaa` would produce invalid codons for X
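The per-split limit quoted in the `search` fix above, max-seqs/Splits + 4*sqrt(max_seqs/Splits), can be worked through numerically. The sketch below is an illustration only: the function name is hypothetical, and rounding up to an integer is an assumption, since the release notes give just the formula.

```python
import math


def per_split_limit(max_seqs: int, splits: int) -> int:
    """Per-split hit limit: max_seqs/splits + 4*sqrt(max_seqs/splits).

    Rounding up is an assumption of this sketch.
    """
    if max_seqs <= 0 or splits <= 0:
        raise ValueError("max_seqs and splits must be positive")
    share = max_seqs / splits  # even share of the hits per split
    return math.ceil(share + 4 * math.sqrt(share))


print(per_split_limit(300, 3))  # 140
```

With 300 sequences over 3 splits, each split keeps up to 140 hits instead of the even share of 100, which leaves headroom for splits that contain more than their share of good hits.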
Others
- Removed `--early-exit`
- Output name of program called
Experimental new modules
- New fast alignment method `alignbykmer`
Developers
- CMake flag `-DHAVE_GPROF` for profiling MMseqs2 using gprof
- Fixed most warnings
- SSTR does not use stringstreams anymore
- Refactored time measuring
- `Debug::INFO/WARNING/ERROR` is now used consistently across the codebase
- If available, [shellcheck](https://github.com/koalaman/shellcheck) will critique shell scripts and fail the compilation
MMseqs2 Release 3-be8f6
Changes since 2-23394 Release
New Features
- Create simple workflows (FASTA/FASTQ in, flat file out) for clustering (`easy-cluster`) and searching (`easy-search`)
- Add a new greedy incremental clustering algorithm to the `clust` module, which needs less memory
- Make the new low-memory clustering algorithm the default if `--cov-mode 1` is used in `linclust` and `cluster`
- Add `alignall` module for all-against-all alignments of e.g. clusters
- Improved Windows support
- `filterdb` learned new modes
Bug fixes
- Fix wrong merging code in `linclust`
- Fix e-value issues in target-split case
- Fix seg. fault in rescore diagonal if 'z' is used
- Fix seg. fault when using masking in `kmermatcher`
- Fix wrong `filterdb` default mode
- `prefilter` overestimated the required amount of memory and refused to run
- `prefilter` scores would saturate too early, now they have the full 2^16 range
Others
- Profile searches create fewer high-scoring false positives through better compositional bias correction and masking of low-complexity regions of profiles
- Clustering now supports the whole 2^32 range instead of the previous 2^31
- Speed up clustering when using `--cov-mode 1`
- Rework symlinks to the header databases
- Support profiles on query and target side in `result2profile`
MMseqs2 Release 2-23394
Changes since 1-c7a89 Release
New Features
- Translated searches (blastx and tblastn like search modes)
- Improved splitting of input sequences in `kmermatcher` (less memory needed for `linclust`)
- `linclust` supports nucleotide sequences (experimental feature, k-mer length is not yet optimized)
- `search` supports nucleotide-nucleotide searches (preview, not stable yet)
- `pssm2profile` module to print human-readable profiles
- `msa2profile` has a gap match mode to convert multiple sequence alignments without representative sequence to profile databases
- Compute sequence identity in a similar way to BLAST if `--alignment-mode 3` is used
- `apply` module to execute an arbitrary program on each entry of an MMseqs2 database, like map from MapReduce
- `extractorf` can use start/stop codons from alternative translation tables
- `filterdb` can now append entries from other databases by looking them up
- `proteinaln2nucl` maps a protein alignment back to its original nucleotide sequences
- `taxonomy` can now blacklist nodes (by default the unclassified and others nodes)
- tmp folder is automatically created, all workflow intermediate results are placed in a subfolder based on the hash of all paths and parameters
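The idea of placing intermediate results in a subfolder named after a hash of all paths and parameters can be sketched as follows. The hash function (SHA-256), the 16-character truncation, the function name, and the example paths are all assumptions for illustration; the release notes do not say how MMseqs2 computes the hash.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def workflow_tmp_dir(tmp_root: str, paths: list[str], params: list[str]) -> Path:
    """Derive a per-run subfolder name from input paths and parameters.

    Hypothetical sketch: only the path is computed, nothing is created on disk.
    """
    digest = hashlib.sha256()
    for item in [*paths, *params]:
        digest.update(item.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return Path(tmp_root) / digest.hexdigest()[:16]


first = workflow_tmp_dir("tmp", ["queryDB", "targetDB"], ["-s", "7.5"])
again = workflow_tmp_dir("tmp", ["queryDB", "targetDB"], ["-s", "7.5"])
other = workflow_tmp_dir("tmp", ["queryDB", "targetDB"], ["-s", "4"])
print(first == again, first == other)  # True False
```

Rerunning with identical inputs maps to the same folder, while any change in paths or parameters gives a different folder, so intermediate results from different runs are kept apart.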
Performance Regressions Fixed
- Fixed regression when multiple mmseqs instances were running at the same time
Breaking Command Line Interface Changes
- Incremented index version, old precomputed indices have to be regenerated
- New profile format, databases generated through `convertprofiledb` and `msa2profile` have to be regenerated
- Clustering workflow is now cascaded by default. We replaced the `--cascaded` flag with `--single-step-clustering`
- Max sequence length of 32768 is now actually validated and enforced
- Each sequence database now has a dbtype file (AA=0, NUC=1, PROFILE=2)
- `extractorf` was reworked:
  * `--skip-incomplete` was split into two parameters, `--contig-start-mode` and `--contig-end-mode`
  * `--longest-orf` was reworked into `--orf-start-mode`
  * removed `--extend-min` parameter
Others
- Clustering workflow is faster by a factor of four
- Improve speed of `linclust` by a factor of two
- Remove 'X' from prefilter index (reduces memory and improves speed at the same sensitivity)
- Fix bugs for query coverage mode (`--cov-mode 2`)
- Clustering is now the same between single and multi threaded version
- Speedup of kmermatcher
- Fix bug in Clust hash. It can now cluster to 1.0 sequence identity
- Improve target profile search, set max-seqs to infinite for alignments.
- Improve speed of `align` if prefilter results fit into memory
- Many usability improvements
- Improved suggestions of bash completion
- Expert modules are hidden by default, use `-h` flag to show everything
- Speed up `mergeclusters` by a lot
- Fix sequence identity print out bug if the id is less than 10%
- MPI Runner variable can now correctly contain further parameters (RUNNER="mpirun -np 4" was not working)
- Enforcing GCC 4.6 compatibility in our continuous integration
Developers
- MMseqs2 can now be included in framework mode to subprojects
- DBReader has a SHUFFLE mode
MMseqs2 Release 1-c7a89
Changes since vNatBiotech Release
New Features
- Taxonomy classification workflow with robust 2bLCA computation and fast LCA computation in O(N log N)
- Support reading .bz2 archives for createdb
- Createdb can turn multiple fasta files into one database now
- Extend prefilter score range to improve order of best hits after prefiltering.
- Automatically split input sequence set based on system RAM in kmermatcher. Linclust can now run with less memory.
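For readers unfamiliar with the lowest common ancestor (LCA) used in the taxonomy workflow above, here is a deliberately naive sketch on a toy tree. It is not the 2bLCA procedure and not the O(N log N) method mentioned in these notes; the parent map, the taxon IDs, and the function names are made up for illustration.

```python
from __future__ import annotations


def path_to_root(parent: dict[int, int], node: int) -> list[int]:
    """Nodes from `node` up to the root (the root is its own parent)."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(parent: dict[int, int], nodes: list[int]) -> int:
    """Naive LCA: deepest node lying on every path to the root."""
    if not nodes:
        raise ValueError("need at least one node")
    first_path = path_to_root(parent, nodes[0])
    other_paths = [set(path_to_root(parent, n)) for n in nodes[1:]]
    # first_path is ordered from deepest to root, so the first hit is the LCA.
    for candidate in first_path:
        if all(candidate in seen for seen in other_paths):
            return candidate
    raise ValueError("nodes do not share a root")


# Toy taxonomy: 1 is the root; 2 and 3 are its children;
# 4 and 5 sit under 2; 6 sits under 3.
toy_parent = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
print(lowest_common_ancestor(toy_parent, [4, 5]))  # 2
print(lowest_common_ancestor(toy_parent, [4, 6]))  # 1
```

Hits that agree on a lineage yield a specific taxon, while hits from distant lineages collapse towards the root.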
Performance Regressions Fixed
- Fixed underperforming iterative-sequence-profile search without a precomputed index table
Breaking Command Line Interface Changes
- Iterative-non-profile-search --sens-step-size changed to --sens-steps (Number of Iterations) (Does not break nested workflows anymore)
Others
- Query coverage mode (--cov-mode 2) for searching
- Clustering is now the same between single and multi threaded version
- Bug fixes in rescorediagonal
- Speedup of kmermatcher
- Speedup and memory reduction of swapresults
- Many usability improvements
Developers
- MMseqs2 can now be included in framework mode to subprojects
Nature Biotechnology Release
Release for Nature Biotechnology