Unipept Next
We are happy to announce the release of Unipept Next, featuring a range of remarkable new enhancements and capabilities:
- Faster! The taxonomic and functional analysis of a metaproteomics sample now takes even less time.
- Support for missed cleavage handling in the Unipept API and CLI.
- Support for the analysis of samples with semi-tryptic and non-tryptic peptides.
- Completely new index structure for matching peptides with proteins, based on suffix arrays.
Over the last year, we have been working on a new index structure that Unipept uses to match user-provided peptides with proteins in UniProtKB (and, consequently, with taxonomic and functional information). Previously, Unipept relied on a traditional relational database of pre-digested tryptic peptides, which only allowed efficient retrieval of peptides that are actually present in that database. As a result, only perfectly cleaved tryptic peptides were supported. Over time, we added support for tryptic peptides with missed cleavages, albeit with a significant performance penalty, and still without support for semi-tryptic peptides or peptides that are not cleaved by trypsin at all.
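To illustrate why an index of pre-digested peptides is limited to perfectly cleaved tryptic peptides, here is a minimal Python sketch. It is not Unipept's actual code: the function name `tryptic_digest`, the example sequence, and the use of a plain `set` as the "database" are assumptions for illustration. It applies the common trypsin rule of cleaving after K or R unless the next residue is P.

```python
def tryptic_digest(protein):
    """Cleave after every K or R that is not followed by P (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, amino_acid in enumerate(protein):
        next_amino_acid = protein[i + 1] if i + 1 < len(protein) else ""
        if amino_acid in "KR" and next_amino_acid != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal remainder when the protein does not end in K or R.
        peptides.append(protein[start:])
    return peptides


# Hypothetical example protein; the set stands in for the peptide database.
peptide_index = set(tryptic_digest("MKTAYIAKQRPLEDR"))

print(sorted(peptide_index))          # the perfectly cleaved tryptic peptides
print("TAYIAK" in peptide_index)      # perfectly cleaved: found
print("MKTAYIAK" in peptide_index)    # one missed cleavage: not found
print("AYIAK" in peptide_index)       # semi-tryptic: not found
```

Only exact members of the precomputed set can be looked up, so peptides with missed cleavages or non-tryptic ends need either extra index entries or extra work at query time.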
Instead of precomputing an in-silico tryptic digestion of the proteins in UniProtKB, we decided to switch to a suffix array. This data structure allows for the efficient matching of short substrings in a large text, which lets us find out which proteins a peptide occurs in. Because the suffix array makes no assumptions about the type of input peptide, it also allows us to query non-tryptic peptides. One downside of suffix arrays is that they require significantly more memory than a traditional relational database, because both the array itself and the additional information needed for efficient substring searches must be stored. For large datasets, this can lead to high resource consumption and may require specialized hardware or optimizations to keep the memory overhead manageable.
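The idea of matching arbitrary peptides through a suffix array can be sketched as follows. This is a toy illustration, not Unipept's implementation: the function names `build_index` and `find_proteins`, the `$` separator, and the protein identifiers `P1` and `P2` are all assumptions, and the naive construction by sorting suffixes would not scale to UniProtKB.

```python
from bisect import bisect_right

SEPARATOR = "$"  # assumed never to occur inside a protein sequence


def build_index(proteins):
    """Concatenate all proteins and sort every suffix start position (naive construction)."""
    names, starts, parts = [], [], []
    offset = 0
    for name, sequence in proteins.items():
        names.append(name)
        starts.append(offset)
        parts.append(sequence + SEPARATOR)
        offset += len(sequence) + 1
    text = "".join(parts)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, names, starts


def find_proteins(index, peptide):
    """Binary-search the suffix array for all suffixes that start with the peptide."""
    text, suffix_array, names, starts = index
    length = len(peptide)

    def first_rank(strict):
        # First rank whose length-limited prefix is >= peptide (or > peptide if strict).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + length]
            if prefix < peptide or (strict and prefix == peptide):
                lo = mid + 1
            else:
                hi = mid
        return lo

    begin, end = first_rank(False), first_rank(True)
    # Map each matching text position back to the protein that contains it.
    hits = {names[bisect_right(starts, pos) - 1] for pos in suffix_array[begin:end]}
    return sorted(hits)


# Hypothetical proteins for illustration only.
index = build_index({"P1": "MKTAYIAKQRPLEDR", "P2": "GGAYIAKQW"})
print(find_proteins(index, "AYIAKQ"))   # non-tryptic peptide present in both proteins
print(find_proteins(index, "QRPLEDR"))  # tryptic peptide present in one protein
```

Because every substring of a protein is a prefix of one of its suffixes, the same lookup works for tryptic, semi-tryptic, and non-tryptic peptides alike. The cost is one stored position per character of the concatenated text.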
While a complete suffix array delivers the best performance, the index itself is large. The size of the suffix array can be reduced by introducing a sampling step to create a so-called sparse suffix array (SSA). This variant of a suffix array only stores every k-th suffix of the input text. This results in an SSA which is only
Unipept Next achieves performance that is on par with Unipept 5.0 when analyzing traditional tryptic peptides, so users can expect the same speed for standard analyses. A significant advancement in Unipept Next, however, is that the "advanced missed cleavage handling" feature is now always enabled, without any performance penalty. Furthermore, because of the way the suffix array data structure works, the option can no longer be disabled.
Right now, the option is always checked in Unipept's user interface and cannot be unchecked anymore. In the future, the option will be removed entirely.