
Duplicate sequences in benchmark data #24

Closed
iaposto opened this issue Oct 23, 2024 · 7 comments
Labels: bug (Something isn't working)
iaposto commented Oct 23, 2024

Hello. Congrats on the paper and very interesting tool!
I am working with the data provided in the documentation, specifically the New AutoPeptideML Benchmarks set that you used for model development.
I noticed there are duplicate sequences in the training and test sets for most bioactivity datasets, as well as overlap of peptides between the two sets. Was this intended?
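The two checks described above (duplicates within a split, and overlap between splits) can be sketched like this. The tiny DataFrames here are stand-ins; the actual benchmark files may be organised differently:

```python
import pandas as pd

# Stand-in data; in practice these would be loaded from the benchmark CSVs.
train = pd.DataFrame({"sequence": ["ACDE", "ACDE", "KLMN", "QRST"]})
test = pd.DataFrame({"sequence": ["QRST", "WXYZ"]})

# Duplicate sequences within each split.
train_dups = int(train["sequence"].duplicated().sum())
test_dups = int(test["sequence"].duplicated().sum())

# Exact-sequence overlap between the two splits.
overlap = set(train["sequence"]) & set(test["sequence"])

print(train_dups, test_dups, sorted(overlap))  # 1 0 ['QRST']
```

This only catches exact matches; near-duplicates require a similarity search such as MMseqs2, which is what the discussion below turns to.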

RaulFD-creator (Collaborator) commented Oct 25, 2024

Hi @iaposto, thanks for your kind words! The short answer is that it is intentional, but not ideal.

The duplicated sequences appear in datasets for which not enough negative samples could be drawn from the AutoPeptideML - Peptipedia subset. This database is fairly big, but for datasets with many samples, like Antibacterial or Antimicrobial (which also comprise a significant portion of Peptipedia), positive and negative samples were imbalanced after excluding overlapping bioactivities. AutoPeptideML, by default, oversamples (randomly duplicates) the underrepresented class, in this case the negative peptides; hence the duplicated entries. I have double-checked in case something had slipped through the cracks, but I could not find any instance of a duplicated positive entry. If you have found one, please let me know, as it may be due to a bug in the code that needs to be addressed.

In the end, the datasets keep the duplicated entries for the sake of reproducibility, but users may have better strategies for handling the imbalance, in which case it makes sense to drop the duplicated entries.
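The random oversampling described above can be sketched as follows. This is a minimal illustration of the general technique (duplicate minority-class entries, sampled with replacement, until the classes are balanced), not AutoPeptideML's actual implementation:

```python
import random

def oversample_minority(positives, negatives, seed=42):
    """Balance two classes by randomly duplicating entries of the
    smaller one (sampling with replacement). Illustrative only."""
    rng = random.Random(seed)
    if len(negatives) < len(positives):
        extra = rng.choices(negatives, k=len(positives) - len(negatives))
        return positives, negatives + extra
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    return positives + extra, negatives

pos, neg = oversample_minority(
    ["AAAA", "CCCC", "GGGG", "KKKK"],  # positive peptides
    ["TTTT", "SSSS"],                  # negative peptides (minority class)
)
print(len(pos), len(neg))  # 4 4
```

Dropping the duplicated entries afterwards simply undoes this balancing, which is why it only makes sense when a different imbalance-handling strategy (e.g. class weighting) is used instead.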

Regarding the overlap of peptides between the two sets, could you clarify what you mean?

I hope that I have been able to address your question. Please do not hesitate to follow up if you have any further questions.

iaposto (Author) commented Oct 27, 2024

@RaulFD-creator thank you for the prompt and detailed answer!

RaulFD-creator (Collaborator) commented Nov 11, 2024

Hi @iaposto,

I've been double-checking what you said about overlap between the training and test sets, and I found something alarming: some sequences do appear in both sets, as you indicated. Here are the statistics:

| Dataset | Overlapping sequences | Total sequences |
|---------|----------------------:|----------------:|
| AB      | 85  | 13245 |
| ACE     | 26  | 1685  |
| ACP     | 17  | 1378  |
| AF      | 8   | 1591  |
| AMAP    | 0   | 221   |
| AMP     | 0   | 10266 |
| AOX     | 12  | 698   |
| APP     | 6   | 482   |
| AV      | 135 | 4711  |
| BBP     | 0   | 191   |
| DPPIV   | 16  | 1063  |
| MRSA    | 5   | 237   |
| Neuro   | 10  | 3881  |
| QS      | 0   | 349   |
| TOX     | 0   | 3092  |
| TTCA    | 67  | 948   |

This behaviour was not supposed to happen, and I've been tracking down the bug that caused it. It turns out that the default configuration of the software we use for computing sequence similarity is not well optimised for short sequences (< 20 residues) and can miss some exact matches (see MMseqs2 issue #373). The solution I've come up with is to perform the alignment with three sets of settings:

  1. Sequences longer than 20 residues: use the previous, default MMseqs2 configuration.
  2. Sequences between 9 and 20 residues (both inclusive): use the MMseqs2 configuration described in MMseqs2 issue #373.
  3. Sequences shorter than 9 residues: report exact matches only.
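The three-tier strategy above can be sketched as a simple length-based dispatcher. The strategy names below are illustrative labels, not Hestia-OOD's actual code or MMseqs2 parameter sets:

```python
def choose_alignment_strategy(seq_len: int) -> str:
    """Pick an alignment configuration based on peptide length,
    mirroring the three cases described above (illustrative only)."""
    if seq_len > 20:
        return "mmseqs_default"   # case 1: default MMseqs2 settings
    elif 9 <= seq_len <= 20:
        return "mmseqs_short"     # case 2: short-sequence settings (MMseqs2 issue #373)
    else:
        return "exact_match"      # case 3: report exact matches only

print(choose_alignment_strategy(25))  # mmseqs_default
print(choose_alignment_strategy(12))  # mmseqs_short
print(choose_alignment_strategy(7))   # exact_match
```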

The code and details for this new setup are included in the sequence_similarity_peptides function in Hestia-OOD version 0.0.33. Version 1.0.3 of AutoPeptideML will support this alignment by default; the other supported alternative will be needle (which can also fail to detect exact matches, but only when the difference in length between query and target is large).

Regarding the existing benchmarks, I'm going to add a link in the README file to an amended version with the overlapping sequences removed, so that it stays as close as possible to the benchmarks used in the paper. The version of the benchmarks with the overlap will remain available for reproducibility. We believe the number of overlapping sequences is small enough that it will not have a significant impact on the results reported in the paper.
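The amendment described (removing from the test set any sequence that also occurs in training) could be sketched like this, assuming pandas DataFrames with a `sequence` column; the data here is a stand-in:

```python
import pandas as pd

# Stand-in splits; "QRST" leaks from training into the test set.
train = pd.DataFrame({"sequence": ["ACDE", "KLMN", "QRST"]})
test = pd.DataFrame({"sequence": ["QRST", "WXYZ"]})

# Drop test entries whose sequence also appears in training.
amended_test = test[~test["sequence"].isin(train["sequence"])].reset_index(drop=True)
print(list(amended_test["sequence"]))  # ['WXYZ']
```

Amending only the test side keeps the training data (and therefore the fitted models) unchanged while removing the evaluation leakage.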

Finally, I wanted to thank you for bringing this issue to our attention, as it was a serious bug that needed to be fixed.

@RaulFD-creator RaulFD-creator added the bug Something isn't working label Nov 11, 2024
@RaulFD-creator RaulFD-creator self-assigned this Nov 11, 2024
RaulFD-creator added a commit that referenced this issue Nov 11, 2024
🛠️ Code-dev: Update default parameters (see #24)
iaposto (Author) commented Nov 11, 2024

@RaulFD-creator thanks for the update, glad to know I helped identify this issue.

I've been using the ccpart() function of Hestia-OOD to make the train/test split of my dataset. Has the improvement you've described been reflected there? Or do I need to call the updated sequence_similarity_peptides() instead of calculate_similarity()?

RaulFD-creator (Collaborator) commented Nov 11, 2024

Hi @iaposto, in the latest Hestia-OOD release (0.0.34), the calculate_similarity() function has been removed to make the code a bit easier to maintain. To use ccpart(), I'd recommend the following:

```python
from hestia.similarity import sequence_similarity_peptides
from hestia.partition import ccpart

sim_df = sequence_similarity_peptides(df, field_name='sequence')
train, test = ccpart(df, threshold=threshold, test_size=0.2, sim_df=sim_df)
```

If you have any problems with the new code, do not hesitate to let me know.

iaposto (Author) commented Nov 11, 2024

Hi @RaulFD-creator, thank you for the prompt response; for me, it works as follows:

```python
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test, partition_labs = ccpart(df, threshold=threshold, test_size=0.2, sim_df=sim_df)
```

RaulFD-creator (Collaborator) commented

Hi @iaposto, perfect, thanks! I forgot about the partition_labs output; sorry for the inconvenience.
