
Duplicate sequences in benchmark data #24

Closed
iaposto opened this issue Oct 23, 2024 · 7 comments
Labels: bug (Something isn't working)
iaposto commented Oct 23, 2024

Hello. Congrats on the paper and very interesting tool!
I am working with the data provided in the documentation, specifically the New AutoPeptideML Benchmarks set that you used for model development.
I noticed there are duplicate sequences in the training and test sets for most bioactivity datasets, as well as overlap of peptides between the two sets. Was this intended?
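The two checks described above (duplicates within a split, and overlap between splits) can be sketched like this. The tiny DataFrames here are stand-ins; the actual benchmark files may be organised differently:

```python
import pandas as pd

# Stand-in data; in practice these would be loaded from the benchmark CSVs.
train = pd.DataFrame({"sequence": ["ACDE", "ACDE", "KLMN", "QRST"]})
test = pd.DataFrame({"sequence": ["QRST", "WXYZ"]})

# Duplicate sequences within each split.
train_dups = int(train["sequence"].duplicated().sum())
test_dups = int(test["sequence"].duplicated().sum())

# Exact-sequence overlap between the two splits.
overlap = set(train["sequence"]) & set(test["sequence"])

print(train_dups, test_dups, sorted(overlap))  # 1 0 ['QRST']
```

This only catches exact matches; near-duplicates require a similarity search such as MMseqs2, which is what the discussion below turns to.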

RaulFD-creator (Collaborator) commented Oct 25, 2024

Hi @iaposto, thanks for your kind words! The short answer is that it is intentional, but not ideal.

The duplicated sequences appear in datasets for which not enough negative samples could be drawn from the AutoPeptideML - Peptipedia subset. This database is fairly big, but for datasets with many samples, like Antibacterial or Antimicrobial (which also comprise a significant portion of Peptipedia), positive and negative samples were imbalanced after excluding overlapping bioactivities. AutoPeptideML, by default, oversamples (randomly duplicates) the underrepresented class, in this case the negative peptides; hence the duplicated entries. I have double-checked in case something had slipped through the cracks, but I could not find any instance of a duplicated positive entry. If you have found one, please let me know, as it may be due to a bug in the code that needs to be addressed.

In the end, the datasets keep the duplicated entries for the sake of reproducibility, but users may have better strategies for handling the imbalance, in which case it makes sense to drop the duplicated entries.
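The random oversampling described above can be sketched as follows. This is a minimal illustration of the general technique (duplicate minority-class entries, sampled with replacement, until the classes are balanced), not AutoPeptideML's actual implementation:

```python
import random

def oversample_minority(positives, negatives, seed=42):
    """Balance two classes by randomly duplicating entries of the
    smaller one (sampling with replacement). Illustrative only."""
    rng = random.Random(seed)
    if len(negatives) < len(positives):
        extra = rng.choices(negatives, k=len(positives) - len(negatives))
        return positives, negatives + extra
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    return positives + extra, negatives

pos, neg = oversample_minority(
    ["AAAA", "CCCC", "GGGG", "KKKK"],  # positive peptides
    ["TTTT", "SSSS"],                  # negative peptides (minority class)
)
print(len(pos), len(neg))  # 4 4
```

Dropping the duplicated entries afterwards simply undoes this balancing, which is why it only makes sense when a different imbalance-handling strategy (e.g. class weighting) is used instead.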

Regarding the overlap of peptides between the two sets, could you clarify what you mean?

I hope that I have been able to address your question. Please do not hesitate to follow up if you have any further questions.

iaposto (Author) commented Oct 27, 2024

@RaulFD-creator thank you for the prompt and detailed answer!

RaulFD-creator (Collaborator) commented Nov 11, 2024

Hi @iaposto,

I've been double-checking what you said about overlap between the training and test sets, and I found something alarming: some sequences do appear in both sets, as you indicated. Here are the statistics:

| Dataset | Overlapping sequences | Total sequences |
|---------|----------------------:|----------------:|
| AB      | 85  | 13245 |
| ACE     | 26  | 1685  |
| ACP     | 17  | 1378  |
| AF      | 8   | 1591  |
| AMAP    | 0   | 221   |
| AMP     | 0   | 10266 |
| AOX     | 12  | 698   |
| APP     | 6   | 482   |
| AV      | 135 | 4711  |
| BBP     | 0   | 191   |
| DPPIV   | 16  | 1063  |
| MRSA    | 5   | 237   |
| Neuro   | 10  | 3881  |
| QS      | 0   | 349   |
| TOX     | 0   | 3092  |
| TTCA    | 67  | 948   |

This behaviour was not supposed to happen, and I've been tracking down the bug that caused it. It turns out that the default configuration of the software we use for computing sequence similarity is not well optimised for short sequences (< 20 residues) and can miss some exact matches (see MMseqs2 issue #373). The solution I've come up with is to perform the alignment with three sets of settings:

  1. Sequences longer than 20 residues: use the previous, default MMseqs2 configuration.
  2. Sequences between 9 and 20 residues (both inclusive): use the MMseqs2 configuration described in MMseqs2 issue #373.
  3. Sequences shorter than 9 residues: report exact matches only.
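The three-tier strategy above can be sketched as a simple length-based dispatcher. The strategy names below are illustrative labels, not Hestia-OOD's actual code or MMseqs2 parameter sets:

```python
def choose_alignment_strategy(seq_len: int) -> str:
    """Pick an alignment configuration based on peptide length,
    mirroring the three cases described above (illustrative only)."""
    if seq_len > 20:
        return "mmseqs_default"   # case 1: default MMseqs2 settings
    elif 9 <= seq_len <= 20:
        return "mmseqs_short"     # case 2: short-sequence settings (MMseqs2 issue #373)
    else:
        return "exact_match"      # case 3: report exact matches only

print(choose_alignment_strategy(25))  # mmseqs_default
print(choose_alignment_strategy(12))  # mmseqs_short
print(choose_alignment_strategy(7))   # exact_match
```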

The code and details for this new setup are included in the sequence_similarity_peptides function in Hestia-OOD version 0.0.33. Version 1.0.3 of AutoPeptideML will support this alignment by default; the other supported alternative will be needle (which can also fail to detect exact matches, but only when the difference in length between query and target is large).

Regarding the existing benchmarks, I'm going to add a link in the README file to an amended version with the overlapping sequences removed, so that it stays as close as possible to the benchmarks used in the paper. The version of the benchmarks with the overlap will remain available for reproducibility. We believe the number of overlapping sequences is small enough that it will not have a significant impact on the results reported in the paper.
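The amendment described (removing from the test set any sequence that also occurs in training) could be sketched like this, assuming pandas DataFrames with a `sequence` column; the data here is a stand-in:

```python
import pandas as pd

# Stand-in splits; "QRST" leaks from training into the test set.
train = pd.DataFrame({"sequence": ["ACDE", "KLMN", "QRST"]})
test = pd.DataFrame({"sequence": ["QRST", "WXYZ"]})

# Drop test entries whose sequence also appears in training.
amended_test = test[~test["sequence"].isin(train["sequence"])].reset_index(drop=True)
print(list(amended_test["sequence"]))  # ['WXYZ']
```

Amending only the test side keeps the training data (and therefore the fitted models) unchanged while removing the evaluation leakage.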

Finally, I wanted to thank you for bringing this issue to our attention, as it was a serious bug that needed to be fixed.

@RaulFD-creator RaulFD-creator added the bug Something isn't working label Nov 11, 2024
@RaulFD-creator RaulFD-creator self-assigned this Nov 11, 2024
RaulFD-creator added a commit that referenced this issue Nov 11, 2024
🛠️ Code-dev: Update default parameters (see #24)
iaposto (Author) commented Nov 11, 2024

@RaulFD-creator thanks for the update, glad to know I helped identify this issue.

I've been using the ccpart() function of Hestia-OOD to make the train/test split of my dataset. Has the improvement you've described been reflected there? Or do I need to call the updated sequence_similarity_peptides() instead of calculate_similarity()?

RaulFD-creator (Collaborator) commented Nov 11, 2024

Hi @iaposto, in the latest Hestia-OOD release (0.0.34), the calculate_similarity() function has been removed to make the code a bit easier to maintain. To use ccpart(), I'd recommend the following:

```python
from hestia.similarity import sequence_similarity_peptides
from hestia.partition import ccpart

sim_df = sequence_similarity_peptides(df, field_name='sequence')
train, test = ccpart(df, threshold=threshold, test_size=0.2, sim_df=sim_df)
```

If you have any problems with the new code, do not hesitate to let me know.

iaposto (Author) commented Nov 11, 2024

Hi @RaulFD-creator, thank you for the prompt response; for me, it works as follows:

```python
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test, partition_labs = ccpart(df, threshold=threshold, test_size=0.2, sim_df=sim_df)
```

RaulFD-creator (Collaborator) commented

Hi @iaposto, perfect, thanks! I forgot about the partition_labs output; sorry for the inconvenience.
