Duplicate sequences in benchmark data #24
Comments
Hi @iaposto, thanks for your kind words!

The short answer is that this is intended, but it is not the ideal scenario. The duplicated sequences appear in datasets for which not enough negative samples could be drawn from the AutoPeptideML - Peptipedia subset. This database is fairly big, but for datasets with many samples, like Antibacterial or Antimicrobial (which also make up a significant share of Peptipedia), the positive and negative samples were unbalanced after excluding overlapping bioactivities. By default, AutoPeptideML oversamples (randomly duplicates) the underrepresented class, in this case the negative peptides. Hence the duplicated entries.

I have double-checked in case something had slipped through the cracks, but I couldn't find any instance of a duplicated positive entry. If you have found any, please let me know, as it may be due to a bug in the code that needs to be addressed.

In the end, the datasets keep the duplicated entries for the sake of reproducibility, but users with a better strategy for handling the imbalance may find it makes sense to drop the duplicated entries.

Regarding the overlap between peptides in the two sets, could you clarify what you mean?

I hope that I have been able to address your question. Please do not hesitate to follow up if you have any further questions. |
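For readers who want to see how random oversampling produces the duplicated negatives, and how they could be dropped again, here is a minimal sketch in plain Python. It is not AutoPeptideML's actual code: the function names, the `sequence`/`label` keys, and the toy peptides are all assumptions made for illustration.

```python
import random
from collections import Counter


def oversample_minority(records, label_key="label", seed=0):
    """Randomly duplicate records of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Extra copies are drawn with replacement, so duplicates are expected.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def drop_duplicates(records, seq_key="sequence"):
    """Keep only the first occurrence of each sequence."""
    seen = set()
    unique = []
    for rec in records:
        if rec[seq_key] not in seen:
            seen.add(rec[seq_key])
            unique.append(rec)
    return unique


# Hypothetical toy data: three positives, one negative.
data = [
    {"sequence": "KLAKLAK", "label": 1},
    {"sequence": "GIGKFLH", "label": 1},
    {"sequence": "FLPLIGR", "label": 1},
    {"sequence": "AAGGSS", "label": 0},
]
balanced = oversample_minority(data)
counts = Counter(rec["label"] for rec in balanced)
deduplicated = drop_duplicates(balanced)
```

With this toy input the single negative is copied until both classes have three entries, and `drop_duplicates` returns the four original peptides. Anyone who prefers a different balancing strategy (class weights, undersampling) could deduplicate first and then apply it.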
@RaulFD-creator thank you for the prompt and detailed answer! |
Hi @iaposto, I've been double-checking what you said about overlap between the training and testing sets, and I found something alarming: some sequences are in both sets, as you indicated. Here are the statistics:
This behaviour was not supposed to happen. I've been trying to track down the bug that caused it. Apparently, the default configuration of the software we use for computing sequence similarity is not well optimised for short sequences (< 20 residues) and can miss some exact matches (see MMseqs2 issue #373). The solution I've come up with is to perform the alignment with three sets of settings:
The code and details for this new setup are included in the

Regarding the existing benchmarks, I'm going to provide a link in the README file to an amended version where the overlapping sequences are removed, so that it is as close as possible to the benchmarks used in the paper. The version of the benchmarks with the overlap will remain for reproducibility purposes. We believe that the number of overlapping sequences is small enough that it will not have a significant impact on the results reported in the paper.

Finally, I wanted to thank you for bringing this issue to our attention, as it was a serious bug that needed to be fixed. |
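Exact overlap between partitions can be checked independently of any alignment tool, since it only needs string comparison. The sketch below is an illustration, not the code used for the amended benchmarks: the function names and toy sequences are hypothetical, and removing the shared sequences from the training set (rather than the test set) is an assumption, since the thread does not say which side is pruned.

```python
def find_overlap(train_seqs, test_seqs):
    """Return the sequences that occur in both partitions (exact matches only)."""
    return set(train_seqs) & set(test_seqs)


def remove_overlap(train_seqs, test_seqs):
    """Drop from the training list every sequence that also appears in the test list."""
    shared = find_overlap(train_seqs, test_seqs)
    return [seq for seq in train_seqs if seq not in shared]


# Hypothetical toy partitions.
train = ["KLAKLAK", "GIGKFLH", "AAGGSS", "AAGGSS"]
test = ["GIGKFLH", "FLPLIGR"]

shared = find_overlap(train, test)
clean_train = remove_overlap(train, test)
# Share of unique test sequences that leak into the training set.
overlap_fraction = len(shared) / len(set(test))
```

Note that duplicates inside a single partition (such as the oversampled negatives) are left untouched here; only sequences shared across partitions are removed.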
🛠️ Code-dev: Update default parameters (see #24)
@RaulFD-creator thanks for the update, glad to know I helped identify this issue. I've been using the |
Hi @iaposto, in the latest

    from hestia.similarity import sequence_similarity_peptides
    from hestia.partition import ccpart

    # df: DataFrame with the peptides in a 'sequence' column
    sim_df = sequence_similarity_peptides(df, field_name='sequence')
    # threshold: similarity threshold used to define the partitions
    train, test = ccpart(df, threshold=threshold, test_size=0.2, sim_df=sim_df)

If you have any problems with the new code, do not hesitate to let me know. |
Hi @RaulFD-creator, thank you for the prompt response. For me, it works as follows:
|
Hi @iaposto, perfect, thanks, I forgot about the |
Hello. Congrats on the paper and very interesting tool!
I am working with the data provided in the documentation, specifically the New AutoPeptideML Benchmarks set that you used for model development. I noticed there are duplicate sequences in the training and test sets for most bioactivity datasets, as well as an overlap of peptides between the two sets. Was this intended?