negative values in grampa.csv file #8

Open
UnixJunkie opened this issue Nov 24, 2021 · 10 comments

Comments

@UnixJunkie
Contributor

Some concentration values are negative.
I don't think this is possible, so there is a problem somewhere that introduced those negative values.

@qm-intel

qm-intel commented Mar 8, 2023

@UnixJunkie I have the same question. Did you figure out why?
One possible answer is that the values are normalized.

@zswitten and @jswitten Thanks for your great contribution. Can you explain the MIC values? If I want to turn this into a binary classification problem (AMP vs. non-AMP), how should I decide on a threshold value?

@UnixJunkie
Contributor Author

I don't know exactly.
It is possible that some standardization procedure shifted the original values.

@jswitten
Collaborator

jswitten commented Mar 8, 2023

It's log(MIC in uM), so any MIC < 1 uM will have a negative value.

Regarding the threshold value, it's totally arbitrary; there's no one absolutely correct way, except that I guess everything in the database is a positive in some sense. Because thresholding is inherently arbitrary, we used regression in our paper.
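
To illustrate why sub-micromolar MICs show up as negative numbers, here is a minimal sketch. It assumes the logarithm is base 10, which the comment above does not state, and `log_mic` is a hypothetical helper name, not something from the GRAMPA code.

```python
import math


def log_mic(mic_um: float) -> float:
    """Return the log of a MIC given in micromolar (uM).

    Assumption: base-10 logarithm. Any base gives a negative
    result for MIC < 1 uM, so the sign argument does not depend on it.
    """
    if mic_um <= 0:
        raise ValueError("MIC must be a positive concentration")
    return math.log10(mic_um)


# A MIC below 1 uM maps to a negative value, exactly 1 uM maps to 0,
# and anything above 1 uM maps to a positive value.
for mic in (0.1, 1.0, 10.0):
    print(f"MIC = {mic} uM -> log MIC = {log_mic(mic):+.2f}")
```

So a negative entry in the file would simply mean a potent peptide, not a negative concentration.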

@jswitten
Collaborator

jswitten commented Mar 8, 2023

We did convert to a classification problem in order to benchmark our results, and I believe we used both totally random peptides and random peptides from UniProt as negatives. I kind of forget the details; you can read our paper.

@UnixJunkie
Contributor Author

From the AMP literature, I would say that a MIC value <= 32 ug/mL might be a reasonable threshold.
Given the quality of public data for this problem (this is a meta-dataset aggregating values from many different experiments in many different labs), I think that treating the problem as a classification one is way safer than regression.

@qm-intel

qm-intel commented Mar 8, 2023

@jswitten Thanks for your reply,

I just drew a histogram of the MIC values:

[image: histogram of MIC values]

> It's log(MIC in uM), so any MIC < 1 uM will have a negative value.
>
> Regarding the threshold value, it's totally arbitrary; there's no one absolutely correct way, except that I guess everything in the database is a positive in some sense. Because thresholding is inherently arbitrary, we used regression in our paper.

In the section of your paper entitled "Ensemble model", you mention:

"The prediction was either very close to 4 (meaning, a predicted inactive peptide) or somewhere between -1 and 3.5 (meaning, a predicted active peptide). Therefore, for the purposes of classification (Section 3.3), instead of averaging over each of the ensemble model predictions, we had each model in the ensemble “vote.” If more than half of the models predicted log MIC > 3.9, we classified the peptide as inactive and predicted log MIC = 4. Otherwise, we classified the peptide as active and the predicted log MIC (used for generation of the ROC curves in place of a probabilistic prediction) was the average over all predictions that were <3.9."

Is log MIC <= 3.5 (with MIC in uM) your threshold for active peptides? In that case, the number of non-AMP (inactive peptide) training samples becomes very small compared to the number of active AMPs (class imbalance).

Sorry again for the long question, but I could not find a clear answer in the literature, and your dataset is the only one that I can use for my use case.
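
For readers following along, the voting rule in the quoted passage can be sketched roughly as below. This is only an illustration of the quoted text, not the authors' code: the function and constant names are hypothetical, and treating a prediction of exactly 3.9 as "active" is an assumption of this sketch, since the quote only covers > 3.9 and < 3.9.

```python
from statistics import mean

INACTIVE_CUTOFF = 3.9   # from the quoted passage
INACTIVE_LOG_MIC = 4.0  # value assigned to peptides voted inactive


def ensemble_vote(predictions):
    """Classify one peptide from a list of per-model log MIC predictions.

    Returns (is_active, predicted_log_mic).
    """
    if not predictions:
        raise ValueError("need at least one model prediction")

    # Each model "votes" inactive when its prediction exceeds the cutoff.
    inactive_votes = sum(p > INACTIVE_CUTOFF for p in predictions)

    # More than half of the models voting inactive -> inactive, log MIC = 4.
    if inactive_votes > len(predictions) / 2:
        return False, INACTIVE_LOG_MIC

    # Otherwise active: average only the predictions on the active side.
    # Assumption: a prediction of exactly 3.9 is counted as active here.
    active_predictions = [p for p in predictions if p <= INACTIVE_CUTOFF]
    return True, mean(active_predictions)


print(ensemble_vote([4.0, 4.0, 3.95, 1.0, 2.0]))  # majority above the cutoff
print(ensemble_vote([4.0, 1.0, 2.0]))             # majority below the cutoff
```

Read this way, 3.9 is a cutoff on the model's output used to separate "predicted inactive" from "predicted active", which is a different thing from a threshold applied to measured MIC values in the dataset.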

@qm-intel

qm-intel commented Mar 8, 2023

@UnixJunkie Thanks for your reply,

> From the AMP literature, I would say that having a MIC value <= 32 ug/mL might be a reasonable threshold.
> Given the quality of public data for this problem (this is a meta dataset; aggregating values from many different experiments in many different labs, I think that treating

Could you please give the title of one of the papers that mention the MIC <= 32 ug/mL threshold?

Some of the literature also suggests MIC <= 25 ug/mL. But GRAMPA is on a uM scale. Please see my question above. In that case, what threshold should I choose?
Thanks
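
One way to bridge the two scales is to convert a ug/mL cutoff into uM per peptide, which needs each peptide's molecular weight. The sketch below is only an illustration under stated assumptions: the log is taken as base 10, the molecular weight of 2000 g/mol is a made-up example value, and the function names are hypothetical.

```python
import math


def ug_per_ml_to_um(conc_ug_per_ml: float, mol_weight_g_per_mol: float) -> float:
    """Convert a mass concentration (ug/mL) to a molar one (uM).

    ug/mL is the same as mg/L; dividing by g/mol gives mmol/L,
    and multiplying by 1000 gives umol/L (uM).
    """
    if mol_weight_g_per_mol <= 0:
        raise ValueError("molecular weight must be positive")
    return conc_ug_per_ml / mol_weight_g_per_mol * 1000.0


def is_active(log_mic_um: float, cutoff_ug_per_ml: float,
              mol_weight_g_per_mol: float) -> bool:
    """Label a peptide active if its log MIC is at or below the converted cutoff.

    Assumption: log_mic_um is log10 of the MIC in uM.
    """
    cutoff_um = ug_per_ml_to_um(cutoff_ug_per_ml, mol_weight_g_per_mol)
    return log_mic_um <= math.log10(cutoff_um)


# Hypothetical peptide of 2000 g/mol and the 32 ug/mL cutoff from the thread.
cutoff_um = ug_per_ml_to_um(32.0, 2000.0)
print(f"32 ug/mL at 2000 g/mol = {cutoff_um:.1f} uM")
print(is_active(1.0, 32.0, 2000.0), is_active(1.5, 32.0, 2000.0))
```

Because the conversion depends on molecular weight, a single ug/mL cutoff corresponds to a different uM (and log uM) cutoff for each peptide.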

@jswitten
Collaborator

jswitten commented Mar 8, 2023

What I did was declare all peptides in the dataset to be positives and generate negatives either by generating completely random peptides or by taking random peptides from UniProt; see Table 2, Table S3, and the related discussion. So the negatives were synthetically generated, and every peptide in the dataset is a positive, because every peptide in GRAMPA has been reported to be antimicrobial against something. In other words, the threshold I used was "in GRAMPA vs. not in GRAMPA".
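
As a rough sketch of the "completely random peptides" option described above (not the paper's actual procedure), one could draw residues uniformly from the 20 standard amino acids. Matching each negative's length to a positive's length, the uniform residue distribution, and all names below are assumptions of this example; sampling from UniProt is not shown.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_negatives(positives, seed=0):
    """Generate one random peptide per positive, with the same length.

    Assumptions: uniform residue frequencies and length matching.
    Candidates that happen to equal a known positive are dropped.
    """
    rng = random.Random(seed)
    known = set(positives)
    negatives = []
    for seq in positives:
        candidate = "".join(rng.choice(AMINO_ACIDS) for _ in seq)
        if candidate not in known:
            negatives.append(candidate)
    return negatives


# Illustrative sequences only, not taken from GRAMPA.
positives = ["KWKLFKKIEKVGQNIRDG", "GLFKKLLKKA"]
negatives = random_negatives(positives)

# Label scheme from the comment above: in the dataset -> 1, generated -> 0.
dataset = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
print(dataset)
```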

@qm-intel

qm-intel commented Mar 8, 2023

@jswitten Thank you for the clarification

@UnixJunkie
Contributor Author

Some authors from a US lab generate negatives by randomizing the order of the amino acids in the sequences of known actives. There is a rationale for this procedure: it destroys the hydrophobic moment of known actives, which means such peptides can no longer perturb the membrane of microbes (the assumed mode of action for many antimicrobial peptides).
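
The shuffling idea can be sketched as follows. This is a minimal illustration, not those authors' code: the function name, the retry limit, and the example sequence are all hypothetical. The shuffle keeps the amino acid composition and length but scrambles the order.

```python
import random


def shuffled_negative(sequence, rng, max_tries=20):
    """Return a shuffled copy of `sequence`, or None if no distinct one is found.

    Composition and length are preserved; only the residue order changes.
    Sequences made of a single repeated residue cannot be shuffled into
    anything new, so they yield None.
    """
    residues = list(sequence)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != sequence:
            return candidate
    return None


rng = random.Random(0)
active = "KWKLFKKIEKVGQNIRDG"  # illustrative sequence, not from GRAMPA
print(active, "->", shuffled_negative(active, rng))
```

A shuffled sequence is not guaranteed to be inactive, so negatives built this way are presumed rather than measured.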
