negative values in grampa.csv file #8

UnixJunkie · 2021-11-24T07:08:01Z

Some concentration values are negative.
I don't think this is possible, so there is a problem somewhere that introduced those negative values.

qm-intel · 2023-03-08T03:09:32Z

@UnixJunkie I have the same question. Did you find out why?
One possible answer is that the values are normalized.

@zswitten and @jswitten Thanks for your great contribution. Can you explain the MIC values? If I want to turn this into a binary classification problem (AMP vs. non-AMP), how should I decide on a threshold value?

UnixJunkie · 2023-03-08T03:14:58Z

I don't know exactly.
It is possible that some standardization procedure shifted the original values.

jswitten · 2023-03-08T03:17:49Z

It's log (MIC in uM), so any MIC < 1 uM will have a negative value.

Regarding the threshold value, it's totally arbitrary. There's no one absolutely correct way, except that I guess everything in the database is a positive in some sense. Because thresholding is inherently arbitrary, we used regression in our paper.
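To make the sign behaviour concrete, here is a minimal sketch of the conversion in both directions. The thread does not state the base of the logarithm, so base 10 is an assumption here, and the function names are hypothetical.

```python
import math

# Assumption: the logarithm is base 10. The thread only says "log (MIC in uM)".
ASSUMED_BASE = 10.0


def um_to_log_mic(mic_um, base=ASSUMED_BASE):
    """Convert a MIC in uM to its logarithm (negative whenever MIC < 1 uM)."""
    return math.log(mic_um) / math.log(base)


def log_mic_to_um(log_mic, base=ASSUMED_BASE):
    """Convert a log MIC value back to a concentration in uM."""
    return base ** log_mic


for mic in (0.5, 1.0, 32.0):
    print(mic, "uM -> log MIC", round(um_to_log_mic(mic), 3))
```

A MIC of 0.5 uM gives a negative value, 1 uM gives exactly 0, and anything above 1 uM is positive, which is why the negative entries do not indicate corrupted data.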

jswitten · 2023-03-08T03:19:22Z

We did convert to a classification problem in order to benchmark our results. I believe we used both totally random peptides and random peptides from UniProt as negatives, but I kind of forget; you can read our paper for details.

UnixJunkie · 2023-03-08T03:20:07Z

From the AMP literature, I would say that a MIC value <= 32 ug/mL might be a reasonable threshold.
Given the quality of public data for this problem (this is a meta dataset, aggregating values from many different experiments in many different labs), I think that treating the problem as classification is way safer than regression.

qm-intel · 2023-03-08T07:52:22Z

@jswitten Thanks for your reply,

I just plotted a histogram of the MIC values:

> It's log (MIC in uM) so any MIC < 1uM will have a negative value
>
> Regarding the threshold value, it's totally arbitrary, there's no one absolutely correct way except I guess everything in the database is a positive in some sense. But because thresholding is inherently arbitrary, we used regression in our paper

In the section of your paper entitled "Ensemble model", you wrote:

"The prediction was either very close to 4 (meaning, a predicted inactive peptide) or somewhere between -1 and 3.5 (meaning, a predicted active peptide). Therefore, for the purposes of classification (Section 3.3), instead of averaging over each of the ensemble model predictions, we had each model in the ensemble “vote.” If more than half of the models predicted log MIC > 3.9, we classified the peptide as inactive and predicted log MIC = 4. Otherwise, we classified the peptide as active and the predicted log MIC (used for generation of the ROC curves in place of a probabilistic prediction) was the average over all predictions that were <3.9."
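The voting rule in the quoted passage can be sketched as follows. This is only an illustration of the rule as quoted, not the authors' code: the function name is hypothetical, and the handling of the corner case where no prediction falls strictly below 3.9 is an assumption, since the quote does not cover it.

```python
def vote_log_mic(predictions, cutoff=3.9, inactive_value=4.0):
    """Return (is_active, predicted_log_mic) for one peptide.

    predictions: log MIC predictions, one per model in the ensemble.
    """
    n_inactive_votes = sum(1 for p in predictions if p > cutoff)
    # More than half of the models predict log MIC > cutoff: inactive.
    if n_inactive_votes > len(predictions) / 2:
        return False, inactive_value
    # Otherwise active: average only the predictions below the cutoff.
    active_predictions = [p for p in predictions if p < cutoff]
    if not active_predictions:
        # Assumption: corner case not described in the quote
        # (remaining predictions sit exactly at the cutoff).
        return True, cutoff
    return True, sum(active_predictions) / len(active_predictions)


print(vote_log_mic([4.0, 3.95, 1.0]))  # two of three models vote inactive
print(vote_log_mic([4.0, 1.0, 2.0]))   # only one of three votes inactive
```

Note that 3.9 in this rule separates the models' "inactive" output from their "active" outputs; it is applied to predictions, not to the measured values in the dataset.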

Is log MIC <= 3.5 (MIC in uM) your threshold for active peptides? In that case, the number of non-AMP (inactive peptide) samples for training becomes very small compared to the number of active AMPs (class imbalance).

Sorry again for the long question, but I could not find a clear answer in the literature, and your dataset is the only one I can use for my use case.

qm-intel · 2023-03-08T08:06:55Z

@UnixJunkie Thanks for your reply,

> From the AMP literature, I would say that having a MIC value <= 32 ug/mL might be a reasonable threshold.
> Given the quality of public data for this problem (this is a meta dataset; aggregating values from many different experiments in many different labs, I think that treating

Can you please give the title of one of the papers that mention the MIC <= 32 ug/mL threshold?

Some literature also suggests MIC <= 25 ug/mL. But GRAMPA uses uM as its scale. Please see my question above. In this case, what threshold should be chosen?
Thanks
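Because a ug/mL threshold and GRAMPA's uM scale differ by the molecular weight of each peptide, a mass-based cutoff cannot be turned into one single uM value for the whole dataset. A rough sketch of the per-peptide conversion is below. The residue masses are approximate average masses, the estimate ignores modifications such as C-terminal amidation, and all names are hypothetical.

```python
# Approximate average residue masses in g/mol (assumption: unmodified residues).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water per peptide for the free termini


def peptide_mass(sequence):
    """Estimate the molecular weight (g/mol) of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def ug_per_ml_to_um(conc_ug_per_ml, mol_weight):
    """Convert ug/mL to uM: (ug/mL) / (g/mol) gives mM, times 1000 gives uM."""
    return conc_ug_per_ml / mol_weight * 1000.0


# Example: a 32 ug/mL cutoff for a hypothetical peptide of 2000 g/mol.
print(ug_per_ml_to_um(32.0, 2000.0))  # 16.0 uM
```

A heavier peptide reaches 32 ug/mL at a lower molar concentration, so the equivalent uM cutoff has to be computed per sequence before comparing it with the values in the file.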

jswitten · 2023-03-08T12:21:05Z

What I did was declare all peptides in the dataset to be positives and generate negatives either by generating completely random peptides or by taking random peptides from UniProt; see Table 2, Table S3, and the related discussion. So the negatives were synthetically generated, and every peptide in the dataset is a positive because every peptide in GRAMPA has been reported to be antimicrobial against something. In other words, the threshold I used was "in GRAMPA vs. not in GRAMPA".
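For readers who want to try the "completely random peptides" variant, a minimal sketch is below. It is not the paper's code: drawing residues uniformly from the 20 standard amino acids and reusing the length distribution of the positives are both assumptions, and the function name is hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_negatives(positives, n, seed=0):
    """Generate n distinct random peptides that are absent from the positives.

    Assumptions: uniform residue frequencies, lengths sampled from the
    lengths of the positive sequences.
    """
    rng = random.Random(seed)
    known = set(positives)
    lengths = [len(seq) for seq in positives]
    negatives = set()
    while len(negatives) < n:
        length = rng.choice(lengths)
        candidate = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if candidate not in known:
            negatives.add(candidate)
    return sorted(negatives)


# Two example sequences standing in for the dataset's positives.
example_positives = ["GIGKFLHSAKKFGKAFVGEIMNS", "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"]
for seq in random_negatives(example_positives, 3):
    print(seq)
```

The UniProt variant would replace the random generator with fragments sampled from real protein sequences; that step is not shown here.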

qm-intel · 2023-03-08T13:28:55Z

@jswitten Thank you for the clarification

UnixJunkie · 2023-03-09T00:39:36Z

Some authors from a US lab generate negatives by randomizing the order of the amino acids in the sequences of known actives. There is a rationale for this procedure: it destroys the hydrophobic moment of known actives, which means such peptides can no longer perturb the membrane of microbes (which is the assumed mode of action for many antimicrobial peptides).
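A minimal sketch of that shuffling idea, assuming plain one-letter sequences; the function name and the retry limit are hypothetical. Shuffling keeps the amino acid composition and the length, and only changes the residue order.

```python
import random


def shuffled_decoy(sequence, rng, max_tries=10):
    """Return a random permutation of the residues of `sequence`.

    Retries a few times to avoid returning the original order; a sequence
    made of a single repeated residue cannot be changed by shuffling.
    """
    residues = list(sequence)
    decoy = sequence
    for _ in range(max_tries):
        rng.shuffle(residues)
        decoy = "".join(residues)
        if decoy != sequence:
            break
    return decoy


rng = random.Random(0)
active = "GIGKFLHSAKKFGKAFVGEIMNS"  # example sequence of a known active
print(shuffled_decoy(active, rng))
```

Whether a given shuffle really removes the amphipathic pattern is not guaranteed by the permutation itself, so checking the hydrophobic moment of each decoy would be a sensible extra step.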

zswitten mentioned this issue Mar 9, 2023

Converting the regression MIC value problem to a classification problem AMP vs. non-AMP. How? #14

Closed
