Prediction result only ranging from [0.4,0.5] #12

mingyanisa · 2022-02-21T09:19:13Z

Hi! I have tried to train the Leopard model only on the DNA sequence of the reference genome, removing the DNase-seq / delta-DNase signals from the input features. However, the predictions only range over [0.4, 0.5] and do not capture any peaks, while the AUPRC score is high. Has anyone experienced this issue?

yang-dongxu · 2022-06-08T14:16:33Z

Hello, I have run into the same problem. How did you solve it?

yuanfangguan · 2022-06-08T14:24:54Z

We will see if the lead author gives a different comment, but let me give my perspective here.

When you use the DNA sequence alone, that information is not supposed to tell whether a TF binds or not. Therefore, a good model should not give extremely large or small values, as there is not sufficient confidence.

I am surprised that the AUPRC is high; I would not expect that, since the baseline is so low due to the extremely limited number of positive examples. I think only the AUROC would be high in this case.

yang-dongxu · 2022-06-09T02:20:42Z

Sorry, I realized it was due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions on how to generate the training bigwig to the README. Thank you :>

Hongyang449 · 2022-06-09T05:08:30Z

Hi, the model based on DNA only (without DNase-seq) will not be very informative: for the same TF, it cannot distinguish different binding profiles in different cell types. As Yuanfang mentioned, the model will be more "conservative" in its predictions, and the values could be around 0.5. The key information from DNase-seq needed to generate high-confidence predictions is missing; that is also why, e.g., traditional motif-based models have many false-positive peaks.
The AUPRC/AUROC scores could be high even if the values only range over [0.4, 0.5]. This is because the AUPRC/AUROC scores are determined by the ranking of the predictions rather than their absolute values. For example, consider a simple task of predicting four positions, with predictions (A) = 0.1, 0.3, 0.9, 0.5 and predictions (B) = 0.48, 0.49, 0.51, 0.50. The two sets of predictions rank the positions identically, so (A) and (B) have the same AUPRC/AUROC scores.
For this specific TF-binding task, the percentage of binding sites (the AUPRC baseline) is very low, so the AUPRC scores are typically low.
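
To make the ranking argument concrete, here is a minimal sketch in plain Python that scores predictions (A) and (B) from the example above. The labels are not given in the thread, so `labels` below is a made-up assumption for illustration only; `auroc` and `average_precision` are small helper functions written for this sketch (average precision is used as a common estimate of AUPRC) and are not part of Leopard.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a correct pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive.

    Assumes no tied scores, which holds for the example below.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical ground truth for the four positions (an assumption, not from the thread).
labels = [0, 1, 1, 0]
pred_a = [0.1, 0.3, 0.9, 0.5]      # predictions (A): wide range
pred_b = [0.48, 0.49, 0.51, 0.50]  # predictions (B): squeezed into a narrow range

print(auroc(labels, pred_a), auroc(labels, pred_b))
print(average_precision(labels, pred_a), average_precision(labels, pred_b))
```

Both calls print the same pair of values for (A) and (B), because the two prediction vectors order the four positions identically. Any other choice of labels would give the same agreement between (A) and (B), although the scores themselves would change.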

Hongyang449 · 2022-06-09T05:09:07Z

> Sorry, I realized it was due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions on how to generate the training bigwig to the README. Thank you :>

To generate peak bigwig files, you usually need two steps: (1) call peaks using whatever software you prefer and (2) convert the peak files into bigwig format. Once I have the peak values, I convert them into bigwig using some in-house code. You can check out lines 54-72 in this code for reference. Thank you!
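
For readers without access to the in-house code, step (2) could be sketched roughly as follows. This is not the author's script: it assumes the peaks are in a BED-like text file (chromosome, start, end in the first three columns), that a binary track with value 1.0 inside peaks is what you want, and that the third-party `pyBigWig` package is installed. The function names `merge_peaks` and `write_peak_bigwig` and the `chrom_sizes` argument are hypothetical, chosen for this example.

```python
from collections import defaultdict


def merge_peaks(bed_lines):
    """Parse BED-like lines and merge overlapping peaks per chromosome."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or touching: extend the last span
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def write_peak_bigwig(merged, chrom_sizes, out_path):
    """Write merged peaks as a binary (1.0 inside peaks) bigwig track.

    chrom_sizes: dict mapping chromosome name to length, in the order the
    chromosomes should appear in the bigwig header (an assumption of this sketch).
    """
    import pyBigWig  # third-party; imported here so merge_peaks works without it

    bw = pyBigWig.open(out_path, "w")
    bw.addHeader(list(chrom_sizes.items()))
    for chrom in chrom_sizes:  # entries are added in header order
        spans = merged.get(chrom, [])
        if not spans:
            continue
        bw.addEntries(
            [chrom] * len(spans),
            [start for start, _ in spans],
            ends=[end for _, end in spans],
            values=[1.0] * len(spans),
        )
    bw.close()


# Tiny in-memory example of the merging step.
example = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t5\t10"]
print(merge_peaks(example))
```

Whether the training labels should be binary, and how regions outside peaks are expected to be encoded, depends on the Leopard training code, so check the referenced lines before relying on this.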
