Prediction result only ranging from [0.4,0.5] #12

Open
mingyanisa opened this issue Feb 21, 2022 · 5 comments

Comments

mingyanisa commented Feb 21, 2022

Hi! I tried training the Leopard model on only the DNA sequence of the reference genome, removing DNase-seq / delta DNase from the input features. However, the predictions only range over [0.4, 0.5] and do not capture any peaks, even though the AUPRC score is high. Has anyone else experienced this issue?

yang-dongxu commented Jun 8, 2022

Hello, I have run into the same problem. How did you solve it?

@yuanfangguan

We will see whether the lead author gives a different comment, but let me give my perspective here.

When you use the DNA sequence alone, that information is not supposed to be enough to tell whether a TF binds or not. Therefore, a good model should not give extremely large or small values, as there is not sufficient confidence.

I am surprised that the AUPRC is high; I would not expect it to be, as the baseline is so low due to the extremely limited number of positive examples. I think only the AUROC would be high in this case.

@yang-dongxu

Sorry, I realized it was due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions to the README on how to generate the bigwig for training. Thank you :>

@Hongyang449
Contributor

Hi, a model based on DNA only (without DNase-seq) will not be very informative: for the same TF, it cannot distinguish different binding profiles in different cell types. As Yuanfang mentioned, the model will be more "conservative" in its predictions, and the values could be around 0.5. The key information from DNase-seq needed to generate high-confidence predictions is missing. That is also why, for example, traditional motif-based models have many false-positive peaks.
The AUPRC/AUROC scores could be high even if the values only range over [0.4, 0.5]. This is because the AUPRC/AUROC scores are determined by the ranking of the predictions rather than their absolute values. For example, consider a simple task of predicting four positions, with predictions (A) = 0.1, 0.3, 0.9, 0.5 and predictions (B) = 0.48, 0.49, 0.51, 0.50. Predictions (A) and (B) rank the four positions identically, so they have the same AUPRC/AUROC scores.
For this specific TF-binding task, the percentage of binding sites (the AUPRC baseline) is very low, so AUPRC scores are typically low.
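
To make the ranking argument concrete, here is a minimal sketch in plain Python using the two prediction vectors above. The labels are hypothetical (the example in this comment does not specify them), the helper names `auroc` and `average_precision` are illustrative and not part of Leopard, and AUPRC is approximated here by average precision.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win. Only the ordering of the scores matters.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical ground truth for the four positions (an assumption for illustration).
labels = [0, 1, 1, 0]
pred_a = [0.1, 0.3, 0.9, 0.5]
pred_b = [0.48, 0.49, 0.51, 0.50]

print(auroc(labels, pred_a), auroc(labels, pred_b))
print(average_precision(labels, pred_a), average_precision(labels, pred_b))
```

Both vectors sort the four positions in the same order, so each metric comes out the same for (A) and (B), even though (B) never leaves the [0.48, 0.51] range.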

@Hongyang449
Contributor

> Sorry, I realized it was due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions to the README on how to generate the bigwig for training. Thank you :>

To generate peak bigwig files, you usually need two steps: (1) call peaks using any peak-calling software and (2) convert the peak files into bigwig format. Once I have the peak values, I convert them into bigwig using some in-house code. You can check out lines 54-72 in this code for reference. Thank you!
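
As a rough illustration of step (2), the sketch below turns a list of called peaks into bedGraph lines with a value of 1 inside each peak. This is not the in-house code referenced above: the function name `peaks_to_bedgraph`, the example peaks, and the choice of a binary value of 1 are all assumptions made for illustration.

```python
def peaks_to_bedgraph(peaks):
    """Merge overlapping peaks per chromosome and emit bedGraph lines with value 1.

    `peaks` is an iterable of (chrom, start, end) tuples, e.g. parsed from a
    BED-like peak file produced by a peak caller.
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))

    lines = []
    for chrom in sorted(by_chrom):
        merged = []
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Overlaps (or touches) the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        lines.extend(f"{chrom}\t{s}\t{e}\t1" for s, e in merged)
    return lines


# Hypothetical peaks, for illustration only.
example_peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
bedgraph_lines = peaks_to_bedgraph(example_peaks)
print("\n".join(bedgraph_lines))
```

The resulting sorted, non-overlapping bedGraph can then be converted to bigwig with a standard tool such as UCSC's `bedGraphToBigWig`, given a chromosome sizes file.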
