Improvements to biological variation #97
Comments
Hi @ilia-kats, thanks for the comment and your thoughts on the problem. Excess variability has been observed in various places, so it is a known weakness of the model. I have looked at it a couple of times and been unable to come up with a good fix, so anything you can add would be much appreciated. It sounds like you have a good understanding of this already, so the next bit is mostly for me to get my head around it (and for anyone else who follows this conversation). These are the (somewhat simplified) steps in the current Splat model:
I think what you are suggesting is that Step 4 should be performed before Step 3 (correct me if I'm wrong)? I think we considered this at the time and decided that it made sense to do the variability adjustment on the library-size-adjusted mean, but I can see how that could lead to excess variability. This should be a fairly easy change to make, but it would be nice to have some examples showing that it makes a difference. Would you be willing to help out with that?
Yes, that is exactly what I was saying. However, I have played around with that a little in the meantime, and after exchanging steps 3 and 4, the dispersions still don't match the real data. I was able to generate something realistic by additionally adjusting the trend function. To summarize, what I did (for the data set that I'm currently working on) was:
in
in |
Thanks for the update! I think this is about the point I have got to before. The BCV equation/estimation probably needs to be modified in some way, but I wasn't able to come up with anything that generalised to different datasets. If you come up with anything that seems to work, I would be very keen to see it!
I observe much higher dispersions in simulated data than in the underlying real data. As far as I understand, the underlying biological assumption is that each gene has an associated biological variation, which is captured by the negative binomial overdispersion parameter and is a decreasing function of the gene's average expression. This is also why the Splatter paper talks about gene-wise dispersion.
That would mean that the proper generative model would be to sample from the dispersion prior, and then scale it by the gene's expression. Expression here is the biological expression.
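To make the link between biological variation and the NB overdispersion concrete, here is a minimal sketch, assuming the usual Gamma-Poisson view of the negative binomial, in which the squared BCV plays the role of the dispersion. The function name `sample_cell_means` and the example values are hypothetical and not taken from Splatter. It only checks that a Gamma with shape `1 / bcv**2` and scale `gene_mean * bcv**2` has the requested mean and coefficient of variation; the Poisson sampling step is left out.

```python
import random
import statistics


def sample_cell_means(gene_mean, bcv, n_cells, rng):
    """Draw per-cell means for one gene from a Gamma distribution.

    Hypothetical helper: shape = 1 / bcv^2 and scale = gene_mean * bcv^2
    give a Gamma with mean `gene_mean` and coefficient of variation `bcv`.
    Poisson counts drawn from these means would be negative binomial
    with dispersion bcv^2.
    """
    shape = 1.0 / bcv ** 2
    scale = gene_mean * bcv ** 2
    return [rng.gammavariate(shape, scale) for _ in range(n_cells)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = sample_cell_means(gene_mean=10.0, bcv=0.5, n_cells=20000, rng=rng)

empirical_mean = statistics.fmean(draws)
empirical_cv = statistics.pstdev(draws) / empirical_mean
print(empirical_mean, empirical_cv)  # should land close to 10 and 0.5
```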
However, Splatter's model scales the sampled dispersion by gene expression times library size. I think this is where the additional dispersion comes from: the library size prior now also affects the NB dispersion parameter.
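A small deterministic sketch of the difference, assuming a mean-dependent BCV trend of the form `common_bcv + 1 / sqrt(mean)` (my understanding of the Splat trend, with the random chi-squared factor omitted). The names `bcv_trend`, `lib_scale`, and the numbers are hypothetical. It compares evaluating the trend on the library-size-adjusted mean with evaluating it on the gene mean alone, for one gene across three cells.

```python
import math


def bcv_trend(mean, common_bcv):
    """Assumed mean-dependent BCV trend (random chi-squared factor omitted)."""
    return common_bcv + 1.0 / math.sqrt(mean)


def bcv_on_adjusted_mean(gene_mean, lib_scale, common_bcv):
    # Current ordering as described above: the trend sees mean * library size.
    return bcv_trend(gene_mean * lib_scale, common_bcv)


def bcv_on_gene_mean(gene_mean, lib_scale, common_bcv):
    # Proposed ordering: the trend sees only the gene's own expression;
    # lib_scale is accepted but deliberately ignored here.
    return bcv_trend(gene_mean, common_bcv)


gene_mean = 4.0               # hypothetical gene-level mean
common_bcv = 0.1              # hypothetical common BCV
lib_scales = [0.25, 1.0, 4.0]  # hypothetical relative library sizes of three cells

adjusted = [bcv_on_adjusted_mean(gene_mean, s, common_bcv) for s in lib_scales]
gene_only = [bcv_on_gene_mean(gene_mean, s, common_bcv) for s in lib_scales]

print(adjusted)   # BCV changes from cell to cell with library size
print(gene_only)  # BCV is the same in every cell
```

With the trend applied to the adjusted mean, the same gene gets a different dispersion in each cell, so the spread of library sizes feeds into the overall dispersion. With the trend applied to the gene mean, the dispersion stays a property of the gene.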
My question now is: Am I completely misunderstanding how this is supposed to work? If not, I would be up for a pull request to do the library size scaling in a later step.
Originally posted by @ilia-kats in #71 (comment)