Expose number of samples for emitting denoised copy ratios in gCNV. #5754
Comments
Sampling method optimized in #5781.
Default changed from 250 -> 20 in #5699 and exposed in #7450. Unfortunately, I don't recall if we did any benchmarking to spot-check runtime/output changes, but I would expect 20 samples to be sufficient for estimating posterior means and standard deviations to the level needed for most downstream use cases.
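For intuition on how the number of posterior draws translates into noise on the reported statistics, here is a minimal, hypothetical sketch (not GATK code; the per-bin posterior is assumed Gaussian purely for illustration):

```python
import numpy as np

# Hypothetical illustration: how much does the estimated posterior mean and
# standard deviation of a single bin's denoised copy ratio wobble when it is
# computed from only num_samples posterior draws?
rng = np.random.default_rng(0)
true_mean, true_sd = 1.0, 0.1   # assumed per-bin posterior, for illustration only
num_trials = 10_000

for num_samples in (20, 100, 250):
    draws = rng.normal(true_mean, true_sd, size=(num_trials, num_samples))
    est_means = draws.mean(axis=1)
    est_sds = draws.std(axis=1, ddof=1)
    print(f"num_samples={num_samples:4d}  "
          f"SE of mean ~ {est_means.std():.4f}  "
          f"SE of sd ~ {est_sds.std():.4f}")

# The standard error of the mean scales as true_sd / sqrt(num_samples),
# i.e. ~0.022 at 20 draws vs. ~0.006 at 250 draws for true_sd = 0.1.
```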
@samuelklee ahem... aren't you on vacation?
A bit of work on vacation here and there usually means I can take more vacations later!
@samuelklee Can you briefly explain how the sampling is used in the dCR approximations and how they are used downstream? I would expect only 20 samples to give noisy estimates of mean and variance…
The only place I’ve used them is in the malaria genotyping method, where we calculate very simple annotations for each set of breakpoints—mostly summary statistics of the posterior means, like the minimum in a segment or the segment-level mean. We also calculate a single changepoint statistic using the posterior means. I didn’t see any indicators that posterior sampling noise was causing any issues—the only real issue I had was fixed in #7261. I think the initial reason we started emitting these is because Talkowski folks wanted to use them for plotting/visualization, in which case I would be even less worried about sampling noise. But maybe you all are using them for something else now? Even then, I'm not sure if anyone is using the posterior variances. If we ever move towards emitting this for somatic/mosaicism at the single-bin, percent level, then I might be a bit more worried—in which case, good for us for exposing it! But if you’d like to increase the default and it doesn’t seem to add significant runtime, go for it! Might also be good to get some understanding of the sampling noise level for typical data, if only for a few different parameter values.
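For context, here is a hypothetical sketch of the kind of segment-level annotation described above, computed from per-bin dCR posterior means. The column names, file layout, and function are assumptions for illustration, not the actual genotyping code:

```python
import numpy as np
import pandas as pd

def segment_annotations(dcr: pd.DataFrame, start: int, end: int) -> dict:
    """Summary statistics of per-bin dCR posterior means within one segment.

    Assumes a table with START/END/LINEAR_COPY_RATIO columns, where
    LINEAR_COPY_RATIO holds the per-bin posterior mean (assumed layout).
    """
    in_segment = (dcr["START"] >= start) & (dcr["END"] <= end)
    means = dcr.loc[in_segment, "LINEAR_COPY_RATIO"].to_numpy()
    if means.size == 0:
        raise ValueError("no bins fall inside the segment")
    return {
        "segment_mean": float(np.mean(means)),   # segment-level mean
        "segment_min": float(np.min(means)),     # minimum in the segment
        "num_bins": int(means.size),
    }
```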
Thanks for that explanation. I'd rather set it to a safer value like 100 as a default. @asmirnov239, do you have any sense of how much this affects runtime?
Looking at this again, plots in #5764 might be useful. Probably OK to bump things up since the rest of inference will take up the bulk of runtime. Although I think one other consideration might be the WDL tests, since they run slow on Travis and already push the time limit. Perhaps set a lower value for those, if necessary? In any case, please double check all of this!
Interesting. We fixed the memory spike due to sampling, correct? However, I wouldn't want to use more samples if it increases peak memory.
Yup! The sampling and calculation of means/stdevs are done online (I think @asmirnov239 added this), so it shouldn’t take additional memory to use more samples.
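A minimal sketch of one standard way to do this online accumulation (Welford's algorithm); the actual gcnvkernel implementation may differ, but it illustrates why peak memory stays flat as the number of posterior draws grows:

```python
import numpy as np

class OnlineMeanVariance:
    """Welford-style online accumulator: memory is O(num_bins) regardless of
    how many posterior draws are streamed through it. Illustrative sketch only."""

    def __init__(self, num_bins: int):
        self.count = 0
        self.mean = np.zeros(num_bins)
        self.m2 = np.zeros(num_bins)   # running sum of squared deviations

    def update(self, draw: np.ndarray) -> None:
        self.count += 1
        delta = draw - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (draw - self.mean)

    @property
    def std(self) -> np.ndarray:
        return np.sqrt(self.m2 / max(self.count - 1, 1))

# Usage: stream posterior draws one at a time without ever materializing
# the full (num_draws x num_bins) matrix.
acc = OnlineMeanVariance(num_bins=5)
rng = np.random.default_rng(0)
for _ in range(200):
    acc.update(rng.normal(1.0, 0.1, size=5))
print(acc.mean, acc.std)
```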
Thanks for making and documenting those plots, @asmirnov239! Does seem worthwhile to sample more if it makes no difference in the runtime. Just curious, what was the shard size? Slightly counterintuitive that the high end is more noisy, but I guess it must be due to sampling noise of low bias---I'd expect more competition from reduced noise due to higher counts. Mind sharing the num_samples = 20 and 200 dCR files for at least one sample so I can take a quick look? Might also be nice to see the posterior standard deviations, but if it's too much work to compile those it's fine. Perhaps we should just concatenate them anyway (I don't recall why we didn't originally add this in #5823, but maybe we had a reason). In any case, the noise doesn't look crazy in the CR regime that would've been important for the CNV genotyping stuff---phew!
The shard size was what we usually use for exomes: 12500. By the way, I put a cutoff on all sites with dCR > 50 for plotting these. It makes sense to increase the default to 200, so I changed it in PR #7450.
Thanks for that info and for sharing the files, @asmirnov239. I suspect that there are essentially two types of bins: "nice" and "not so nice". The sampling noise in the former is determined by Poisson observation noise, whereas that in the latter is determined by uncertainty in the bias posteriors. This is a bit hard to see in the plots above, even in a version where I tried to adjust the point size and alpha. However, plotting a measure of the difference in the dCRs (from 20 and 200 posterior samples) vs. the dCR is more suggestive, as are the dCR histograms.

I would guess that the nice spike around CR ~ 2 and the fatter base extending up to dCR ~ 100 are distinct populations of bins. So the punchline would be that differences at high dCR are probably just noise within the noise. For "nice" bins at dCR ~ a few, the sampling noise looks to be <1%. Not really sure what's going on at very high dCR, but I think it's safe to say that these are "not so nice" bins!

I've seen this pattern in other WES cohorts when plotting the posterior means vs. standard deviations for the biases; I tried to dig up the plots on Slack, but I can't find them at the moment. Perhaps something along those lines might be worth visualizing in your model-criticism notebooks, if you don't already? Again, it's hard to say this is indeed the case from the dCRs alone, but if so, it might be worth baking this sort of mixture into future versions of the model or coming up with other strategies to deal with such bins.
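A hedged sketch of how such a comparison could be reproduced from two dCR outputs; the file names and the LINEAR_COPY_RATIO column name are assumptions about the TSV layout:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed inputs: dCR TSVs emitted with 20 and 200 posterior draws for the
# same sample; SAM-style "@" header lines are skipped via comment="@".
dcr_20 = pd.read_csv("dcr_num_samples_20.tsv", sep="\t", comment="@")
dcr_200 = pd.read_csv("dcr_num_samples_200.tsv", sep="\t", comment="@")

x = dcr_200["LINEAR_COPY_RATIO"].to_numpy()
diff = dcr_20["LINEAR_COPY_RATIO"].to_numpy() - x
pos = x > 0

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Relative difference between the two estimates vs. the dCR itself.
ax1.scatter(x[pos], np.abs(diff[pos]) / x[pos], s=2, alpha=0.1)
ax1.set(xscale="log", yscale="log", xlabel="dCR (200 draws)",
        ylabel="|dCR_20 - dCR_200| / dCR")
# Histogram of the dCR values, to look for distinct populations of bins.
ax2.hist(np.log10(x[pos]), bins=200)
ax2.set(xlabel="log10 dCR", ylabel="number of bins")
plt.tight_layout()
plt.show()
```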
@samuelklee that's interesting, thanks for making these. I have only observed this behavior for 1kG samples so far. I can add the extra plot you mentioned to the plotting notebook. It would be good to characterize these bins and see if it is possible to devise an a priori filter.
Closed by #7450 |
This is set to an unexposed 250 samples. Since the sampling and posterior estimation are done online to prevent the entire samples x intervals x dCR-samples matrix from causing OOM, I think we suffer from a lack of vectorization.
However, the good news is that we can probably get by with far fewer samples (say ~20) if all we want are estimates of the posterior mean and standard deviation---I'm not even sure if the latter will be used by downstream analyses anytime soon.
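A back-of-the-envelope estimate of why materializing the full matrix is costly, using illustrative numbers (the 12,500-interval exome shard size mentioned above; the cohort size and draw count are assumptions):

```python
# Illustrative memory estimate, not a measurement: full
# samples x intervals x posterior-draws array vs. online mean/std accumulation.
num_cohort_samples = 100       # assumed cohort size
num_intervals = 12_500         # exome shard size mentioned in the thread
num_posterior_draws = 250      # the old unexposed default
bytes_per_float = 8

full_matrix_gb = (num_cohort_samples * num_intervals * num_posterior_draws
                  * bytes_per_float) / 1e9
online_gb = (num_cohort_samples * num_intervals * 2 * bytes_per_float) / 1e9
print(f"full matrix: ~{full_matrix_gb:.1f} GB, online mean+std: ~{online_gb:.2f} GB")
# ~2.5 GB per shard vs. ~0.02 GB under these assumptions, and the online cost
# does not grow with the number of posterior draws.
```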