GATK CNV on WGS. #2858
Reviving this. This will essentially be a major refactor/rewrite of CreatePanelOfNormals to make it scalable enough to handle WGS.
@vruano may also have some insight on the PCOV vs RAW issue.
FYI @sooheelee I pointed you at the docs recently, but realized they're slightly out of date. CreatePanelOfNormals currently expects proportional coverage, upon which it initially performs a series of filtering steps. The docs state that these steps are performed on integer counts, which is incorrect. The fact that filtering yields different post-filter coverage for the two types of input ultimately results in slightly different denoising. Not a big deal, but we should decide what the actual method should be doing and clarify in the code/docs.
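To make the distinction concrete, here is a tiny sketch with made-up numbers (not from any real sample) showing that filtering before vs. after the conversion to proportional coverage leaves slightly different matrices for the surviving targets:

```python
import numpy as np

# Hypothetical integer read counts (targets x samples); numbers are made up.
raw = np.array([[100., 120.],
                [  5.,   4.],   # low-coverage target that a filter might drop
                [ 95., 116.]])

# Proportional coverage: each sample's counts divided by that sample's total.
pcov = raw / raw.sum(axis=0, keepdims=True)

# Filter the low-coverage target before vs. after the conversion.
filter_then_scale = raw[[0, 2]] / raw[[0, 2]].sum(axis=0, keepdims=True)
scale_then_filter = pcov[[0, 2]]

# The surviving targets end up with different proportional-coverage values,
# which is why the two input types denoise slightly differently downstream.
print(filter_then_scale)
print(scale_then_filter)
```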
I'll keep an eye out for the "integer counts" passage in the docs and change this to reflect our recommendations.
To be specific, I'm referring to Sec. IIB in https://github.com/broadinstitute/gatk-protected/blob/master/docs/CNVs/CNV-methods.pdf.
Can someone explain this pseudoinverse business to me? We standardize the counts to give a matrix C, calculate the SVD to get the left-singular matrix U and the pseudoinverse of C, but then perform another SVD on U to get the pseudoinverse of U. Why do we need these pseudoinverses? @LeeTL1220 @davidbenjamin?
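For context, a minimal numpy sketch (with a made-up standardized matrix C) of how the pseudoinverse already falls out of the single SVD; if that matches what the code intends, the second SVD of U looks redundant, since U has orthonormal columns:

```python
import numpy as np

# Made-up standardized read-count matrix C (targets x samples).
rng = np.random.default_rng(0)
C = rng.standard_normal((1000, 20))

# One SVD gives both the left-singular matrix U and the pseudoinverse of C
# (assuming no singular values fall below pinv's cutoff).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_pinv = Vt.T @ np.diag(1.0 / s) @ U.T          # equals np.linalg.pinv(C)

# U has orthonormal columns, so its pseudoinverse is just its transpose;
# no second SVD should be needed.
U_pinv = U.T

assert np.allclose(C_pinv, np.linalg.pinv(C))
assert np.allclose(U_pinv @ U, np.eye(U.shape[1]))
```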
The memory footprint used by ReadCountCollectionUtils.parse is extremely large compared to the size of the file. I suspect there may be memory leaks in TableReader. Speed is also an issue; pandas is 4x faster (taking ~10s on my desktop) on a 300bp-bin WGS read-count file with ~11.5 million rows if all columns are read, and 10x faster if the Target name field (which is pretty much useless) is ignored. (Also, for some reason, the speed difference is much greater when running within my Ubuntu VM on my Broad laptop; what takes pandas ~15s to load takes ReadCountCollectionUtils.parse so long that I just end up killing it... not sure why that is.)
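For reference, the pandas call is roughly the following; the file path and the name of the target-name column are assumptions here, not the actual coverage-file schema:

```python
import pandas as pd

# Read the header first so we can drop the (mostly useless) target-name column;
# "name" is an assumed column name for the targets.
header = pd.read_csv("coverage.tsv", sep="\t", nrows=0).columns
usecols = [c for c in header if c.lower() != "name"]

# Skipping the name column avoids materializing ~11.5M strings and gives the
# additional speedup mentioned above; sample columns are parsed as floats.
counts = pd.read_csv("coverage.tsv", sep="\t", usecols=usecols)
```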
TableReader is a dumb, one-record-at-a-time reader, so it shouldn't suffer from memory leaks. In contrast, the parse() method uses an incremental "buffer" that accumulates the counts until the end, when the actual returned table is created... The reason for this is to keep the ReadCountCollection class immutable. So at some point you need at least twice the amount of memory compared to a solution that would simply use the returned object as the accumulator.
You might be right---and I think it's worse, in that the ReadCountCollection argument validation (for uniqueness of targets, etc.) also adds to the overhead.
By default we aim for correctness over efficiency... As for what you're saying about the target names being useless, it sounds like you should perhaps consider generalizing ReadCountCollection so that in some sub-implementations targets are implicit, based on their index in the collection; for tools that don't care about target names this operation would then go much faster. So RCC could be an interface rather than a concrete class: one child class would implement the current RCC behavior, whereas a new child class would implement a more efficient solution for whole-genome, fixed-size interval collections.
As for TableReader... we could make it a bit more efficient by reusing DataLine instances. Currently it creates one per input line, but the same instance could be reused by loading each new line's data onto it before each call. We are delegating to an external library to parse the lines into String[] arrays (one element per cell)... we could save on that by implementing it ourselves more efficiently, but of course that would take up someone's precious development time... In any case, I don't know what the gain would be, considering that these operations are done close to I/O, which typically should dominate the time cost. The DataLine reuse may save some memory churning and wouldn't take long to code.
First cut at a rewrite seems to be working fine and is much, much leaner.

Building a small PoN with 4 simulated normals with 5M bins each: CombineReadCounts took ~1 min, CreatePanelOfNormals (with no QC) took ~4.5 minutes (although ~1 minute of this is writing target weights, which I haven't added to the new version yet) and generated a 2.7GB PoN, and NormalizeSomaticReadCounts took ~8 minutes (~7.5 minutes of which was spent composing/writing results, thanks to overhead from ReadCountCollection). In comparison, the new CreateReadCountPanelOfNormals took ~1 minute (which includes combining read-count files, which takes ~30s of I/O) and generated a 520MB PoN, and DenoiseReadCounts took ~30s (~10s of which was composing/writing results, as we are still forced to generate two ReadCountCollections).

Resulting PTN and TN copy ratios were identical down to the 1E-16 level; the remaining differences are only due to removing the unnecessary pseudoinverse computation. Results after filtering and before SVD are identical, despite the code being rewritten from scratch to be more memory efficient (e.g., filtering is performed in place)---phew!

If we stored read counts as HDF5 instead of as plain text, this would make things much faster. Perhaps it would be best to make TSV an optional output of the new coverage-collection tool. As a bonus, it would then only take a little bit more code to allow previous PoNs to be provided via -I as an additional source of read counts.

Remaining TODO's:
Evaluation:
I created a panel of normals from 90 WGS TCGA samples with 250bp (~11.5M) bins, which took ~57 minutes total and produced an 11GB PoN (this file includes all of the input read counts---which take up 20GB as a combined TSV file and a whopping 63GB as individual TSV files---as well as the eigenvectors, filtering results, etc.). The run can be found in /dsde/working/slee/wgs-pon-test/tieout/no-gc. It completed successfully with -Xmx32G (in comparison, CreatePanelOfNormals crashed after 40 minutes with -Xmx128G). The runtime breakdown was as follows:
Clearly, making I/O faster and more space efficient is the highest priority. Luckily, it's also low-hanging fruit. The 8643028 x 30 matrix of eigenvectors takes <2 minutes to read from HDF5 when the WGS PoN is used in DenoiseReadCounts, which gives us a rough idea of how long it should take to read in the original ~11.5M x 90 counts from HDF5. Once #3349 is in, I think a ~15-minute single-core WGS PoN could easily be viable. I believe that a PoN on the order of this size will be all that is required for WGS denoising, if it is not already overkill. To go bigger by more than an order of magnitude, we'll have to go out of core, which will require more substantial changes to the code. But since the real culprit responsible for hypersegmentation is CBS, rather than insufficient denoising, I'd rather focus on finding a viable segmentation alternative.
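For a sense of scale, reading a matrix like that from HDF5 amounts to a single contiguous dataset read; a sketch with h5py (the file layout and dataset name are hypothetical, not the actual PoN format):

```python
import h5py

# Hypothetical HDF5 layout: one 2D double dataset of ~11.5M bins x 90 samples.
with h5py.File("wgs-read-counts.hdf5", "r") as f:
    counts = f["read_counts/values"][:]   # single contiguous read into a numpy array

print(counts.shape, "%.1f GB in memory" % (counts.nbytes / 1e9))
```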
With HDF5 read-count file input enabled by #3349, a WGS PoN with 39 samples x ~1M bins (i.e., 3kb) took <1 minute to build. Thanks @LeeTL1220!
No problem. What was the original timing, roughly?
Sorry, I did not see the previous comment.
Note the number of samples and bin size are different from that comment. |
@samuelklee I certainly agree about focusing on a CBS alternative.
I've implemented the Gaussian-kernel binary-segmentation algorithm from this paper: https://hal.inria.fr/hal-01413230/document

This method uses a low-rank approximation to the kernel to obtain an approximate segmentation with linear complexity in time and space. In practice, performance is actually quite impressive! The implementation is relatively straightforward, clocking in at ~100 lines of python. Time complexity is O(log(maximum number of segments) * number of data points) and space complexity is O(number of data points * dimension of the kernel approximation), which makes use for WGS feasible.

Segmentation of 10^6 simulated points with 100 segments takes about a minute and tends to recover segments accurately. Compare this with CBS, where segmentation of a WGS sample with ~700k points takes ~10 minutes---and note that these ~700k points are split up amongst ~20 chromosomes to start! There are a small number of parameters that can affect the segmentation, but we can probably find good defaults in practice.

What's also nice is that this method can find changepoints in moments of the distribution other than the mean, which means that it can straightforwardly be used for alternate-allele-fraction segmentation. For example, in simulated multimodal data, all segments were recovered even though every segment had zero mean. Replacing the SNP segmentation in ACNV (which performs expensive maximum-likelihood estimation of the allele-fraction model) with this method would give a significant speedup there. Joint segmentation is straightforward and is simply given by addition of the kernels; however, complete data is still required.

Given such a fast heuristic, I'm more amenable to augmenting this method with additional heuristics to clean up or improve the segmentation if necessary. We can also use it to initialize our more sophisticated HMM models. @LeeTL1220 @mbabadi @davidbenjamin I'd be interested to hear your thoughts, if you have any.
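For anyone curious, here is a rough, unoptimized sketch of the idea (an illustration only, not the actual ~100-line implementation): approximate the Gaussian kernel with a low-rank random-Fourier-feature map so that every segment cost reduces to prefix sums, then greedily binary-segment on the split with the largest cost decrease. Function names and defaults are placeholders, and model selection via a penalty (as in the paper) is omitted in favor of a fixed number of changepoints:

```python
import numpy as np

def gaussian_rff(x, dim=64, sigma=1.0, seed=0):
    """Low-rank random-Fourier-feature approximation of the Gaussian kernel
    k(x, y) = exp(-(x - y)^2 / (2 * sigma^2)) for 1-D data."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=dim)
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)
    return np.sqrt(2.0 / dim) * np.cos(np.outer(x, w) + b)    # shape (n, dim)

def _segment_cost(S, Q, a, b):
    """Approximate within-segment kernel cost of [a, b), from prefix sums."""
    seg_sum = S[b] - S[a]
    return (Q[b] - Q[a]) - seg_sum @ seg_sum / (b - a)

def _best_split(S, Q, a, b):
    """Split point in (a, b) giving the largest decrease in cost."""
    parent = _segment_cost(S, Q, a, b)
    best_gain, best_m = -np.inf, None
    for m in range(a + 1, b):
        gain = parent - _segment_cost(S, Q, a, m) - _segment_cost(S, Q, m, b)
        if gain > best_gain:
            best_gain, best_m = gain, m
    return best_gain, best_m

def kernel_binary_segmentation(x, max_changepoints, dim=64, sigma=1.0):
    """Greedy kernel binary segmentation of 1-D data; returns sorted changepoint indices."""
    phi = gaussian_rff(np.asarray(x, dtype=float), dim=dim, sigma=sigma)
    S = np.vstack([np.zeros(dim), np.cumsum(phi, axis=0)])             # prefix sums of phi
    Q = np.concatenate([[0.0], np.cumsum(np.sum(phi ** 2, axis=1))])   # prefix sums of ||phi||^2
    segments, changepoints = [(0, len(phi))], []
    for _ in range(max_changepoints):
        candidates = [(_best_split(S, Q, a, b), (a, b)) for a, b in segments if b - a > 1]
        if not candidates:
            break
        (gain, m), (a, b) = max(candidates, key=lambda c: c[0][0])
        segments.remove((a, b))
        segments += [(a, m), (m, b)]
        changepoints.append(m)
    return sorted(changepoints)

# Example: two zero-mean segments that differ only in variance are still separable.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 0.1, 500), rng.normal(0, 1.0, 500)])
    print(kernel_binary_segmentation(data, max_changepoints=1))  # expect a value near 500
```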
@samuelklee This looks really good and the paper is a great find. I am unclear on a couple of things.
@samuelklee this looks awesome!
I naively cythonized part of my python implementation to use memoryviews, resulting in a ~60x speedup! The simulated data with 100 segments and 1M points ran in 1.3 seconds!! THIS IS BANANAS!!!
Yes, I'd like to test out the new tools in the workflow. All of these changes seem major and I'm excited to see how the workflow performs. I'll stop by to chat with you. |
Thanks @samuelklee! |
TCGA WGS at 250bp (~8.5M copy ratios after filtering during the denoising step, from ~11.5M read counts; ~840k hets after filtering on total counts <30 and homozygosity, from ~5.5M allelic counts) took ~27 minutes to run in matched-normal mode. Results look very reasonable. You can see the result at /dsde/working/slee/CNV-1.5_TCGA_WGS/TCGA-05-4389-01A-01D-1931-08.matched-normal.modeled.png
Just noticed an updated version of the kernel-segmentation paper: https://hal.inria.fr/hal-01413230v2/document The original version referenced in this issue can be found at: https://hal.inria.fr/hal-01413230v1/document A couple of noteworthy changes include the form of the penalty function and the fact that they use an estimate of the variance to normalize the data and set the kernel variance to unity (they also symmetrize the BAF, which I guess is required for the latter). I don't think either change will make a huge difference to the final segmentation.
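If we wanted to mirror those two changes, a minimal sketch might look like the following (the estimator choice and function names are mine, not the paper's or the tool's):

```python
import numpy as np

def normalize_to_unit_noise(x):
    """Scale data by a robust noise-sd estimate (from first differences) so the
    kernel variance can be fixed at 1; this is one common estimator, not necessarily the paper's."""
    x = np.asarray(x, dtype=float)
    sigma = np.median(np.abs(np.diff(x))) / (0.6745 * np.sqrt(2.0))
    return x / sigma if sigma > 0 else x

def symmetrize_baf(baf):
    """Fold B-allele fractions about 0.5 so both alleles map to the same value."""
    baf = np.asarray(baf, dtype=float)
    return 0.5 - np.abs(baf - 0.5)
```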
@LeeTL1220 commented on Mon May 23 2016
Need to show some measure of performance for the WGS evaluation (GATK CNV only, not ACNV).