Understand Hail GWAS regression implementation #448
In sgkit, projecting out the covariates from the genotypes (as opposed to just the phenotypes, say) is an expensive operation, since it involves computing an outer matmul whose output is of shape (variants, samples). Note that this operation is at the heart of the performance issues observed in #390. (There is a mitigation in #454, but it's still an expensive operation.) For the UKB GWAS run on Hail, there are a couple of linear regression implementations: in Hail 0.1 there is linreg3, and in Hail 0.2 there is linear_regression_rows. It's not clear to me whether either of these projects out covariates from the genotypes. I think it's important to understand whether they do, so we have a like-for-like comparison of the performance and cost of running GWAS using sgkit and Hail.
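For concreteness, here is a minimal NumPy sketch of the projection step being described. All names and array sizes are illustrative, not sgkit's actual code; the point is that the residualization produces a full (variants, samples)-shaped intermediate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples, n_covariates = 1000, 500, 10

G = rng.integers(0, 3, size=(n_variants, n_samples)).astype(np.float64)  # genotypes
X = rng.normal(size=(n_samples, n_covariates))                           # covariates

# Orthonormal basis for the covariate column space
Q, _ = np.linalg.qr(X)

# Project the covariates out of the genotypes. The outer matmul
# (G @ Q) @ Q.T produces an array of shape (variants, samples),
# which is the expensive operation discussed above.
G_resid = G - (G @ Q) @ Q.T

# The residualized genotypes are orthogonal to every covariate column
assert np.allclose(G_resid @ X, 0.0)
```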
You've likely already seen it, but I believe
Exactly. For the purposes of comparison I think we should have an implementation of the standard linear regression in sgkit. I'd appreciate some help writing it though, as I'm not sure how to express it using NumPy/Dask array operations. |
What makes you say that @tomwhite? There are no random effects in it and it's equivalent to ordinary least-squares/fixed effects models, e.g. the tests are against statsmodels.OLS.
IIRC, lots of other GWAS modeling methods do the covariate projection (e.g. REGENIE and Fast-LMM in addition to BOLT) and it's often used as both a performance enhancement and a way to simplify the code/algebra for the per-variant regressions. It wouldn't surprise me though if, in general, it's an improvement for single-server software but not in a distributed system. I remember being worried at first that it was doing something super-linear in
I don't think Hail does it; at least I've never seen anything obvious in the code for it. In theory it wouldn't be difficult to broadcast/repeat the covariates for all the individual regressions instead, though that would mean introducing another dimension into the matrix algebra. It might be worth the effort regardless. I am, though, at a loss for an intuition on why it would be substantially better that way (if not worse). FYI, the REGENIE Spark (but not C++) code avoids covariate projection based on projectglow/glow#266 (comment), which wasn't a very convincing reason IMO since every other part of the method operates on out-of-core chunks anyhow. Maybe there is some wisdom in neither Hail nor Glow doing the projection, though. |
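To illustrate what "broadcast the covariates and add a dimension" might look like, here is a batched NumPy sketch that builds one design matrix per variant and solves the normal equations for all variants at once. Names and sizes are made up, and solving the normal equations directly is less numerically robust than a QR-based fit; this is a sketch of the shape of the computation, not a recommended implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n_variants, n_samples, n_cov = 20, 100, 4
G = rng.normal(size=(n_variants, n_samples))
X = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, n_cov))])
y = rng.normal(size=n_samples)

# One design matrix per variant: shape (variants, samples, 1 + covariates)
A = np.concatenate(
    [G[:, :, None], np.broadcast_to(X, (n_variants,) + X.shape)], axis=2
)

# Batched normal equations: solve (A^T A) b = A^T y for every variant
AtA = np.einsum("vsi,vsj->vij", A, A)
Aty = np.einsum("vsi,s->vi", A, y)
betas = np.linalg.solve(AtA, Aty)[:, 0]  # genotype coefficient per variant

# Matches an explicit per-variant least-squares fit
ref, *_ = np.linalg.lstsq(np.column_stack([G[0], X]), y, rcond=None)
assert np.allclose(betas[0], ref[0])
```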
Ah, sorry I was mistaken.
I tried another experiment where I used Dask. On 8x data the processing time on a 16-node cluster was 77s, compared to 110s for the equivalent run in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568. This is a 1.4x speedup. Translating this into normalized numbers (using https://github.com/pystatgen/sgkit/issues/390#issuecomment-768380382):
This is a ~6x speedup from the original, and if we could use preemptible instances to get a ~5x cost saving, I think that would put us in the same cost ballpark as Hail. Ideally Dask would do this kind of optimization for us so we didn't have to resort to it ourselves. Here's the notebook I used, and the performance report. |
That's amazing @tomwhite!
Do you know what happens using the original code without chunking in the samples dimension? |
Thanks @eric-czech.
No, I haven't tried it, but given that the |
@tomwhite I would be +1 to trying "original code without chunking" since that is essentially the case from the suggested "optimisation flow" in https://github.com/pystatgen/sgkit/issues/390#issuecomment-764950336:
I tried with `chunks: {variant: 64, sample: -1}` and it was a bit faster than
Here's the notebook and performance report.
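For reference, a chunk spec like the one above can be expressed directly on a Dask array. This is a toy sketch with made-up sizes; in sgkit the chunking would normally be applied to the dataset's genotype variable:

```python
import dask.array as da

# Toy genotype matrix of shape (variants, samples); sizes are made up
gt = da.zeros((1024, 512), chunks=(128, 128))

# Rechunk to 64 variants per block and no chunking in the samples
# dimension (-1 means a single chunk along that axis)
gt = gt.rechunk({0: 64, 1: -1})

assert gt.chunks == ((64,) * 16, (512,))
```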
I updated the benchmarking code to use Dask 2021.3.1 (it was previously using 2.30.0) and I got the same result as before. This version of Dask has the improved matmul implementation (dask/dask#7000), and dask-cloudprovider has had some changes that required alterations to the benchmarking code. See tomwhite/gwas-benchmark@2931599. |
- sgkit cost estimate on UKB data: Estimate cost of GWAS regression steps (related-sciences/ukb-gwas-pipeline-nealelab#32)
- Hail cost estimate on UKB data: Computing resources necessary for GWAS (Nealelab/UK_Biobank_GWAS#37)

We'd like to get our costs closer to Hail's costs. To do so, it would be helpful to understand the Hail implementation and see if there are any ideas in their implementation that we might reuse in ours.