Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understand Hail GWAS regression implementation #448

Closed
hammer opened this issue Jan 28, 2021 · 11 comments
Closed

Understand Hail GWAS regression implementation #448

hammer opened this issue Jan 28, 2021 · 11 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. performance

Comments

@hammer
Copy link
Contributor

hammer commented Jan 28, 2021

We'd like to get our costs closer to Hail's costs. To do so, it would be helpful to understand the Hail implementation and see if there are any ideas in their implementation that we might reuse in ours.

@tomwhite
Copy link
Collaborator

tomwhite commented Feb 3, 2021

In sgkit, the gwas_linear_regression function "[removes] sample (i.e. person/individual) covariates through orthogonal projection of both the genetic variant and phenotype data". It cites the BOLT-LMM paper as the source for doing this, where in the Supplementary Information it states "We model covariates by projecting them out from both genotypes and phenotypes, which is equivalent to including them as fixed effects."

Projecting out the covariates from the genotypes (as opposed to just the phenotypes, say) is an expensive operation, since it involves computing an outer matmul whose output is of shape (variants, samples).

https://github.com/pystatgen/sgkit/blob/c210fae783e927a45d5b852105cf42d9b47a32e4/sgkit/stats/association.py#L74

Note that this operation is at the heart of the performance issues observed in #390. (There is a mitigation in #454, but it's still an expensive operation.)

For the UKB GWAS run on Hail, there are a couple of linear regression implementations. In Hail 0.1 there is linreg3, and in Hail 0.2 there is linear_regression_rows. It's not clear to me if either of these projects out covariates from the genotypes. I think it's important to understand if they do or not, so we have a like-for-like comparison in the performance and cost of running GWAS using sgkit and Hail.

@hammer
Copy link
Contributor Author

hammer commented Feb 3, 2021

You've likely already seen it, but I believe linear_regression_rows is the right place to start given Nealelab/UK_Biobank_GWAS#37 (comment):

One key development was an early version of https://hail.is/docs/0.2/methods/stats.html#hail.methods.linear_regression_rows that we used.

@tomwhite
Copy link
Collaborator

tomwhite commented Feb 8, 2021

Exactly - Hail's linear_regression_rows uses standard least-squares linear regression, whereas sgkit's gwas_linear_regression uses a mixed model, which is computationally more expensive.

For the purposes of comparison I think we should have an implementation of the standard linear regression in sgkit. I'd appreciate some help writing it though, as I'm not sure how to express it using NumPy/Dask array operations.

@tomwhite tomwhite added core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. performance labels Feb 9, 2021
@eric-czech
Copy link
Collaborator

sgkit's gwas_linear_regression uses a mixed model

What makes you say that @tomwhite? There are no random effects in it and it's equivalent to ordinary least-squares/fixed effects models, e.g. the tests are against statsmodels.OLS.

It cites the BOLT-LMM paper as the source for doing this

IIRC, lots of other GWAS modeling methods do the covariate projection (e.g. REGENIE and Fast-LMM in addition to BOLT) and it's often used as both a performance enhancement and a way to simplify the code/algebra for the per-variant regressions. It wouldn't surprise me though if, in general, it's an improvement for single-server software but not in a distributed system. I remember being worried at first that it was doing something super-linear in XL (the variants x samples matrix) and then thinking that it wouldn't be a problem based on dask/array/linalg.py#L1404-L1407 (a QR factorization is run on only the much smaller XC). Do you have any sense of which part of that might be blowing things up? Apologies if I missed that somewhere since I'm only partially caught up.

It's not clear to me if either of these projects out covariates from the genotypes. I think it's important to understand if they do or not, so we have a like-for-like comparison in the performance

I don't think Hail does it, or I've never seen anything obvious in the code for it. In theory it wouldn't be difficult to broadcast/repeat the covariates for all the individual regressions instead, but that should mean introducing another dimension into the matrix algebra. It might be worth the effort regardless. I am though at a loss for an intuition on why it would be substantially better that way (if not worse).

FYI the REGENIE spark (but not c++) code avoids covariate projection based on projectglow/glow#266 (comment), which wasn't a very convincing reason IMO since every other part of the method operates based on out-of-core chunks anyhow. Maybe there is some wisdom to neither Hail nor Glow doing the projection though.

@tomwhite
Copy link
Collaborator

tomwhite commented Feb 9, 2021

There are no random effects in it and it's equivalent to ordinary least-squares/fixed effects models, e.g. the tests are against statsmodels.OLS.

Ah, sorry I was mistaken.

I remember being worried at first that it was doing something super-linear in XL (the variants x samples matrix) and then thinking that it wouldn't be a problem based on dask/array/linalg.py#L1404-L1407 (a QR factorization is run on only the much smaller XC). Do you have any sense of which part of that might be blowing things up?

XC @ da.linalg.lstsq(XC, XL)[0] is large: it has shape (variants, samples), whereas the two inputs to the matmul are both small. So there's a lot of communication. I wonder if it's possible to broadcast the covariates (so they are only sent to each worker once), and combine them with each block in XL.

@tomwhite
Copy link
Collaborator

I tried another experiment, where I used Dask map_blocks to independently process each block of variants. This is akin to what Hail does (except I'm still doing covariate projection as discussed above). It's important that the array is not chunked in the samples dimension, which means that the chunk size in the variants dimension has to be quite small. I used 64 to give ~100MB chunks.

On 8x data the processing time on a 16 node cluster was 77s, compared to 110s from the equivalent run in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568. This is a 1.4x speedup.

Translating this into normalized numbers (using https://github.com/pystatgen/sgkit/issues/390#issuecomment-768380382):

  • Original implementation: 150 s * 960 cores / 141910 variants / phenotype = 1.01 core seconds/variant/phenotype
  • Improved chunking: 185 s * 192 cores / 141910 variants / phenotype = 0.25 core seconds/variant/phenotype
  • map_blocks: 77 s * 128 cores / 67552 variants / phenotype = 0.15 core seconds/variant/phenotype

This is a ~6x speedup from the original, and if we could use preemptible instances to get a ~5x cost saving, I think that would put us in the same cost ballpark as Hail.

Ideally Dask would do this kind of optimization for us so we didn't have to resort to map_blocks, but it's good to know that this is a technique we can fall back to if needed.

Here's the notebook I used, and the performance report.

@eric-czech
Copy link
Collaborator

That's amazing @tomwhite!

It's important that the array is not chunked in the samples dimension

Do you know what happens using the original code without chunking in the samples dimension (instead of map_blocks)?
Would the chunking mismatch of https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568 still apply as a limitation?

@tomwhite
Copy link
Collaborator

Thanks @eric-czech.

Do you know what happens using the original code without chunking in the samples dimension (instead of map_blocks)?

No, I haven't tried that.

I haven't tried it, but given that the map_blocks approach is embarrassingly parallel, it's more likely to be robust when running with preemptible instances.

@ravwojdyla
Copy link
Collaborator

ravwojdyla commented Feb 17, 2021

@tomwhite I would be +1 to trying "original code without chunking" since that is essentially the case from the suggested "optimisation flow" in https://github.com/pystatgen/sgkit/issues/390#issuecomment-764950336:

To be more concrete, we saw benefit of going from 5216 -> 652 in the variant axis, specifically no spilling (less memory overhead), and increasing the size of samples should reduce the communication/transfer time. So overall to be more precise, I believe we should try to:

  • increase the chunk size in the sample axis as much as practical
  • and if we start spilling reduce the chunk size in the variant axis
  • if we need to increase WT/Memory ratio previous comments describe that
  • so a "degenerated" case would be chunking: {variant: 1, sample: MAX}.

@tomwhite
Copy link
Collaborator

I tried with chunks: {variant: 64, sample: -1} and it was a bit faster than map_blocks on this dataset:

  • original code without chunking in the samples dimension: 68 s * 128 cores / 67552 variants / phenotype = 0.13 core seconds/variant/phenotype

Here's the notebook and performance report.

@tomwhite
Copy link
Collaborator

I updated the benchmarking code to use Dask 2021.3.1 (it was previously using 2.30.0) and I got the same result as before for gwas_simulation_unchunked_sample.ipynb (70s).

This version of Dask has the improved matmul implementation (dask/dask#7000), and dask-cloudprovider has had some changes that resulted in alterations. See tomwhite/gwas-benchmark@2931599.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. performance
Projects
None yet
Development

No branches or pull requests

4 participants