@henrydavidge
Contributor

From @karenfeng:

What changes are proposed in this pull request?

When the user does not know which alpha values to provide (e.g., based on heritability estimates), we should support generating them automatically. The generated values do not work well when the phenotypes are not on a scale of 1.
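
As a minimal sketch of what automatic generation could look like (not necessarily the code in this PR): assume each candidate heritability value h2 maps to a ridge penalty alpha = num_features / h2. The function names, the default heritability grid, and the formula itself are illustrative assumptions. The formula presumes phenotypes scaled to unit variance, which matches the caveat about scale, so a hypothetical standardization helper is included.

```python
import numpy as np


def generate_alphas_sketch(num_features, heritabilities=(0.99, 0.75, 0.50, 0.25, 0.01)):
    """Hypothetical helper: map assumed heritability values to ridge alphas."""
    h2 = np.asarray(heritabilities, dtype=float)
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    if np.any((h2 <= 0.0) | (h2 > 1.0)):
        raise ValueError("heritabilities must lie in (0, 1]")
    # Assumption: alpha = num_features / h2, so lower heritability
    # yields a stronger penalty.
    return num_features / h2


def standardize(phenotypes):
    """Center each phenotype column and scale it to unit variance."""
    phenotypes = np.asarray(phenotypes, dtype=float)
    return (phenotypes - phenotypes.mean(axis=0)) / phenotypes.std(axis=0)
```

Standardizing phenotypes before fitting is one way to keep generated alphas of this kind on a sensible scale.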

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

henrydavidge and others added 30 commits May 15, 2020 09:58
…or WGR (projectglow#2)

* blocks

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test vcf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* transformer

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test files

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra file

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
* feat: ridge models for wgr added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply().  These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.
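
A rough illustration of the idea, assuming a label DataFrame indexed by sample ID and a plain dict from sample block to sample IDs. The helper name and the toy data are hypothetical and are not the actual map_normal_eqn or score_models code.

```python
import pandas as pd


def labels_for_sample_block(labeldf, sample_blocks, sample_block):
    """Select label rows for one sample block using an external ID mapping."""
    sample_ids = sample_blocks[sample_block]
    # labeldf is assumed to be indexed by sample ID; .loc returns rows in
    # the order given by the mapping, so no join on sample IDs is needed.
    return labeldf.loc[sample_ids]


# Toy inputs (hypothetical): two sample blocks covering three samples.
sample_blocks = {"1": ["s1", "s2"], "2": ["s3"]}
labeldf = pd.DataFrame({"pheno": [0.1, -0.4, 1.2]}, index=["s1", "s2", "s3"])
block_labels = labels_for_sample_block(labeldf, sample_blocks, "1")
```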

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'
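
To make the two terms concrete, here is a small sketch that tiles a samples-by-headers matrix into blocks keyed by (sample_block, header_block). The function name and the even-split strategy are assumptions for illustration only, not the blocking logic used in this PR.

```python
import numpy as np


def split_into_blocks(matrix, n_sample_blocks, n_header_blocks):
    """Return {(sample_block, header_block): sub-matrix} for a 2-D matrix."""
    blocks = {}
    # Rows are samples, so row groups correspond to sample blocks.
    row_groups = np.array_split(np.arange(matrix.shape[0]), n_sample_blocks)
    # Columns are headers (features), so column groups are header blocks.
    col_groups = np.array_split(np.arange(matrix.shape[1]), n_header_blocks)
    for s, rows in enumerate(row_groups):
        for h, cols in enumerate(col_groups):
            blocks[(s, h)] = matrix[np.ix_(rows, cols)]
    return blocks
```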

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>
…rojectglow#6)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't use toPandas()

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Upgrade pyarrow

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Only register once

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Minimize memory usage

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Select before head

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set up/tear down

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* No teardown

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <henrydavidge@gmail.com>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Set arrow env var in build

Signed-off-by: Henry D <henrydavidge@gmail.com>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <karen.feng@databricks.com>
* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf transform

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Set driver memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try changing spark mem

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* match java tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>
…ctglow#11)

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* simplify tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* index map compat

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add more tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pass args as ints

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't roll our own splitter

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed the form y ~ Xb; in general, however, we would like to use a model of the form y ~ Ca + Xb, where C is a matrix of covariates that are separate from the genomic features X.  This PR makes numerous changes to accommodate the covariate matrix C.
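
For intuition, the model y ~ Ca + Xb can be sketched as a single ridge solve in which only the genomic coefficients b are penalized. Leaving the covariate coefficients a unpenalized is an assumption made for this sketch, and the function name is hypothetical; this is not the distributed implementation in this PR.

```python
import numpy as np


def ridge_with_covariates(C, X, y, alpha):
    """Fit y ~ C a + X b, applying the ridge penalty alpha to b only."""
    n_cov = C.shape[1]
    Z = np.hstack([C, X])
    # Assumption: zero penalty on covariate columns, alpha on feature columns.
    penalty = np.zeros(Z.shape[1])
    penalty[n_cov:] = alpha
    # Solve the penalized normal equations (Z'Z + diag(penalty)) w = Z'y.
    coef = np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)
    return coef[:n_cov], coef[n_cov:]
```

With alpha set to 0 this reduces to ordinary least squares on the stacked matrix [C, X], which gives a quick sanity check on small synthetic data.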

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Order to match labeldf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check we tie-break

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test var name

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
karenfeng and others added 6 commits June 21, 2020 20:55
Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Rename levels to wgr

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* headers

* executable

* fix template rendering

* yapf
@henrydavidge henrydavidge merged commit 0c7c637 into projectglow:master Jun 22, 2020