
Slow Databricks GLOW implementation #321

Open
bbsun92 opened this issue Jan 13, 2021 · 5 comments

Comments

@bbsun92 commented Jan 13, 2021

I am trying to run GLOW WGR following the example step by step, but with my own data (~300k variants in ~350k individuals, ~450 binary phenotypes).

After writing the Delta files and reading them back, the reducer.fit_transform call takes around 3 days:
```python
myreducer = RidgeReducer()
model_df = myreducer.fit_transform(block_df, label_df, sample_blocks, covariate_df)

model_df.cache()

estimator = LogisticRegression()
mymodel_df, cv_df = estimator.fit(model_df,
                                  label_df,
                                  sample_blocks,
                                  covariate_df)
```

The estimator.fit step took more than 8 days and failed with:

FatalError: SparkException: Job aborted due to stage failure: Task 89 in stage 23.0 failed 4 times, most recent failure: Lost task 89.3 in stage 23.0 (TID 116943, 10.248.234.190, executor 87): ExecutorLostFailure (executor 87 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

I was using the 7.4 Genomics Runtime on large clusters with an i3.8xlarge driver and 4-22 workers.

This seems far too long given the speeds reported in the bioRxiv paper.

Help on this would be great. Thanks.

@henrydavidge (Contributor)

@bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?

Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.
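
For illustration, here is a rough sketch of that kind of phenotype batching, assuming label_df is a pandas DataFrame with one binary phenotype per column; the Delta paths and batch size below are placeholders, not part of the Glow API:

```python
# Rough sketch only: split the phenotypes into batches and run the WGR stages per batch.
batch_size = 20  # 15-20 phenotypes per run, as suggested above

phenotype_batches = [
    label_df.iloc[:, i:i + batch_size]
    for i in range(0, label_df.shape[1], batch_size)
]

for idx, label_batch in enumerate(phenotype_batches):
    reducer = RidgeReducer()
    reduced_df = reducer.fit_transform(block_df, label_batch, sample_blocks, covariate_df)

    # Writing the reduced block matrix to Delta (rather than only caching it) keeps stage 2
    # from recomputing stage 1 and avoids holding everything in executor memory.
    path = f'dbfs:/tmp/wgr/reduced_batch_{idx}'  # placeholder path
    reduced_df.write.format('delta').mode('overwrite').save(path)
    reduced_df = spark.read.format('delta').load(path)

    estimator = LogisticRegression()
    model_batch_df, cv_batch_df = estimator.fit(reduced_df, label_batch, sample_blocks, covariate_df)
```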

@bbsun92 (Author) commented Jan 13, 2021

Thanks

> @bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?
>
> Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.

Thanks, I think I tried the same with a smaller number of phenotypes and it was still very slow. The input data is the genetic data in Delta table format, as shown in the GLOW example code. I don't think the intermediate block matrix is written out as another set of Delta tables. Also, do you have a benchmark for how long stage 1 should take? Either way, I gather it shouldn't be taking 10 days given the large number of nodes being used?

@karenfeng (Collaborator)

Hi @bbsun92! I performed similar scale testing following PR #282 (this is included in Genomics Runtime 7.4). In my testing setup, I used the following:

  • Cluster with a r5d.12xlarge driver and 6 r5d.12xlarge workers
  • Dataset with 500K samples, 500K variants, 25 phenotypes, 16 covariates, and 5 alphas

The runtime was 15.85 minutes for stage 1 (writing the blocked data to a Delta table) and 85.8 minutes for stage 2.
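
For context on what stage 1 looks like, here is a minimal sketch of blocking the genotype matrix and writing it to Delta. It assumes Glow's WGR helpers get_sample_ids and block_variants_and_samples are available (the import path can differ between Glow versions), and the paths and block sizes below are placeholders:

```python
from glow.wgr import block_variants_and_samples, get_sample_ids  # import path may differ by Glow version

# Glow is pre-registered on the Databricks Genomics Runtime; otherwise call glow.register(spark) first.
# variant_df is assumed to already contain the per-sample genotype values column expected by GloWGR.
variant_df = spark.read.format('delta').load('dbfs:/mnt/genotypes.delta')  # placeholder path

sample_ids = get_sample_ids(variant_df)
block_df, sample_blocks = block_variants_and_samples(
    variant_df,
    sample_ids,
    variants_per_block=1000,   # placeholder block sizes
    sample_block_count=10,
)

# Stage 1: persist the blocked matrix so stage 2 reads it back from Delta instead of recomputing it.
block_df.write.format('delta').mode('overwrite').save('dbfs:/mnt/wgr/block_matrix.delta')
```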

@bbsun92 (Author) commented Feb 19, 2021

Thanks Karen. Do you have a notebook or a list of the exact steps that were taken, so I can try to reproduce this? Especially for stage one. Thanks!

@williambrandler (Contributor)

Hey @bbsun92, this repo contains all the latest notebooks under glow/docs/source/_static/notebooks.
