
Slow Databricks GLOW implementation #321

Open
bbsun92 opened this issue Jan 13, 2021 · 5 comments

Comments

@bbsun92 commented Jan 13, 2021

I am trying to run GLOW WGR following the example step by step, but with my own data (~300k variants in ~350k individuals, ~450 binary phenotypes).

After writing the Delta files and reading them back, the reducer.fit_transform call takes around 3 days:
```python
myreducer = RidgeReducer()
model_df = myreducer.fit_transform(block_df, label_df, sample_blocks, covariate_df)

model_df.cache()

estimator = LogisticRegression()
mymodel_df, cv_df = estimator.fit(model_df,
                                  label_df,
                                  sample_blocks,
                                  covariate_df)
```

The estimator.fit step took more than 8 days and failed with:

FatalError: SparkException: Job aborted due to stage failure: Task 89 in stage 23.0 failed 4 times, most recent failure: Lost task 89.3 in stage 23.0 (TID 116943, 10.248.234.190, executor 87): ExecutorLostFailure (executor 87 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

I was using the 7.4 Genomics Runtime on large clusters with an i3.8xlarge driver and 4-22 workers.

This seems far too long given the speeds reported in the bioRxiv paper.

Help on this would be great. Thanks.

@henrydavidge (Contributor)

@bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?

Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.
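
For illustration, here is a rough sketch of that kind of phenotype batching, assuming label_df is a pandas DataFrame with one binary phenotype per column; the Delta paths and batch size below are placeholders, not part of the Glow API:

```python
# Rough sketch only: split the phenotypes into batches and run the WGR stages per batch.
batch_size = 20  # 15-20 phenotypes per run, as suggested above

phenotype_batches = [
    label_df.iloc[:, i:i + batch_size]
    for i in range(0, label_df.shape[1], batch_size)
]

for idx, label_batch in enumerate(phenotype_batches):
    reducer = RidgeReducer()
    reduced_df = reducer.fit_transform(block_df, label_batch, sample_blocks, covariate_df)

    # Writing the reduced block matrix to Delta (rather than only caching it) keeps stage 2
    # from recomputing stage 1 and avoids holding everything in executor memory.
    path = f'dbfs:/tmp/wgr/reduced_batch_{idx}'  # placeholder path
    reduced_df.write.format('delta').mode('overwrite').save(path)
    reduced_df = spark.read.format('delta').load(path)

    estimator = LogisticRegression()
    model_batch_df, cv_batch_df = estimator.fit(reduced_df, label_batch, sample_blocks, covariate_df)
```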

@bbsun92 (Author) commented Jan 13, 2021

Thanks

> @bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?
>
> Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.

Thanks, I think I tried the same with a smaller number of phenotypes and it was still very slow. The input data is the genetic data in Delta table format, as shown in the GLOW example code. I don't think the intermediate block matrix is written out as another set of Delta tables. Also, do you have a benchmark for how long stage 1 should take? Either way, I gather it shouldn't be taking 10 days given the large number of nodes being used?

@karenfeng (Collaborator)

Hi @bbsun92! I performed similar scale testing following PR #282 (this is included in Genomics Runtime 7.4). In my testing setup, I used the following:

  • Cluster with a r5d.12xlarge driver and 6 r5d.12xlarge workers
  • Dataset with 500K samples, 500K variants, 25 phenotypes, 16 covariates, and 5 alphas

The runtime was 15.85 minutes for stage 1 (writing the blocked data to a Delta table) and 85.8 minutes for stage 2.
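
For context on what stage 1 looks like, here is a minimal sketch of blocking the genotype matrix and writing it to Delta. It assumes Glow's WGR helpers get_sample_ids and block_variants_and_samples are available (the import path can differ between Glow versions), and the paths and block sizes below are placeholders:

```python
from glow.wgr import block_variants_and_samples, get_sample_ids  # import path may differ by Glow version

# Glow is pre-registered on the Databricks Genomics Runtime; otherwise call glow.register(spark) first.
# variant_df is assumed to already contain the per-sample genotype values column expected by GloWGR.
variant_df = spark.read.format('delta').load('dbfs:/mnt/genotypes.delta')  # placeholder path

sample_ids = get_sample_ids(variant_df)
block_df, sample_blocks = block_variants_and_samples(
    variant_df,
    sample_ids,
    variants_per_block=1000,   # placeholder block sizes
    sample_block_count=10,
)

# Stage 1: persist the blocked matrix so stage 2 reads it back from Delta instead of recomputing it.
block_df.write.format('delta').mode('overwrite').save('dbfs:/mnt/wgr/block_matrix.delta')
```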

@bbsun92 (Author) commented Feb 19, 2021

Thanks Karen. Do you have a notebook or a list of the exact steps that were taken, so I can try to reproduce this? Especially for stage one. Thanks!

@williambrandler (Contributor)

Hey @bbsun92, this repo contains all the latest notebooks under glow/docs/source/_static/notebooks.
