
feat(genetics_etl): data freeze 10 #82

Closed
wants to merge 8 commits into from
Conversation

@project-defiant (Collaborator) commented Nov 22, 2024

Context

I have experimented with the configuration of both steps:

dev_efm_config = {
    "dataproc:efm.spark.shuffle": "primary-worker",
    "spark:spark.sql.files.maxPartitionBytes": "1073741824",
    "spark:spark.sql.shuffle.partitions": "100",
    "yarn:spark.shuffle.io.serverThreads": "50",
    "spark:spark.shuffle.io.numConnectionsPerPeer": "5",
    "spark:spark.stage.maxConsecutiveAttempts": "10",
    "spark:spark.task.maxFailures": "10",
}

new_efm_config = {
    "dataproc:efm.spark.shuffle": "primary-worker",
    "spark:spark.sql.adaptive.enabled": "true",
    "spark:spark.sql.files.maxPartitionBytes": "1073741824",
    "yarn:spark.shuffle.io.serverThreads": "128",
    "spark:spark.shuffle.io.backlog": "8192",
    "spark:spark.shuffle.io.maxRetries": "50",
    "spark:spark.shuffle.io.numConnectionsPerPeer": "5",
    "spark:spark.shuffle.io.retryWait": "30s",
    "spark:spark.shuffle.io.connectionTimeout": "1m",
    "spark:spark.io.compression.lz4.blockSize": "512KB",
    "spark:spark.shuffle.service.enabled": "true",
    "spark:spark.sql.shuffle.partitions": "100",
    "spark:spark.stage.maxConsecutiveAttempts": "10",
    "spark:spark.task.maxFailures": "10",
    "spark:dynamicAllocationEnabled": "true",
    "spark:spark.rpc.io.serverThreads": "50",
    "spark:spark.shuffle.service.index.cache.size": "2048m",
    "spark:spark.shuffle.service.removeShuffle": "true",
}
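
For context, these maps are Dataproc cluster properties (the `spark:`/`yarn:`/`dataproc:` prefixes select the target config file). Below is a minimal sketch of how such a property map could be applied when creating the cluster with the google-cloud-dataproc client; the project, region, cluster name and machine shapes are placeholders, not the exact setup used for these runs.

```python
# Sketch only: apply the EFM/Spark property map above via SoftwareConfig.properties.
# Project, region, cluster name and machine shapes below are placeholders.
from google.cloud import dataproc_v1

region = "europe-west1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",        # placeholder
    "cluster_name": "coloc-freeze10",  # placeholder
    "config": {
        "master_config": {"machine_type_uri": "n1-standard-16", "num_instances": 1},
        "worker_config": {"machine_type_uri": "n1-standard-16", "num_instances": 10},
        # new_efm_config is the property map defined above
        "software_config": {"properties": new_efm_config},
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up
```

The same map could equally be passed through the `--properties` flag of `gcloud dataproc clusters create`.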
| Step | EFM allowed | Shuffle partitions | Tweaks | Elapsed time | EFM configuration | Job ID | Job status |
|---|---|---|---|---|---|---|---|
| Coloc | True | 100 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 10 | ~35m | dev | af27c390-15d8-47a7-961c-687daffb655b | Failure (consecutive FetchFailedException) |
| Coloc | True | 4000 | 2g executor memory, 2Tb SSD disk size (primary workers) | ~24m | new | 4ffde9db-b624-4c89-8f47-22ab52f60177 | Success |
| eCaviar | True | 4000 | 2g executor memory, 2Tb SSD disk size (primary workers) | ~30m | new | f803e9ce-aef5-44c5-b3b6-451116337002 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 10_000 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 10 | ~25m | new | 8c521f42-393e-4bd5-a00f-77046d9a63a7 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 10_000 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~26m | new | b85dd7e3-2398-401e-8e0f-792984b9da83 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 3000 | 2g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~25m | new | 021e452f-f80f-4ab3-aeaf-5eaa685cffe4 | Failure (OOM error) |
| eCaviar | True | 3000 | 4g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50, 4g executor memory overhead | ~30m | new | 3bebc91b-4eb3-4144-aad9-8c622cd71c13 | Failure (OOM error) |
| eCaviar | True | 6000 | 8g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~30m | new | 3ea91e54-cb46-43fb-bce4-878b696b7950 | Failure (consecutive FetchFailedException, disk size issues?) |
| eCaviar | False | 4000 | 8g executor memory, 2Tb SSD disk size (primary and secondary workers) | ~70m | None | bcfb9168-d753-466c-bd59-7ade6ce3ad60 | Success |

Note

To sum up, I think that EFM, although beneficial for shuffling in long-running tasks, causes disk-space issues because the shuffle partitions are stored only on the primary workers; the no-EFM approach can spread partition storage over the disks of both primary and secondary workers. Although not tested in the context of this PR, the assumption is that coloc should also run without EFM but with increased executor memory, just like eCaviar.

Note

Since the freeze9 coloc and eCaviar runs worked without any workarounds, and the variantIndex has not changed much between freeze9 and freeze10 (freeze10 contains eQTL credible sets patched by @addramir), the most likely explanation is that the cause of the overlap skewness is the eQTL Catalogue rather than FinnGen.

Warning

The colocalisation is still not perfect: the successful runs were very fragile, and more than half of the tasks in the last overlap stages were failing.

Szymon Szyszkowski added 5 commits November 21, 2024 13:46

@project-defiant removed the request for review from d0choa November 22, 2024 11:50
@project-defiant (Collaborator, Author) commented Nov 25, 2024

Coloc testing v2

After discussions with @d0choa, we performed an additional run with the following parameters:

  • n1-highmem-16 instead of n1-standard-16 on secondary workers
  • without EFM
  • Spark configuration for the steps (see the sketch after the test list below):
{
    "spark.executor.memory": "16G",
    "spark.shuffle.partitions": "3200",
    "spark.sql.files.maxPartitionBytes": "25000000",  # ~25 MB instead of the default 128 MB
    "spark.executor.cores": "2",
    "spark.sql.adaptive.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}

in two tests:

  • Coloc and eCaviar run separately, in 18 and 15 minutes respectively (~77 workers)
  • Both steps together on a single cluster in less than 25 minutes (~100 workers)
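
For reference, a minimal sketch of how the configuration above could be applied when building the Spark session for these steps; it assumes the standard `spark.sql.shuffle.partitions` key (written as `spark.shuffle.partitions` above) and a placeholder application name. On Dataproc the same values would typically be supplied as job properties at submission time rather than in code.

```python
# Sketch only: applying the run configuration above to a Spark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("coloc-freeze10")  # placeholder name
    .config("spark.executor.memory", "16G")
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "3200")
    .config("spark.sql.files.maxPartitionBytes", "25000000")  # ~25 MB input splits
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```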

Note

The overlaps are a highly skewed dataset; the two optimisations

  • increasing the memory of the cluster x2 (from 64G to 128G per machine) and running fewer executors with bigger memory (16G, after unsuccessful iterations with 4G and 8G)
  • decreasing the maximal partition size at input with maxPartitionBytes

allow the partition output to be much bigger than originally.
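
A back-of-the-envelope sketch of why these two changes help; the 100 GiB input size is an assumption for illustration only, and just the Spark settings come from the run above.

```python
# Back-of-the-envelope arithmetic for the two optimisations above.
input_bytes = 100 * 1024**3          # hypothetical overlap input size (assumption)
default_split = 128 * 1024**2        # Spark default spark.sql.files.maxPartitionBytes
tuned_split = 25_000_000             # value used in the run above (~25 MB)

print(input_bytes // default_split)  # ~800 input partitions with the default
print(input_bytes // tuned_split)    # ~4300 input partitions after tuning

# With 16G executors running 2 cores each, every task slot has ~8 GB of memory,
# compared to the 2-8 GB whole executors of the earlier attempts, so a skewed
# overlap partition has far more headroom before it spills or OOMs.
```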

Link to the runs:

@project-defiant (Collaborator, Author)

Closing this PR, as the changes from here were already added by @Javi to #84.
