
feat(genetics_etl): data freeze 10 #82

Closed
wants to merge 8 commits into from
Conversation

@project-defiant (Collaborator) commented Nov 22, 2024

Context

I have experimented with the configuration of both steps:

dev_efm_config = {
    "dataproc:efm.spark.shuffle": "primary-worker",
    "spark:spark.sql.files.maxPartitionBytes": "1073741824",
    "spark:spark.sql.shuffle.partitions": "100",
    "yarn:spark.shuffle.io.serverThreads": "50",
    "spark:spark.shuffle.io.numConnectionsPerPeer": "5",
    "spark:spark.stage.maxConsecutiveAttempts": "10",
    "spark:spark.task.maxFailures": "10",
}

new_efm_config = {
    "dataproc:efm.spark.shuffle": "primary-worker",
    "spark:spark.sql.adaptive.enabled": "true",
    "spark:spark.sql.files.maxPartitionBytes": "1073741824",
    "yarn:spark.shuffle.io.serverThreads": "128",
    "spark:spark.shuffle.io.backlog": "8192",
    "spark:spark.shuffle.io.maxRetries": "50",
    "spark:spark.shuffle.io.numConnectionsPerPeer": "5",
    "spark:spark.shuffle.io.retryWait": "30s",
    "spark:spark.shuffle.io.connectionTimeout": "1m",
    "spark:spark.io.compression.lz4.blockSize": "512KB",
    "spark:spark.shuffle.service.enabled": "true",
    "spark:spark.sql.shuffle.partitions": "100",
    "spark:spark.stage.maxConsecutiveAttempts": "10",
    "spark:spark.task.maxFailures": "10",
    "spark:dynamicAllocationEnabled": "true",
    "spark:spark.rpc.io.serverThreads": "50",
    "spark:spark.shuffle.service.index.cache.size": "2048m",
    "spark:spark.shuffle.service.removeShuffle": "true",
}
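
For context, these maps are Dataproc cluster properties (the `spark:`/`yarn:`/`dataproc:` prefixes select the target config file). Below is a minimal sketch of how such a property map could be applied when creating the cluster with the google-cloud-dataproc client; the project, region, cluster name and machine shapes are placeholders, not the exact setup used for these runs.

```python
# Sketch only: apply the EFM/Spark property map above via SoftwareConfig.properties.
# Project, region, cluster name and machine shapes below are placeholders.
from google.cloud import dataproc_v1

region = "europe-west1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",        # placeholder
    "cluster_name": "coloc-freeze10",  # placeholder
    "config": {
        "master_config": {"machine_type_uri": "n1-standard-16", "num_instances": 1},
        "worker_config": {"machine_type_uri": "n1-standard-16", "num_instances": 10},
        # new_efm_config is the property map defined above
        "software_config": {"properties": new_efm_config},
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up
```

The same map could equally be passed through the `--properties` flag of `gcloud dataproc clusters create`.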
| Step | EFM allowed | Shuffle partitions | Tweaks | Elapsed time | EFM configuration | Job ID | Job status |
|---|---|---|---|---|---|---|---|
| Coloc | True | 100 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 10 | ~35m | dev | af27c390-15d8-47a7-961c-687daffb655b | Failure (consecutive FetchFailedException) |
| Coloc | True | 4000 | 2g executor memory, 2Tb SSD disk size (primary workers) | ~24m | new | 4ffde9db-b624-4c89-8f47-22ab52f60177 | Success |
| eCaviar | True | 4000 | 2g executor memory, 2Tb SSD disk size (primary workers) | ~30m | new | f803e9ce-aef5-44c5-b3b6-451116337002 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 10_000 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 10 | ~25m | new | 8c521f42-393e-4bd5-a00f-77046d9a63a7 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 10_000 | 2g executor memory, 2Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~26m | new | b85dd7e3-2398-401e-8e0f-792984b9da83 | Failure (consecutive FetchFailedException) |
| eCaviar | True | 3000 | 2g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~25m | new | 021e452f-f80f-4ab3-aeaf-5eaa685cffe4 | Failure (OOM error) |
| eCaviar | True | 3000 | 4g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50, 4g executor memory overhead | ~30m | new | 3bebc91b-4eb3-4144-aad9-8c622cd71c13 | Failure (OOM error) |
| eCaviar | True | 6000 | 8g executor memory, 4Tb SSD disk size (primary workers), spark:spark.shuffle.io.maxRetries = 50 | ~30m | new | 3ea91e54-cb46-43fb-bce4-878b696b7950 | Failure (consecutive FetchFailedException, disk size issues?) |
| eCaviar | False | 4000 | 8g executor memory, 2Tb SSD disk size (primary and secondary workers) | ~70m | None | bcfb9168-d753-466c-bd59-7ade6ce3ad60 | Success |

Note

To sum up, I think that EFM, although beneficial for shuffling in long-running tasks, causes disk-space issues because the shuffle partitions are stored only on the primary workers; the no-EFM approach can spread partition storage over the disks of both primary and secondary workers. Although not tested in the context of this PR, the assumption is that coloc should also run without EFM but with increased executor memory, just like eCaviar.

Note

Since the freeze9 coloc and eCaviar runs worked without any workarounds, and the variantIndex has not changed much between freeze9 and freeze10 (freeze10 contains eQTL credible sets patched by @addramir), the most likely explanation is that the cause of the overlap skewness is the eQTL Catalogue rather than FinnGen.

Warning

The colocalisation is still not perfect: the successful runs were very fragile, and more than half of the tasks in the last overlap stages were failing.

Szymon Szyszkowski added 5 commits November 21, 2024 13:46

@project-defiant removed the request for review from d0choa November 22, 2024 11:50
@project-defiant (Collaborator, Author) commented Nov 25, 2024

Coloc testing v2

After discussions with @d0choa, we performed an additional run with the following parameters:

  • n1-highmem-16 instead of n1-standard-16 on secondary workers
  • without EFM
  • Spark configuration for the steps (see the sketch after the test list below):
{
    "spark.executor.memory": "16G",
    "spark.shuffle.partitions": "3200",
    "spark.sql.files.maxPartitionBytes": "25000000",  # ~25 MB instead of the default 128 MB
    "spark.executor.cores": "2",
    "spark.sql.adaptive.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}

in two tests:

  • Coloc and eCaviar run separately, in 18 and 15 minutes respectively (~77 workers)
  • Both steps together on a single cluster in less than 25 minutes (~100 workers)
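
For reference, a minimal sketch of how the configuration above could be applied when building the Spark session for these steps; it assumes the standard `spark.sql.shuffle.partitions` key (written as `spark.shuffle.partitions` above) and a placeholder application name. On Dataproc the same values would typically be supplied as job properties at submission time rather than in code.

```python
# Sketch only: applying the run configuration above to a Spark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("coloc-freeze10")  # placeholder name
    .config("spark.executor.memory", "16G")
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "3200")
    .config("spark.sql.files.maxPartitionBytes", "25000000")  # ~25 MB input splits
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```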

Note

The overlaps are a highly skewed dataset; the two optimisations

  • increasing the memory of the cluster x2 (from 64G to 128G per machine) and running fewer executors with bigger memory (16G, after unsuccessful iterations with 4G and 8G)
  • decreasing the maximal partition size at input with maxPartitionBytes

allow the partition output to be much bigger than originally.
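
A back-of-the-envelope sketch of why these two changes help; the 100 GiB input size is an assumption for illustration only, and just the Spark settings come from the run above.

```python
# Back-of-the-envelope arithmetic for the two optimisations above.
input_bytes = 100 * 1024**3          # hypothetical overlap input size (assumption)
default_split = 128 * 1024**2        # Spark default spark.sql.files.maxPartitionBytes
tuned_split = 25_000_000             # value used in the run above (~25 MB)

print(input_bytes // default_split)  # ~800 input partitions with the default
print(input_bytes // tuned_split)    # ~4300 input partitions after tuning

# With 16G executors running 2 cores each, every task slot has ~8 GB of memory,
# compared to the 2-8 GB whole executors of the earlier attempts, so a skewed
# overlap partition has far more headroom before it spills or OOMs.
```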

Link to the runs:

@project-defiant (Collaborator, Author)

Closing this PR, as the changes from here were already added by @Javi to #84.
