feat: Finngen FM results ingestion #394
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #394      +/-   ##
==========================================
+ Coverage   85.67%   85.88%   +0.21%
==========================================
  Files          89       98       +9
  Lines        2101     2629     +528
==========================================
+ Hits         1800     2258     +458
- Misses        301      371      +70
This is really great! I've added some code suggestions and general questions.
It'd be good to see one example of a studyLocus extracted from susie's results.
Thanks!
@@ -15,3 +15,5 @@ title: FinnGen
[FinnGen](https://www.finngen.fi/en) is a research project in genomics and personalized medicine, representing a large public-private partnership. The project has collected and analyzed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases. FinnGen is now expanding its focus to comprehend the progression and biological mechanisms of diseases. This initiative provides a world-class resource for further breakthroughs in disease prevention, diagnosis, and treatment, offering insights into our genetic makeup.

For a comprehensive understanding of the dataset and methods, refer to [Kurki et al., 2023](https://www.nature.com/articles/s41586-022-05473-8).

We ingested full GWAS sumamry statstics and SuSiE-based fine-mapping results.
Suggested change:
- We ingested full GWAS sumamry statstics and SuSiE-based fine-mapping results.
+ We ingested full GWAS summary statistics and SuSiE-based fine-mapping results.
trigger_rule=TriggerRule.ALL_DONE,
)
# with TaskGroup(
This needs to be uncommented. As I understand it, you're defining two processing streams that generate credible sets: direct ingestion of fine-mapping results, and processing of summary statistics.
>> ld_clumping
>> pics
>> common.delete_cluster(CLUSTER_NAME)
# >> [finngen_summary_stats_preprocess, finngen_finemapping]
Needs to be uncommented as well.
if TYPE_CHECKING:
    pass


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
The session object has a logger. Do you need this for something specific?
Returns:
    StudyLocus: Processed SuSIE finemapping output in StudyLocus format.
"""
processed_finngen_finemapping_df = (
It's usually better practice to avoid reading data in functions that contain business logic, so they are easier to test and debug. Read the data in the step, then call your function with the dependencies injected. Ideally you shouldn't need Spark here.
+1
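The separation suggested in this thread can be sketched roughly as follows. This is a minimal illustration in plain Python rather than Spark, and the function names `process_finemapping` and `run_step`, the input columns `trait`, `cs` and `pip`, and the filtering rule are all hypothetical, not taken from the PR.

```python
import csv
import io
from typing import Dict, Iterable, List


def process_finemapping(rows: Iterable[Dict[str, str]]) -> List[dict]:
    """Business logic only: no file or session access happens here."""
    processed = []
    for row in rows:
        cs_index = int(row["cs"])
        # Hypothetical rule for illustration: keep only variants assigned to a credible set.
        if cs_index < 1:
            continue
        processed.append(
            {
                "studyId": row["trait"].upper(),
                "credibleSetIndex": cs_index,
                "posteriorProbability": float(row["pip"]),
            }
        )
    return processed


def run_step(handle: io.TextIOBase) -> List[dict]:
    """Step layer: reads the tab-separated data, then injects it into the logic."""
    reader = csv.DictReader(handle, delimiter="\t")
    return process_finemapping(reader)
```

Because `process_finemapping` takes rows rather than a path, a test can pass a small in-memory list without any session or bucket access.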
@@ -82,6 +88,18 @@
"nullable": true,
"type": "string"
},
{
"metadata": {},
"name": "credibleSetIndex",
What is the interpretation of this index? I feel that we've talked about this already, can you remind me if credible set 1 means anything over credible set 2?
config/datasets/gcp.yaml
Outdated
@@ -25,6 +25,9 @@ ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study
l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
eqtl_catalogue_paths_imported: ${datasets.inputs}/preprocess/eqtl_catalogue/tabix_ftp_paths_imported.tsv
finngen_finemapping_results_url: gs://genetics-portal-dev-analysis/xg1/Finngen_finemapping_r10
We need to put this in a more general bucket.
config/datasets/gcp.yaml
Outdated
@@ -25,6 +25,9 @@ ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study
l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
eqtl_catalogue_paths_imported: ${datasets.inputs}/preprocess/eqtl_catalogue/tabix_ftp_paths_imported.tsv
finngen_finemapping_results_url: gs://genetics-portal-dev-analysis/xg1/Finngen_finemapping_r10
finngen_finemapping_summaries_url: gs://genetics-portal-dev-analysis/xg1/Finngen_susie_credset_summary_r10.tsv
Same as above
config/datasets/gcp.yaml
Outdated
@@ -25,6 +25,9 @@ ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study
l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
eqtl_catalogue_paths_imported: ${datasets.inputs}/preprocess/eqtl_catalogue/tabix_ftp_paths_imported.tsv
finngen_finemapping_results_url: gs://genetics-portal-dev-analysis/xg1/Finngen_finemapping_r10
finngen_finemapping_summaries_url: gs://genetics-portal-dev-analysis/xg1/Finngen_susie_credset_summary_r10.tsv
finngen_release_prefix: "finngen_R10_"
When we process FinnGen's manifest, studyIds are in upper case.
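One way to reconcile the configured prefix with upper-case studyIds is to normalise the case when the identifier is built. This is only a sketch: the helper name `build_study_id` and the phenotype value are illustrative assumptions, not code from the PR.

```python
def build_study_id(release_prefix: str, phenotype: str) -> str:
    """Join prefix and phenotype, upper-casing the result to match the manifest convention."""
    return f"{release_prefix}{phenotype}".upper()


# The prefix is the value from the config line under review; the phenotype is a made-up example.
study_id = build_study_id("finngen_R10_", "example_trait")
print(study_id)  # FINNGEN_R10_EXAMPLE_TRAIT
```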
config/datasets/gcp.yaml
Outdated
@@ -43,6 +46,7 @@ catalog_study_locus: ${datasets.study_locus}/catalog_study_locus
gwas_catalog_study_curation: ${datasets.inputs}/v2d/GWAS_Catalog_study_curation.tsv
finngen_study_index: ${datasets.study_index}/finngen
finngen_summary_stats: ${datasets.summary_statistics}/finngen
finngen_finemapping_out: gs://genetics-portal-dev-analysis/xg1/Finngen_finemapping_r10_processed
Since you are outputting credible sets, I think this should be written to ${datasets.credible_set}/finngen
…m/opentargets/genetics_etl_python into xg1-finngen-finemapping-ingestion
Closing this PR because I couldn't figure out how to merge changes from dev into this branch after the repository was renamed. Comments and other changes will be addressed in the new finngen_fm_ingestion branch.
* feat: ingest finngen r10 finemapping w/ airflow
* fix: addressed comments from PR #394
* fix: address comments from PR
* Update config/step/ot_finngen_finemapping_ingestion.yaml
  Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
* Update config/step/ot_finngen_finemapping_ingestion.yaml
  Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
* chore: remove unnecessary config
---------
Co-authored-by: Yakov <yt4@sanger.ac.uk>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: David Ochoa <dogcaesar@gmail.com>

* feat(finemapping): ingest finngen r10 finemapping w/ airflow (#435)
* feat: ingest finngen r10 finemapping w/ airflow
* fix: addressed comments from PR #394
* fix: address comments from PR
* Update config/step/ot_finngen_finemapping_ingestion.yaml
  Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
* Update config/step/ot_finngen_finemapping_ingestion.yaml
  Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
* chore: remove unnecessary config
---------
Co-authored-by: Yakov <yt4@sanger.ac.uk>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: David Ochoa <dogcaesar@gmail.com>
* docs: susie inf method reloacated with the rest of the methods
* ci(release): add action to open pr that triggers release weekly (#474)
* chore: add action to open pr that triggers release weekly
* fix: make yamllint interpret on keyword as string
* fix: adapt time to gmt
* chore: update to run at 4pm
* fix: update github token variable (#476)
* ci: exclude changelog.md from precommit (#479)
---------
Co-authored-by: Yakov <yt4@sanger.ac.uk>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: David Ochoa <dogcaesar@gmail.com>
Added "region", "credibleSetIndex" and "credibleSetlog10BF" columns to the studyLocus schema to accommodate SuSiE fine-mapping outputs.
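For illustration, the three new fields might look like the entries below, following the field layout visible in the schema diff (`metadata`, `name`, `nullable`, `type`). The types and nullability shown are assumptions, since the excerpt does not state them, and `add_fields` is a hypothetical helper rather than part of the repository.

```python
import json

# Assumed types: the excerpt only shows the layout of a field entry, not these types.
NEW_FIELDS = [
    {"metadata": {}, "name": "region", "nullable": True, "type": "string"},
    {"metadata": {}, "name": "credibleSetIndex", "nullable": True, "type": "integer"},
    {"metadata": {}, "name": "credibleSetlog10BF", "nullable": True, "type": "double"},
]


def add_fields(schema: dict, new_fields: list) -> dict:
    """Return a copy of a Spark-style JSON schema with fields appended, skipping names already present."""
    existing = {field["name"] for field in schema["fields"]}
    merged = schema["fields"] + [f for f in new_fields if f["name"] not in existing]
    return {**schema, "fields": merged}


# A one-field stand-in for the real studyLocus schema, used only to exercise the helper.
base = {
    "type": "struct",
    "fields": [{"metadata": {}, "name": "studyId", "nullable": False, "type": "string"}],
}
print(json.dumps(add_fields(base, NEW_FIELDS), indent=2))
```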