feat: extract credible sets and studies from all eQTL Catalogue finemapping results #518
Conversation
@@ -325,3 +327,24 @@ def parse_efos(efo_uri: Column) -> Column:
    """
    colname = efo_uri._jc.toString()
    return f.array_sort(f.expr(f"regexp_extract_all(`{colname}`, '([A-Z]+_[0-9]+)')"))


def get_logsum(arr: NDArray[np.float64]) -> float:
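For readers outside the Spark codebase, the `parse_efos` logic in the hunk above can be sketched in plain Python with the standard `re` module. This is an illustrative equivalent of the Spark `regexp_extract_all` + `array_sort` expression, not the repository's implementation:

```python
import re


def parse_efos(efo_uri: str) -> list[str]:
    # Extract EFO-style ontology identifiers (e.g. "EFO_0000400") from a
    # URI string and return them sorted, mirroring the Spark expression
    # regexp_extract_all(col, '([A-Z]+_[0-9]+)') wrapped in array_sort.
    return sorted(re.findall(r"[A-Z]+_[0-9]+", efo_uri))


print(parse_efos("http://www.ebi.ac.uk/efo/MONDO_0005148,EFO_0000400"))
# → ['EFO_0000400', 'MONDO_0005148']
```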
I think this function only makes sense in the context of StudyLocus. It can be a method there.
and probably make it private
Thank you! As discussed, StudyLocus incorporates a wrapper around this function that uses it in the specific context of calculating log10BFs for the credible set.
The same function is used differently in COLOC, to calculate posterior probabilities for all hypotheses including H4, so it is not exactly StudyLocus specific.
We'll leave it like this.
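For context on the discussion above: a shared `get_logsum` of this kind is typically a numerically stable log-sum-exp. The following NumPy sketch is an assumption about the implementation based on the signature shown in the diff, not the repository's actual code:

```python
import numpy as np
from numpy.typing import NDArray


def get_logsum(arr: NDArray[np.float64]) -> float:
    # Numerically stable log(sum(exp(arr))): subtracting the maximum
    # before exponentiating avoids overflow for large log Bayes factors,
    # then the maximum is added back at the end.
    themax = np.max(arr)
    return float(themax + np.log(np.sum(np.exp(arr - themax))))
```

Both the COLOC posterior calculation and the credible-set wrapper can then call this one helper.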
Looks great. Think about the logsum and we can merge it.
    Returns:
        DataFrame: Log Bayes Factors DataFrame.
    """
    return session.spark.read.csv(
we clearly have different ways of understanding the world 🤣
@@ -40,24 +43,45 @@ class EqtlCatalogueStudyIndex:
        StructField("quant_method", StringType(), True),
    ]
)
-    raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/master/data_tables/dataset_metadata.tsv"
+    raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/19929ff6a99bf402194292a14f96f9615b35f65f/data_tables/dataset_metadata.tsv"
nice
…apping results (#518)
* feat: dataflow decompress prototype (#501)
* chore: commit susie results gist
* feat(study_index): add `tissueFromSourceId` to schema and make `traitFromSource` nullable
* fix: bug and linting fixes in new eqtl ingestion step
* perf: config bugfixes and performance improvements
* perf: remove data persistance to avoid executor failure
* perf: load susie results for studies of interest only
* perf: collect locus for leads only and optimise partitioning cols
* feat: parametrise methods to include
* feat: run full dag
* test: add tests
* fix: reorder test inputs
* docs: update eqtl catalogue docs
* fix: correct typos in tests docstrings
* refactor: change mqtl_quantification_methods to mqtl_quantification_methods_blacklist
* feat: studyId is based on measured trait and not on gene
* feat: credible set lead is the variant with highest pip
* feat(studies): change logic in _identify_study_type to extract qtl type based on quantization method
* refactor: externalise reading logic to source classes
* chore: add mqtl_quantification_methods_blacklist to yaml config
* docs: update docs
* fix(dag): pass bucket name to GCSDeleteBucketOperator
* refactor(coloc): move get_logsum function to common utils
* feat(studylocus): add calculate_credible_set_log10bf and use it for eqtlcat credible sets
* fix: credible sets dataset is too large and cant be broadcasted
* fix(dag): use GCSDeleteObjectsOperator instead of GCSDeleteBucketOperator
* fix: correct typo
* fix: correct typo
This PR processes all eQTL Catalogue fine mapping results. Main changes:
* New `studyId` definition: we go from `publication_tissue_gene` to `publication_tissue_measuredtrait`.
* New `StudyLocus.calculate_credible_set_log10bf` method: this aggregates all single LBFs in a credible set.
The processing job takes 20 minutes; the full DAG will take ~50 minutes.
This is what the DAG looks like:
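As a rough sketch of what the new `calculate_credible_set_log10bf` aggregation might do, assuming (per the discussion above) that it combines per-variant natural-log Bayes factors with a stable log-sum-exp and converts the result to log10. Names and details are illustrative, not the repository's PySpark implementation:

```python
import numpy as np
from numpy.typing import NDArray


def calculate_credible_set_log10bf(lbf_values: NDArray[np.float64]) -> float:
    # Hypothetical sketch: aggregate single-variant natural-log Bayes
    # factors (LBFs, e.g. from SuSiE) into one credible-set-level value.
    # Stable log-sum-exp on the natural-log scale...
    themax = np.max(lbf_values)
    logsum = themax + np.log(np.sum(np.exp(lbf_values - themax)))
    # ...then convert from natural log to log10.
    return float(logsum / np.log(10.0))


# A single variant with BF = 10 gives a credible-set log10BF of 1.
print(calculate_credible_set_log10bf(np.array([np.log(10.0)])))
# → 1.0
```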