-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(eqtl_catalogue): study index improvements #369
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: df5a074
This pattern didn't work. modified_id
is my suggested fix:
(
df.select(f.lit("PROJECT_QTLGROUP_GENEID").alias("original_id"))
.withColumn("current_id", f.regexp_extract(f.col("original_id"), r"(.*)_[\_]+", 1))
.withColumn("modified_id", f.regexp_extract(f.col("original_id"), r"(.*)_[^_]+", 1))
.show(1)
)
+--------------------+----------+----------------+
| original_id|current_id| modified_id|
+--------------------+----------+----------------+
|PROJECT_QTLGROUP_...| |PROJECT_QTLGROUP|
+--------------------+----------+----------------+
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## dev #369 +/- ##
==========================================
+ Coverage 85.67% 86.84% +1.16%
==========================================
Files 89 92 +3
Lines 2101 2447 +346
==========================================
+ Hits 1800 2125 +325
- Misses 301 322 +21
|
I haven't yet run them, the job was running for 42min but got automatically cancelled. |
The study index we had for eQTL Catalogue ( The new study index is here:
Ad hoc script can be inspected here: |
Sorry @ireneisdoomed , I had to jump on other urget things, so I couldn't sped much time on the PR. However regarding your questions:
No. The actual measurement was the expression level, so it should be the gene. I assume there was this problem that this field is mandatory for generating the study index, however without the summary stats, you don't know the gene, so no valid study index can be constructed. Given we already have a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Irene, all changes look good to me and it's a good improvement & generalisation of the study index construction
@DSuveges @ireneisdoomed Regarding placement and meaning of study/index fields: the reason I went with the particular implementation I did was the least disruptive alignment with the existing schema. The actual trait being measured is "expression of gene X in tissue Y", which isn't possible to express in a single column in any meaningful way. So my idea of the least disruptive change was to set trait to tissue (implicitly meaning: trait = "expression in tissue Y"), and to populate gene as a separate column. We could set the trait to the generic "gene expression measurement", but then we would have to necessarily add the "tissueId" column as well. Which isn't a bad thing, but is something to keep in mind I think. If we decide to go with the "gene expression measurement" + tissueId + geneId way, I think it's best to do it as a separate PR to not keep this one waiting, but again that's up for discussion. |
It's a complicated question, but I would not store tissue id in trait column to avoid ambiguity, however we'll soon have to deal with cell ids as well. I'm wondering if we should add a separate column for that too: |
@DSuveges This makes sense, and I like the idea with tissueId + cellId + geneId! |
@ireneisdoomed Yes, I'm happy to make these changes. But just for clarity we're going ahead with merging this PR as is, right? |
This PR includes several changes to the eQTL Catalogue study index parser:
studyId
, now studies are also split by gene.studyId
to extract the gene