-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding GWAS Catalog study curation #17
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding and documenting the file! I envision having L2G's curation here as well.
I have comments specifically about the column names, let me know your thoughts.
docs/genetics.md
Outdated
### Schema | ||
|
||
- **studyId** - GCST study accession to identify study | ||
- **analysisFlag** - comment on the applied statistical method authors used that might have downstream implication in our pipelines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column name in the data is upateAnalysisFlags
.
Looking at the possible categories ("Case-case study", "GxG", "Multivariate analysis", "GxE"), I'd rename this field to studySubType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although these labels might of fall into a "subtype" category (however I'm 100% agree), there's no such requirement that this field would only contain such.
docs/genetics.md
Outdated
### Schema | ||
|
||
- **studyId** - GCST study accession to identify study | ||
- **analysisFlag** - comment on the applied statistical method authors used that might have downstream implication in our pipelines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- **analysisFlag** - comment on the applied statistical method authors used that might have downstream implication in our pipelines. | |
- **studySubType** - description of the specific statistical methodology employed in the GWAS. |
docs/genetics.md
Outdated
|
||
- **studyId** - GCST study accession to identify study | ||
- **analysisFlag** - comment on the applied statistical method authors used that might have downstream implication in our pipelines. | ||
- **updateStudyType** - if a study is not really a GWAS, but a qtl. This string will be picked up and replace the `type` value in the study index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd leave the comment of how we use this field for the pipeline documentation. Therefore, I suggest naming it to studyType
- **updateStudyType** - if a study is not really a GWAS, but a qtl. This string will be picked up and replace the `type` value in the study index. | |
- **studyType** - categorises the study as either GWAS or molQTL. |
docs/genetics.md
Outdated
- **studyId** - GCST study accession to identify study | ||
- **analysisFlag** - comment on the applied statistical method authors used that might have downstream implication in our pipelines. | ||
- **updateStudyType** - if a study is not really a GWAS, but a qtl. This string will be picked up and replace the `type` value in the study index. | ||
- **qualityControls** - `|` separated list of identified issues that prevent the study from ingestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column name in the data is upateQualityControls.
Because of the same reason as above, I'd rename it to qualityControls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the code that processes these columns there's a join that updates similarly named fields. It would make that code more complex. However at some point we come back to this.
Yes, that is certainly a desired path, also we should collate UKBB, FINNGEN trait curation here. Not sure if we should unify diease/trait mappings via ontoma though. There are arguments in both directions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having "updated" in the studyType and QualityControls headers makes it look like these columns are relative to the columns in another file. I think this curation can be used as a standalone file, not just in conjunction with the GWAS Catalog study index. So I think it's cleaner to move the logic into the genetics ETL when we use this file in a specific context.
This is my view, not a blocker if you think this solution is better. The only thing that I do think we need to fix is the typos: updated
instead of upate
.
I have removed the update prefix from the column names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the changes :)
As described in issue #3132, the manually curated GWAS Catalog studies will be stored in the curation repo. As it is now, the code responsible for annotating gwas catalog studies based on curation can take this table as it is. Even if the URL to the file in the repo is provided.