Annotate four representative OpenNeuro datasets #10
Data Model related issues

- Missing conditions.
- Problem with the way we model assessment tools. Right now we assume that if you have several columns linked to the same assessment tool and any of these has a missing value, then the participant doesn't "have" the tool. That's not great for several reasons:
- A participant with two conflicting diagnoses is currently hard to model. Example:

Controlled Vocabulary related problems

- Having to look up the controlled terms by hand is pretty annoying (and probably prone to error). If we turn this into a workflow, it will absolutely have to have the possible values pre-configured.
- Measurement that isn't a tool. For example:
- Missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in Cognitive Atlas. Example:
- Clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in Cognitive Atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to "Battelle Developmental Inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.

Continuous values problems

- Categorical variable encoded with numbers.
- Need to look inside the numeric column to categorize (see the sketch after this list).

Data quality issues

- Tool name not recognizable. Some datasets annotate their data with the measured concept rather than the name of the tool, for example "Handedness" instead of "Edinburgh Handedness Inventory". I'm not sure how to annotate this. Cogatlas does have "concepts" like that (that's the whole purpose of the project), but our data model currently expects the range of an assessment edge to be a specific controlled term for a tool. This is probably more of an issue for "bulk annotation", where the user annotating is usually not the data owner, who would have more insight.
- Wrong or conflicting description. Low-quality data dictionaries are an issue because now I don't know who to believe. Example:
- Duplicate columns or leftover stuff. Some participants.tsv files are pretty low quality. For example:
- Multi-session info in the participants.tsv file. This is related to the multiple conflicting diagnosis issue described above. The problem arises from people putting repeated measures in the participants.tsv file.
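A minimal sketch of the "look inside the numeric column" idea, assuming pandas and a made-up threshold; this is illustrative, not part of any actual annotation workflow:

```python
import pandas as pd

# Heuristic: treat a numeric column as categorical if it has only a few
# distinct integer-valued levels (e.g. 0/1/2 group codes). The max_levels
# cutoff is an arbitrary assumption for illustration.
def looks_categorical(col: pd.Series, max_levels: int = 10) -> bool:
    values = pd.to_numeric(col, errors="coerce").dropna().unique()
    return 0 < len(values) <= max_levels and all(float(v).is_integer() for v in values)

# Example usage ("group" is a hypothetical column name):
# df = pd.read_csv("participants.tsv", sep="\t")
# print(looks_categorical(df["group"]))
```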
OK, I think this thing is done. Overall summary:
> missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in cognitive atlas. Example: ds000201 has HADS_Anxiety which is the "Hospital Anxiety and Depression scale, Anxiety subscale". I can find that in SNOMED: snomed:273524006, even down to the subscale.

Hhmm... I had assumed we were going to be mixing vocabularies anyway...
This might point to the need to have controlled vocabulary term metadata on display to help guide the user.
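As a rough illustration of what "term metadata on display" could look like in a search dropdown, here is a sketch where the term records, IDs, and definitions are placeholders, not real vocabulary entries:

```python
# Placeholder term records; the point is only that label + definition are
# shown together so clashing abbreviations (like "BDI") are easier to tell apart.
terms = [
    {"id": "vocab:0001", "label": "BDI",
     "definition": "Battelle Developmental Inventory"},
    {"id": "vocab:0002", "label": "Beck Depression Inventory",
     "definition": "Self-report measure of depression severity"},
]

def dropdown_entries(records):
    """Render label, ID, and definition for each candidate term."""
    return [f"{t['label']} ({t['id']}): {t['definition']}" for t in records]
```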
No. We have two principles so far:
There may be some cases now where we are not being consistent about principle 2 (e.g. healthy-controls currently comes from NCIT rather than SNOMED). But that's one of the "lessons learned" from the OMOP folks that we should really stick with: one vocabulary per variable.
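For concreteness, a hedged sketch of what a "one vocabulary per variable" check could look like; the variable names and prefixes are assumptions, not the actual Neurobagel configuration:

```python
# Assumed mapping: each harmonized variable draws its controlled terms from
# exactly one vocabulary (names and prefixes here are illustrative).
VOCAB_PER_VARIABLE = {
    "diagnosis": "snomed",
    "assessment_tool": "cogatlas",
}

def term_matches_variable(variable: str, term: str) -> bool:
    """Accept a term (e.g. 'snomed:273524006') only if its prefix is the variable's vocabulary."""
    prefix = term.split(":", 1)[0]
    return VOCAB_PER_VARIABLE.get(variable) == prefix
```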
Yeah. It might not be enough to just show the name in a dropdown. Maybe we need, as you say, other metadata.
Moving into Review - Active, I'll take a look at the tsv as well 🙂
(sorry, pressed comment before I was ready) Below are my comments on some of the issues.
I think some options for us are (barring major changes to the current data model): (a) pick the closest available term from the vocab, even if it's not 100% accurate, or (b) consider creating an
Agreed, I think in practice this doesn't work well. Especially because missing values in assessment tool subscales are so common and have many ways to be imputed during statistical analysis, I don't think it'd be very useful to impose an "all or none" approach at the cohort definition stage. I think for now we can loosen our constraint on this and annotate a subject as "having" an assessment if they have non-missing values for any (up to all) of the columns for a tool.
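A minimal sketch of the loosened rule, assuming pandas, a participants.tsv loaded into a DataFrame, and a made-up mapping of tools to subscale columns (the column names are assumptions):

```python
import pandas as pd

# Hypothetical tool-to-columns mapping; only HADS_Anxiety appears in the
# thread above, HADS_Depression is assumed for illustration.
tool_columns = {"HADS": ["HADS_Anxiety", "HADS_Depression"]}

def has_assessment(df: pd.DataFrame, cols: list[str], require_all: bool = False) -> pd.Series:
    """Return a boolean Series: True where a participant "has" the tool.

    require_all=True reproduces the current "all or none" behaviour;
    the default False implements the loosened "any non-missing column" rule.
    """
    non_missing = df[cols].notna()
    return non_missing.all(axis=1) if require_all else non_missing.any(axis=1)
```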
Since diagnosis is currently handled at the subject level and not the session level, I think for bulk annotations the best we can do is store the diagnosis at baseline, and potentially flag longitudinal data using another "Decision" option in the spreadsheet (maybe "revisit"?). We probably want to create another issue to discuss how/if we want to start modelling phenotypic info at the session level. I imagine this will also be important for age soon.
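As an illustration of "store the diagnosis at baseline", a sketch assuming a long-format table with one row per session; the column names are assumptions:

```python
import pandas as pd

# Keep only the earliest (baseline) session's row per participant.
# Column names ("participant_id", "session_id", "diagnosis") are assumed.
def baseline_diagnosis(df: pd.DataFrame) -> pd.DataFrame:
    ordered = df.sort_values(["participant_id", "session_id"])
    baseline = ordered.drop_duplicates(subset="participant_id", keep="first")
    return baseline[["participant_id", "diagnosis"]]
```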
I would be strongly in favor of subdividing our current "Assessment" class (which I feel is too broad for one vocab) into, at minimum, "Cognitive Assessment" and "Clinical Assessment". Given the conceptual focus of the Cognitive Atlas, it makes sense that it would have pretty limited coverage of terms for instruments that measure the severity of specific illnesses, and I think the number of tools we would not be able to model (esp. if we want to be able to support clinical/patient annotations) could quickly outnumber those we can if we stick to just this vocab for every assessment. One idea then could be to have a
What do you mean by "look inside" (e.g.,
I would agree that for these types of issues, we would just have to say that we can't model the column due to "poor data quality". On the bright side, I think this process is revealing the importance of prospective rather than retroactive bulk annotation, because these types of errors are very challenging to resolve by a third party / after the fact. +1 for the annotation tool route.
Thanks for your comments @alyssadai! I agree, we should discuss each of these. I'll link this conversation on the internal wiki to keep a record and then move some of these points over there: https://github.com/neurobagel/documentation/wiki/Neurobagel-Data-Model-limitations
Good datasets are:
Here is the main GDrive spreadsheet with all datasets.
To complete:
- a copyedit: a new sheet of the original .tsv in Annotate the OpenNeuro datasets #2 so that we can use these data for further processing
- For reviewer: please take a look at the document and see if you have any notes on how we could make this easier to parse.