capture transportable metadata about cohort definition, as part of the cohort json #2940

Open
gowthamrao opened this issue May 27, 2024 · 2 comments

Comments

@gowthamrao
Member

gowthamrao commented May 27, 2024

We have observed that recent changes in the vocabulary have an impact on the operating characteristics of phenotype algorithms, even within the same data source. Traditionally, the performance of a phenotype algorithm was considered dependent only on the cohort definition and the data sources tested. However, it is becoming apparent that the vocabulary version also plays a crucial role.

To manage this better, we propose including additional metadata in the cohort definition itself:

  • Include Vocabulary Version in Cohort Metadata:

    • Implement metadata capture within the cohort JSON so the user can record the vocabulary version against which the cohort definition was last updated and/or evaluated. This addition will help users track changes and understand the impact of vocabulary updates on their studies.
  • Standardize Metadata Framework:

    • Develop a more generalized framework for metadata using name-value pairs in the JSON format. This should include:
      • Standard fields like vocabularyVersion, firstDevelopedDate, and lastUpdatedDate.
      • Extendable user-defined fields that can describe broader metadata aspects, such as:
        • Library cohort status (e.g., isLibraryCohort: true/false)
        • Peer review status (e.g., isPeerReviewed: true/false)
        • Approval status (e.g., isApproved: true/false)
        • Usage in specific studies (e.g., usedInStudy: Study A)
        • Descriptive text blobs providing additional context or notes.
        • Author(s) attribution
    • Add a global hash signature ID that can uniquely identify the cohort JSON across Atlas instances. This hash should change whenever the core cohort definition logic changes.
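
To make the proposal above concrete, here is a minimal sketch in Python of what a metadata block and hash could look like. Everything in it is an assumption for discussion, not an existing Atlas or Circe feature: the `metadata` key, the `userDefined` and `definitionHash` field names, the choice of SHA-256, and the canonicalization via sorted keys are all hypothetical. The toy `cohort` dict only stands in for a real cohort expression.

```python
import hashlib
import json

# Hypothetical key for the proposed block; not part of today's cohort JSON.
METADATA_KEY = "metadata"


def core_logic_hash(cohort: dict) -> str:
    """Hash the cohort definition with the metadata block left out.

    Assumption: serializing with sorted keys and fixed separators is an
    acceptable canonical form. List order still affects the hash.
    """
    core = {k: v for k, v in cohort.items() if k != METADATA_KEY}
    canonical = json.dumps(core, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attach_metadata(cohort: dict, standard: dict, user_defined: dict) -> dict:
    """Return a copy of the cohort with a metadata block attached."""
    result = dict(cohort)
    result[METADATA_KEY] = {
        **standard,                        # standard name-value pairs
        "userDefined": dict(user_defined),  # extendable user fields
        "definitionHash": core_logic_hash(cohort),
    }
    return result


# Toy stand-in for a cohort expression (real ones carry much more).
cohort = {
    "ConceptSets": [],
    "PrimaryCriteria": {"CriteriaList": []},
}

with_meta = attach_metadata(
    cohort,
    standard={
        "vocabularyVersion": "EXAMPLE-VOCAB-VERSION",  # placeholder value
        "firstDevelopedDate": "2024-01-15",            # illustrative date
        "lastUpdatedDate": "2024-05-27",               # illustrative date
    },
    user_defined={
        "isLibraryCohort": True,
        "isPeerReviewed": False,
        "usedInStudy": "Study A",
    },
)

print(json.dumps(with_meta[METADATA_KEY], indent=2))
```

Because the hash is computed over everything except the metadata block, editing a status flag or a note leaves the hash untouched, while a change to the concept sets or criteria produces a new one. Whether the hash should also ignore cosmetic fields such as names is an open design question.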

Some of this metadata is already captured in public and private phenotype libraries. However, it then exists as an attribute of the cohort definition only in the context of that library. If we extend these attributes to be part of the cohort JSON itself, we can:

  • Facilitate Metadata Transportability:
    • Ensure that this metadata is structured in a way that allows it to be easily transported with the cohort JSON across different systems and studies, enhancing reproducibility and transparency.

This structured approach to metadata management will not only improve the fidelity of cohort definitions in the face of vocabulary changes but also enhance the overall utility and governance of cohorts in Atlas.

Discussed this idea with @dimshitc, Azza Shoaibi

This new metadata approach will make public and private libraries of cohort definitions easier to integrate. It would also allow Atlas to take on a "librarian" role, curating definitions for reuse.

@dimshitc

The JSON can keep both the vocabulary version the cohort was created with and the vocabulary version at the time of the latest update.
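
A rough sketch of how the two versions could be tracked, assuming hypothetical field names `vocabularyVersionCreated` and `vocabularyVersionLastUpdated` (neither exists in the cohort JSON today). The created version is written once and never overwritten; the last-updated version is refreshed on every save.

```python
from typing import Optional


def record_vocabulary_version(metadata: Optional[dict], current_version: str) -> dict:
    """Return a copy of the metadata with vocabulary versions recorded.

    Both field names are hypothetical, chosen only for illustration.
    """
    updated = dict(metadata or {})
    # Set only on the first save, so the original version is preserved.
    updated.setdefault("vocabularyVersionCreated", current_version)
    # Always reflects the vocabulary in use at the most recent save.
    updated["vocabularyVersionLastUpdated"] = current_version
    return updated


# Placeholder version strings, not real vocabulary releases.
first_save = record_vocabulary_version(None, "VOCAB-A")
later_save = record_vocabulary_version(first_save, "VOCAB-B")
print(later_save)
```

When the two values differ, that is a signal that the definition may need to be re-evaluated against the newer vocabulary.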

@gowthamrao
Member Author

Note: a generalizable idea is that this "metadata" could replace other metadata-like elements in the cohort JSON, such as the "description text box" or "tags".
