GREI HDV Task: Determine whether/how Dataverse can support hierarchical vocabularies #236

Open
5 tasks
Tracked by #174
cmbz opened this issue Apr 30, 2024 · 19 comments
Assignees
Labels
Dataverse Project (Issues related to Dataverse Project software), GREI Year 3 (Year 3 GREI task), GREI 2 Consistent Metadata, Harvard Dataverse (Issues related to Harvard Dataverse Repository), Project: NIH GREI (Tasks related to the NIH GREI project)

Comments

@cmbz
Contributor

cmbz commented Apr 30, 2024

Overview

  • Determine whether/how Dataverse can support hierarchical vocabularies

Deliverables

  • 1. Defining what hierarchical support means for DV/HDV. What would this look like for DV? What are the goals and how do we know that a solution meets those goals?
    • Then either:
    • Confirmation that Dataverse can currently provide hierarchical vocabulary support, or
    • Documentation describing how to implement hierarchical vocabulary support in Dataverse

Tasks:

  • Determine which vocabulary we will support first: UMLS or MeSH based on how large/complex they are, where they are hosted, and how often they are used.
  • Develop mockup based on example mockups we find

Resources

@cmbz changed the title from "GREI HDV Task: Determine whether/how Dataverse can support hierarchical vocabularies (community need)" to "GREI HDV Task: Determine whether/how Dataverse can support hierarchical vocabularies" on Apr 30, 2024
@cmbz added the GREI 2 Consistent Metadata and Project: NIH GREI labels on Apr 30, 2024
@cmbz added the GREI Year 3, Harvard Dataverse, and Dataverse Project labels on May 7, 2024
@cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project on Jul 1, 2024
@sbarbosadataverse

sbarbosadataverse commented Jul 19, 2024

Sonia and Julian met and discussed additional steps for getting this task done. See updates to "deliverables" above. Julian estimates he can devote time to this issue in mid-September.

@cmbz
Contributor Author

cmbz commented Aug 5, 2024

2024/08/05

  • Stefano reports that Jim indicated that the Dataverse script for external vocabularies can support hierarchical vocabularies, but work will be needed to store the data properly in Dataverse.
  • Gustavo indicates that the internal vocabulary does not currently support hierarchies. Discussion happened, but implementation did not progress. Design work would be needed to implement support.
  • September: @qqmyers @jggautier and @scolapasta will meet to discuss options and possibilities.

@qqmyers
Member

qqmyers commented Aug 5, 2024

I think I said the opposite - with the external vocab mechanism, Dataverse just stores a term URI, so there are no changes needed to support a hierarchical vocabulary - all the changes would be in the JavaScript (to be developed for a given vocabulary/service), where it should be simple to find a widget, mirror what other sites do, etc. to handle hierarchy or graph relations.

@cmbz
Contributor Author

cmbz commented Aug 5, 2024

Thanks @qqmyers for clarifying. @siacus and @scolapasta please see Jim's comment for an update to our understanding about hierarchical vocab support.

@cmbz removed their assignment on Nov 18, 2024
@cmbz
Contributor Author

cmbz commented Nov 18, 2024

2024/11/18: Ask @qqmyers to take a look, recommend next steps (e.g., development work needed), create relevant development issues.

@qqmyers
Member

qqmyers commented Nov 18, 2024

I'd suggest some ~non-dev work to start:

  • pick a vocab to start with, UMLS, or MeSH
  • see if there is a service or repository where we can get the official values and/or that we can link to for more info about a term (as we point to a user's ORCID profile)
  • find an example/do a mockup of what navigating the vocabulary should look like
  • assure that all we want/need for now is hierarchical navigation (versus, for example, needing to navigate to related terms elsewhere in the hierarchy)
  • define how the term should be displayed - by itself? as part of a hierarchy of parent terms?
  • define where the term will be added to Dataverse, i.e. which field in which block (note - making it one type of thing that goes in the citation keyword field makes development somewhat more complex than if it's a separate field, e.g. in the HEAL block or elsewhere)
  • identify if/where the term should go in the various metadata exports we have. (OAI-ORE and our JSON are trivial, the question is more about DataCite, DDI, etc.)

With the answers above, I think it should be straightforward to scope the JavaScript work needed to support the input and display, identify whether there's work related to a new metadata block, whether updates are needed to exporters, etc.

@cmbz
Contributor Author

cmbz commented Nov 19, 2024

@jggautier and @sbarbosadataverse do you have suggestions for the first bullet points Jim suggested here: #236 (comment) ?

@jggautier

Hmm, I'll try to think about it and reply later today

@cmbz moved this from SPRINT- NEEDS SIZING to On Hold ⌛ in IQSS Dataverse Project on Nov 21, 2024
@cmbz
Contributor Author

cmbz commented Nov 21, 2024

2024/11/21: Placing On Hold until @sbarbosadataverse and @jggautier figure out which vocabulary they want to investigate.

@bencomp

bencomp commented Jan 13, 2025

May I offer the idea that using a hierarchical vocabulary should help finding data even without additional visuals? E.g., if I tag a dataset "European politics", I should be able to find it when I search for the broader term "politics" (assuming the use of a vocabulary that includes those terms in a hierarchical relationship).

Just chiming in since this issue replaces one of the oldest Dataverse issues, while it appears to not cover all of the old issue's contents.
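The broader-term search idea can be sketched as follows. This is only an illustration under stated assumptions: the `BROADER` map, its terms, and both function names are hypothetical, and nothing here reflects how Dataverse search is actually implemented.

```python
# Hypothetical broader-term map: each term points to its parent term(s).
BROADER = {
    "European politics": ["politics"],
    "politics": ["social sciences"],
}


def with_broader_terms(term, broader=BROADER):
    """Return the term followed by every ancestor reachable via broader links."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a term can be reached through more than one parent
        seen.append(current)
        stack.extend(broader.get(current, []))
    return seen


def matches(query, dataset_keywords, broader=BROADER):
    """True if the query equals a keyword or one of the keyword's ancestors."""
    return any(query in with_broader_terms(k, broader) for k in dataset_keywords)
```

With this map, `matches("politics", ["European politics"])` is true, while a search for the narrower term does not match a dataset tagged only with the broader one.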

@jggautier

Thanks @bencomp. Could you write more about what additional visuals might mean?

@bencomp

bencomp commented Jan 13, 2025

For a moment I thought this issue was only/mostly about the visual navigation of a hierarchy, but on a second read I see that was my mistake.

@jggautier

Ah, okay, visuals like how depositors and curators might select terms from a hierarchical vocabulary. Thanks!

Yeah we definitely mean to consider all aspects of "support", like what was discussed in the older GitHub issues that this issue replaced.

@sbarbosadataverse

sbarbosadataverse commented Jan 21, 2025

Status: January 2025

@cmbz @qqmyers

Julian and I met and discussed some tasks associated with Jim's plan:

  1. It's still important to add to Jim's plan to keep in mind how people will use these vocabularies to search, per @bencomp's comment

  2. Determine which vocabulary is more complex, which is used more often, how easy it is to access the terms, and which would interact better with controlled vocab functionality

  3. Are there already platforms allowing users access to these terms that we can use for our mockup examples?

  4. The vocabulary should be accessible to all users and not within blocks - as Jim pointed out this would make development more complicated but the goal is for HDV wide-use -- @qqmyers

In addition to what Jim outlined, and to happen in parallel:

  1. Learning about how people are already using or plan to use these terms in DV (e.g. the MORU collection, AfricArXiv, HEAL)

  2. Consider use cases for those wanting to use multiple controlled vocabularies in the same field (the Keyword field, for example) - @qqmyers Would it be problematic to build support for one vocabulary and then modify it later to support multiple controlled vocabularies? Should we consider a "multiple vocabulary support" model to start? @jggautier can share the community conversation on multiple vocab support. (Is someone in the community already supporting this? We can email the installations and ask.)

@cmbz
Contributor Author

cmbz commented Jan 23, 2025

Thanks @sbarbosadataverse and @jggautier looks like a great plan to me. Curious about @qqmyers thoughts?

@qqmyers
Member

qqmyers commented Jan 23, 2025

Not sure what to comment on: re: 4 - not sure why implementation in a new block can't be available site-wide, but, if the idea is to have this in the citation block Keywords field - it would be required to be on for everyone (so non-medical collections would have to see any medical vocab).

re the second: The way our ext. vocab service currently works is that there can be one script per field. That means that if you want to support one hierarchical vocab and free-text entries, the script has to support that (most of our current ones do) and if you want multiple vocabs, again the script has to support it (currently only our skosmos script does that and it requires both vocabs to be on the same server.) Same for multiple vocabs and free text - that would all be built into a single script.

There is interest in the community in allowing multiple scripts on a given field and even allowing different scripts to be turned on for a given field in different collections. If/when that is designed/implemented, individual scripts could probably stop doing anything to handle free text or multiple vocabs. Which ~means that starting with single vocab per field is fine/it's extra work to support multiple vocabs and, until there's a clear design, work towards multiple vocabs in one field could end up being one-off/have to be redone later. (I don't have a good guess as to when redesign might get going - probably faster if Harvard is also interested due to GREI).

@sbarbosadataverse

sbarbosadataverse commented Feb 3, 2025

  • @jggautier will review both vocabs (UMLS and MeSH) to determine: 1 - How large/complex they are, 2 - Where they are hosted, 3 - How often they are used.

  • We decided to start by considering one vocabulary for implementation and work on multiple vocabulary support later.

  • Communities to contact for info on use: MORU, HEAL, AfricArXiv

@jggautier

jggautier commented Feb 11, 2025

I'm adding info in this comment and will continue updating it as I learn more about MeSH and UMLS. The info includes some assumptions that I can verify or correct as we learn from groups that we think would be interested in or benefit from using these vocabularies to describe what they publish in Dataverse repositories (MORU, HEAL, AfricArXiv) and as we learn from groups who have already used these vocabularies.

An early assumption is that depositors, curators and other types of users are using Dataverse's Keyword and Topic Classification fields to add MeSH and UMLS terms that describe deposits published in Dataverse repositories, so I'm including questions and assumptions about those two fields.

MeSH

Purpose of MeSH
MeSH, or Medical Subject Headings, was created to support searching by topic for biomedical and health-related information and documents. It's used in the MEDLINE database of journal citations and NLM's retrieval systems, such as PubMed, "a free full-text archive of biomedical and life sciences journal literature at the US National Institutes of Health's National Library".

Size and complexity of MeSH

  • I'm not sure about the number of terms
  • The terms have parent-child relationships, synonymous or related relationships, and preferred terms
  • I'm not sure about the depth of the hierarchy of terms, but I've seen depths of 5
  • The same term - or different concepts represented by the same term or label - can be a child of multiple parent terms or concepts. See the "tree view" at https://meshb.nlm.nih.gov/treeView

The MeSH model is also summarized on a page in the UMLS site. See https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MSH
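To make the "multiple parents" point concrete, here is a minimal sketch. It assumes the dot-separated tree-number notation shown in the MeSH tree view, where each shorter prefix identifies an ancestor position. The tree numbers below are made-up placeholders, not real MeSH records, and both function names are hypothetical.

```python
def ancestor_tree_numbers(tree_number):
    """List the ancestor positions of one tree number, broadest first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def parent_tree_numbers(tree_numbers):
    """Immediate parent position for each place a term occupies in the tree."""
    parents = set()
    for tree_number in tree_numbers:
        ancestors = ancestor_tree_numbers(tree_number)
        if ancestors:  # a single-segment number has no dotted parent
            parents.add(ancestors[-1])
    return sorted(parents)


# A term listed at two placeholder positions therefore has two parents.
example_positions = ["A01.200.300", "B02.400"]
```

Calling `parent_tree_numbers(example_positions)` yields one parent per position, which is why a tree widget or a search index would need to handle a term appearing in more than one branch.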

Where are the terms hosted?

How and how often are the terms used in Dataverse repositories?
As of August 2024, 29 known Dataverse installations have published at least 683 deposits where MeSH terms were added in the Keyword and Topic Classification fields.

Here are the 10 installations that have published most of these deposits:
Image

I found those 683 datasets by looking for datasets where:

  • a MeSH URL was entered in any of the Keyword Term, Keyword Term URI, Keyword Controlled Vocabulary URL, Topic Classification Term, Topic Classification Controlled Vocabulary Name, or Topic Classification Controlled Vocabulary URL fields
  • "mesh" was entered (case insensitive) in the Keyword Controlled Vocabulary Name and Topic Classification Controlled Vocabulary Name fields
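The two matching rules above can be sketched as a small filter. This is not the script that produced the 683 count: the dictionary keys are hypothetical stand-ins for the Keyword and Topic Classification child fields, and the URL heuristic (an NLM address that mentions "mesh") is an assumption about what counts as a MeSH URL.

```python
# Hypothetical field names for one Keyword or Topic Classification entry;
# they are not Dataverse's actual API field names.
URL_FIELDS = ("term", "term_uri", "vocabulary_url")


def looks_like_mesh_url(value):
    """Assumed heuristic: an NLM address that mentions MeSH."""
    text = (value or "").lower()
    return "nlm.nih.gov" in text and "mesh" in text


def uses_mesh(entry):
    """Apply both rules from the list above to a single entry (a dict)."""
    url_hit = any(looks_like_mesh_url(entry.get(field)) for field in URL_FIELDS)
    # Case-insensitive match on the vocabulary name; this also catches a
    # MeSH URL typed into the name field.
    name_hit = "mesh" in (entry.get("vocabulary_name") or "").lower()
    return url_hit or name_hit
```

A plain substring match on "mesh" can produce false positives (for example, a vocabulary name that merely contains those letters), so results from a filter like this would still need spot checks.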

The Keyword fields are used more than the Topic Classification fields, and sometimes both fields are used in the same deposit.

MeSH terms entered in Keyword fields in a dataset, https://doi.org/10.15139/S3/V8G3QG, published in UNC Dataverse
Image

MeSH terms entered in Topic Classification fields in a dataset, https://doi.org/10.18419/DARUS-4230, published in DaRUS
Image

MeSH terms entered in the Keyword and Topic Classification fields in a dataset, https://hdl.handle.net/20.500.12682/rdp/CN73BH, published in DOMUS Dados
Image

Why do people use MeSH terms and how do people choose terms?
Trained catalogers use MeSH to describe biomedical and life sciences journal articles that are indexed in MEDLINE, the largest subset of PubMed, and to make those articles easier to find. An algorithm called the Medical Text Indexer-NeXt Generation algorithm is used to choose MeSH terms that describe articles in MEDLINE. Then human catalogers review those terms before they're applied to the articles. See https://support.nlm.nih.gov/kbArticle/?pn=KA-05326

Julian's assumptions:
When researchers and curators have used the Dataverse metadata fields to enter MeSH terms, they:

  • Have used the tools on NLM's websites, such as the MeSH browser website, to find terms related to their data. Then they pasted the term values and links into the keyword and topic classification fields. Brown University's library has a guide at https://libguides.brown.edu/c.php?g=1104782&p=8054880 about using those tools to find MeSH terms.
  • Are using MeSH terms that have already been used to describe articles in PubMed that are associated with the data they're publishing
  • Have learned about and used MeSH while looking for articles in PubMed and think that they can also use MeSH to help others find the data they're publishing

Who to contact to learn more?

  • @sbarbosadataverse wrote that the managers of the MORU, HEAL, and AfricArXiv collections have either expressed interest in or would be interested in using the MeSH vocab to describe what they publish in Dataverse repositories.

  • Jessica Sedgwick may be the most responsive contact to learn about the use of MeSH terms in the Center for the History of Medicine Dataverse, which is the collection in Harvard Dataverse with the most datasets where MeSH terms are used. In 2020 Jessica and I emailed each other about the collection.

  • Since the Center for the History of Medicine is affiliated with Harvard's Countway Library, it might be worthwhile to contact Jessica Pierce and maybe the LMA Research Data Management Working Group for feedback about the use of MeSH terms for describing deposits in Harvard Dataverse.

  • Scott Lapinski leads the Publishing & Data Services department at Countway Library at Harvard and is familiar with how MeSH is used to describe journal articles.

  • Andrew Creamer manages the Brown University Dataverse and NeuroNex Bioluminescence Hub Dataverse, which have the second-most datasets where MeSH terms are used. In those collections, MeSH terms are used in both the Keyword field and the Topic Classification field. I spoke with Andrew in 2019 when he first asked about using Harvard Dataverse for Brown University (https://help.hmdc.harvard.edu/Ticket/Display.html?id=277149). @sbarbosadataverse also emailed Andrew in 2020 about CoreTrustSeal certification, and his replies about how the collection is managed may include helpful context. He also emailed support in April 2024 for help with uploading files.

  • Philipp Conzett helps manage DataverseNO and during an interview in mid-February 2025 he shared slides about the installation's curation practices and that "Technically, it is already possible to link various metadata fields to external controlled vocabularies/ontologies, e.g. Keyword to MeSH", that "when we have better capacity, we will follow this up", and that "This has also been included as a task in the NFR INFRASTRUCTURE application."

  • Have the folks who lead the other GREI repositories, especially Vivli, considered the use of MeSH or similar vocabularies, e.g. in the UMLS, for describing studies they publish?

UMLS

Purpose of UMLS
UMLS, or Unified Medical Language System, "is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems."

Size and complexity of UMLS
The vocabularies include MeSH, CPT, ICD-10-CM, LOINC, RxNorm, and SNOMED CT. SNOMED CT is the world's "most comprehensive and precise, multilingual health terminology", used in "Electronic Health Record software applications to represent relevant clinical information consistently" and "help exchange clinical health information between systems". See https://www.nlm.nih.gov/healthit/snomedct.

For the full list of the approximately 100 vocabularies in UMLS, see the UMLS Vocabulary Documentation page.

UMLS includes relationships among the vocabularies' terms or concepts. Mappings between terms are described on the UMLS Metathesaurus - Mapping Projects page.

Where are the terms hosted?
See the "Accessing the UMLS" section of the UMLS home page, which includes the UMLS API Technical Documentation.

Why do people use UMLS?
Julian's assumptions:
UMLS is not one list of terms and their relationships, like MeSH is, but a collection of terms from different vocabularies; relationships or mappings between those terms; language translations; and software infrastructure to access those terms and their relationships. So the purpose of exploring Dataverse "support" of UMLS is different than the purpose of exploring "support" for MeSH and other "hierarchical vocabularies".

Whereas hierarchical vocabulary support might include making it easier for users to add terms and improving search by relying on relationships between terms in the same vocabulary, such as parent-child and synonymous relationships, "support" for UMLS might include relying on the relationships between terms in different vocabularies. For example, if I find a dataset that's described using a SNOMED CT vocabulary term, can I find other datasets described using equivalent or narrower terms in the ICD-10-CM vocabulary and other related vocabularies?
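The cross-vocabulary question can be sketched like this, assuming terms from different source vocabularies are tied together by a shared concept identifier, which is the role concept unique identifiers play in the UMLS Metathesaurus. Every code, identifier, and dataset name below is a placeholder, and the sketch covers only equivalent terms, not narrower ones.

```python
# Placeholder mapping from (vocabulary, code) to a shared concept identifier.
CONCEPT_OF = {
    ("SNOMEDCT_US", "code-1"): "CUI-A",
    ("ICD10CM", "code-2"): "CUI-A",
    ("MSH", "code-3"): "CUI-B",
}

# Placeholder datasets, each described with terms from one vocabulary.
DATASETS = {
    "dataset-1": [("SNOMEDCT_US", "code-1")],
    "dataset-2": [("ICD10CM", "code-2")],
    "dataset-3": [("MSH", "code-3")],
}


def datasets_sharing_concept(vocab, code, concept_of=CONCEPT_OF, datasets=DATASETS):
    """Find datasets tagged with any term mapped to the same concept."""
    concept = concept_of.get((vocab, code))
    if concept is None:
        return []  # unknown term: nothing to expand
    return sorted(
        name
        for name, terms in datasets.items()
        if any(concept_of.get(term) == concept for term in terms)
    )
```

Starting from the SNOMED CT placeholder term, the lookup also returns the dataset tagged with the ICD-10-CM placeholder, because both map to the same concept.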

Whoever included UMLS might also have imagined "UMLS support" as a way to support multiple hierarchical vocabularies instead of only MeSH.

Who to contact to learn more?

  • @sbarbosadataverse wrote that the managers of the MORU, HEAL, and AfricArXiv collections have either expressed interest in or would be interested in using UMLS vocabulary to describe what they publish in Dataverse repositories.

  • It might be worthwhile to contact Jessica Pierce and maybe the LMA Research Data Management Working Group for feedback about the use of UMLS for describing deposits in Harvard Dataverse.

Keyword and Topic Classification metadata fields

Why do people decide to use the keyword or topic classification fields?
Julian's assumptions:

  • Some users aren't sure of the difference between the fields, so they use both just in case (see Unresolved feedback from community review of Citation metadata fields dataverse#8467)
  • The keyword fields are used more often than the topic classification fields because the keyword fields appear on the "create" page while the topic classification fields do not. Some users never see the topic classification fields, either because they don't edit the dataset they just created or because they don't notice those fields among the rest of the fields on the "edit" page.

The keyword and topic classification fields are influenced by properties or elements in the DDI Codebook standard. How do that standard's maintainers define these properties? How do they think the two properties are different? How do they expect others to use them?
Keyword definition in Codebook 2.1 and Codebook 2.5:

Words or phrases that describe salient aspects of a data collection's content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Maps to Dublin Core Subject element. The "vocab" attribute is provided for specification of the controlled vocabulary in use, e.g., LCSH, MeSH, etc. The "vocabURI" attribute specifies the location for the full controlled vocabulary.

Topic Classification definition in Codebook 2.1 and Codebook 2.5:

The classification field indicates the broad substantive topic(s) that the data cover. Library of Congress subject terms may be used here. The "vocab" attribute is provided for specification of the controlled vocabulary in use, e.g., LCSH, MeSH, etc. The "vocabURI" attribute specifies the location for the full controlled vocabulary. Maps to Dublin Core Subject element. Inclusion of this element in the codebook is recommended.
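As an illustration of the "vocab" and "vocabURI" attributes in the two definitions, the sketch below builds a small DDI Codebook-style fragment. It assumes the element names `subject`, `keyword`, and `topcClas` from the Codebook schema; the term values and the function name are only examples, and this is not Dataverse's DDI exporter.

```python
import xml.etree.ElementTree as ET


def ddi_subject(keywords, topics):
    """Build a <subject> fragment from (text, vocab, vocab_uri) tuples."""
    subject = ET.Element("subject")
    for tag, entries in (("keyword", keywords), ("topcClas", topics)):
        for text, vocab, vocab_uri in entries:
            # Both elements carry the same two attributes per the definitions.
            element = ET.SubElement(
                subject, tag, {"vocab": vocab, "vocabURI": vocab_uri}
            )
            element.text = text
    return ET.tostring(subject, encoding="unicode")


fragment = ddi_subject(
    keywords=[("Neoplasms", "MeSH", "https://meshb.nlm.nih.gov/")],
    topics=[("Diseases", "MeSH", "https://meshb.nlm.nih.gov/")],
)
```

Because both elements accept the same attributes, the fragment shows why the two fields can end up holding identical MeSH terms and why the difference between them has to come from how each is used rather than from the markup.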

@cmbz
Contributor Author

cmbz commented Feb 14, 2025

Status: February 2025

  • Pending
