
Generalizability

Tiffany J. Callahan edited this page Aug 18, 2021 · 10 revisions


Purpose

Study Data

OHDSI Study Protocol: OHDSI_Study_Protocol_v1.0

Collaborators: Anna Ostropolets and Patrick Ryan

This study aimed to evaluate and characterize the generalizability or coverage of the Observational Medical Outcomes Partnership (OMOP) vocabulary terms included in the OMOP2OBO mapping set to OMOP vocabulary terms utilized in the Observational Health Data Sciences and Informatics (OHDSI) Concept Prevalence study sites.

As described here, the Concept Prevalence study was designed to provide researchers with additional context regarding the frequency at which different clinical codes occur across the OHDSI research network:

We want to study the usage patterns of Concepts across different OMOP CDM instances. This in itself could be useful information to answer many questions, but we have a concrete reason: For any one medical entity, the granularity of codes captured in a data source can vary greatly. For example, Chronic Kidney Disorder stage II can be coded as ICD9 code 585.2 Chronic kidney disease, Stage II (mild); 585.9 Chronic kidney disease, unspecified or even as 586 Renal failure, unspecified. However, this information is key for any cohort definition. Currently, researchers have no way of knowing whether a certain concept with high granularity is even available for selection, or whether they have to use a generic concept in combination with some auxiliary information to define the cohort correctly. Each data source instance is a black box and knowledge about the distribution of the concepts is limited to the very instance researchers have access to. But OHDSI Network Studies are dependent on cohort definitions that work across the network.


Wiki Organization





Main Analysis


The main research question for this portion of the evaluation was: how does the coverage of the OMOP vocabulary terms present in the OMOP2OBO mappings differ across the OHDSI Concept Prevalence study sites?

The specific aims of this study were as follows:

  • Examine OMOP2OBO coverage across the Concept Prevalence sites by identifying:
    • OMOP vocabulary terms that exist in OMOP2OBO and one or more sites
    • OMOP vocabulary terms only present in OMOP2OBO and none of the Concept Prevalence sites
    • OMOP vocabulary terms only present in one or more of the sites
  • Demonstrate the potential for [molecular] biological inference of OMOP2OBO by characterizing differences in ontology term enrichment across the Concept Prevalence sites when varying different aspects of data provenance (e.g., site type, clinical specialty, and site location).

Study Sites

In addition to the Concept Prevalence study sites (n=22), data were obtained from two independent academic medical centers. High-level descriptions of each site, including the total number of records and concepts, are provided below.

| Database | Type | Location | Record Count | Concept Count |
| --- | --- | --- | --- | --- |
| Ajou University Database (Ajou) | EHR | Non-US | 30,238,709 | 6,055 |
| Australian Electronic practice based research network (AU-ePBRN) | EHR | Non-US | 11,658,378 | 5,027 |
| Columbia University Medical Center Database (CUMC) | EHR | US | 938,078,465 | 21,502 |
| IBM MarketScan Commercial Database (CCAE) | CLAIMS | US | 12,649,562,658 | 31,570 |
| IBM MarketScan Medicare Supplemental Database (MDCR) | CLAIMS | US | 2,770,787,154 | 25,121 |
| IBM MarketScan Multi-State Medicaid Database (MDCD) | CLAIMS | US | 4,283,172,117 | 19,133 |
| IQVIA Disease Analyzer (DA) France | EHR | Non-US | 39,632,134 | 3,423 |
| IQVIA Disease Analyzer (DA) Germany | EHR | Non-US | 851,853,377 | 9,276 |
| IQVIA Longitudinal Patient Data (LPD) Australia | EHR | Non-US | 56,940,803 | 5,833 |
| IQVIA US Ambulatory EMR (AmbEMR) | EHR | US | 10,634,058,375 | 62,161 |
| IQVIA US Hospital Charge Data Master (CDM) | EHR | US | 4,857,228,360 | 19,352 |
| IQVIA US LRxDx Open Claims (Open Claims) | CLAIMS | US | 71,678,847,042 | 20,083 |
| Japan Medical Data Center database (JMDC) | EHR | Non-US | 1,184,325,523 | 6,833 |
| Korea National Health Insurance Service / National Sample Cohort (NHIS/NSC Korea) | CLAIMS | Non-US | 323,096,899 | 6,667 |
| Medical Information Mart for Intensive Care III (MIMIC3) | EHR | US | 124,127,038 | 3,781 |
| Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status (SES) | CLAIMS | US | 13,369,194,028 | 36,943 |
| Optum De-Identified Clinformatics Data-Mart-Database—Date of Death (DOD) | CLAIMS | US | 9,716,879,363 | 34,853 |
| Optum De-identified Electronic Health Record Dataset (PANTHER) | EHR | US | 27,894,204,112 | 59,777 |
| Premier Healthcare Database (PREMIER) | CLAIMS | US | 16,794,698,039 | 18,903 |
| Stanford Medicine Research Data Repository (STaRR) | EHR | US | 416,175,821 | 11,161 |
| The Healthcare Cost and Utilization Project Nationwide Inpatient Sample (HCUP) | EHR | US | 744,807,853 | 9,391 |
| Tufts Medical Center Database (Tufts) | EHR | US | 66,863,985 | 21,118 |
| UCHealth | EHR | US | 1,215,613,326 | 19,073 |
| USC PScanner | EHR | US | 29,703,213 | 11,476 |

Data

For each data site, standard concepts used at least once in practice were obtained from the Condition Occurrence (i.e., SNOMED-CT), Drug Exposure (i.e., ingredient-level RxNorm), and Measurement (i.e., LOINC) tables. The total frequency was obtained for every concept. Consistent with the Concept Prevalence study, concepts occurring fewer than 10 times were ignored, and all remaining concepts occurring fewer than 100 times were assigned a count of 100.
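The count-handling rule described above can be illustrated with a short Python sketch. This is not the study's code: the function name `apply_count_thresholds`, its parameter names, and the sample concept ids are hypothetical and exist only for illustration.

```python
def apply_count_thresholds(counts, drop_below=10, floor=100):
    """Drop concepts seen fewer than `drop_below` times and raise
    any remaining count below `floor` up to `floor`."""
    cleaned = {}
    for concept_id, n in counts.items():
        if n < drop_below:
            continue  # concept is ignored entirely
        cleaned[concept_id] = max(n, floor)
    return cleaned


# Hypothetical concept ids and raw counts for a single site.
raw = {1001: 4, 1002: 57, 1003: 100, 1004: 2500}
cleaned = apply_count_thresholds(raw)
print(cleaned)
```

Under this rule no retained concept can have a count below 100, which is why every frequency range reported in the Results starts at 100.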



Error Analysis


SQL Query: OMOP2OBO_ConceptPrevalence_ErrorAnalysis.sql

An error analysis was performed to provide insight into the Concept Prevalence study concepts that were not covered by the OMOP2OBO mapping sets. The OMOP2OBO mapping set was built from the OMOP common data model (CDM) v5.0, which contained vocabulary concepts with a timestamp of June 26, 2018. Given how quickly the vocabulary changes, we hypothesized that some of the concepts that we were unable to cover could be brand new concepts and/or concepts which have been updated or replaced by pre-existing concepts.

To perform this analysis, the following SQL query was run against a current version of the OMOP CDM:

-- Relationships created after the OMOP2OBO vocabulary snapshot (2018-06-26)
-- for every concept contained in the OMOP2OBO mapping tables.
SELECT DISTINCT
  r.relationship_id,
  c1.concept_id AS SOURCE_CONCEPT_ID,
  c1.concept_name AS SOURCE_CONCEPT_LABEL,
  c2.concept_id AS TARGET_CONCEPT_ID,
  c2.concept_name AS TARGET_CONCEPT_LABEL
FROM
  `sandbox-omop.oct_2020.concept_relationship` r
  JOIN `sandbox-omop.oct_2020.concept` c1 ON c1.concept_id = r.concept_id_1
  JOIN `sandbox-omop.oct_2020.concept` c2 ON c2.concept_id = r.concept_id_2
WHERE
  -- Source concept must appear in one of the OMOP2OBO mapping tables
  r.concept_id_1 IN (
    SELECT concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Conditions_Concepts_Merged`
    UNION DISTINCT
    SELECT ingredient_concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Medications_Concepts_Merged`
    UNION DISTINCT
    SELECT concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Measurements_Concepts_Merged`
  )
  AND r.relationship_id IN ("Concept replaced by", "Maps to", "Concept same_as from", "Concept poss_eq from", "Concept was_a from", "Is a")
  -- Keep only relationships that became valid between the two vocabulary versions
  AND r.valid_start_date > '2018-06-26'
  AND r.valid_start_date < '2020-10-17'
ORDER BY r.relationship_id;

The relationship_id column contains different relationships that can be utilized to explain the relationship between OMOP concept-ids. The relationship_ids included in the query above are organized such that they allow us to identify two types of scenarios:

  1. Newly Added Concepts: Concepts that did not exist in the version of the OMOP CDM used to create the OMOP2OBO mappings, but that do exist in the current CDM
  2. Updated Concepts: Concepts that existed in the version of the OMOP CDM used to create the OMOP2OBO mappings, but which have been updated and now exist under a new concept_id.

The table below organizes the OMOP CDM relationship_ids by scenario.

| Scenario Type | Relationship_ID |
| --- | --- |
| Newly Added Concepts | Maps to |
| Newly Added Concepts | Concept poss_eq from (synonyms) |
| Newly Added Concepts | Concept same_as from (synonyms) |
| Newly Added Concepts | Concept was_a from (concept type) |
| Newly Added Concepts | Is a (concept type) |
| Replaced Concept | Concept replaced by |
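
As a rough illustration of how query results could be grouped by scenario, the sketch below encodes the relationship-to-scenario assignments as a Python dictionary. The function name `group_by_scenario`, the row layout, and the concept ids are assumptions made for this example and are not taken from the study code.

```python
from collections import defaultdict

# Scenario assignment for each relationship_id used in the query.
SCENARIO_BY_RELATIONSHIP = {
    "Maps to": "Newly Added Concepts",
    "Concept poss_eq from": "Newly Added Concepts",
    "Concept same_as from": "Newly Added Concepts",
    "Concept was_a from": "Newly Added Concepts",
    "Is a": "Newly Added Concepts",
    "Concept replaced by": "Replaced Concept",
}


def group_by_scenario(rows):
    """Group (relationship_id, source_id, target_id) rows by scenario.
    Rows with a relationship_id outside the mapping are skipped."""
    grouped = defaultdict(list)
    for relationship_id, source_id, target_id in rows:
        scenario = SCENARIO_BY_RELATIONSHIP.get(relationship_id)
        if scenario is not None:
            grouped[scenario].append((source_id, target_id))
    return dict(grouped)


# Hypothetical rows; the concept ids are placeholders.
rows = [("Maps to", 1, 2), ("Concept replaced by", 3, 4),
        ("Is a", 5, 6), ("Subsumes", 7, 8)]
grouped = group_by_scenario(rows)
print(grouped)
```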


Analysis


We used this information to categorize uncovered concepts (i.e., concepts included in the Concept Prevalence data sets, but missing from the OMOP2OBO mapping set). Specifically, for each clinical domain we obtained three lists:

  1. Uncovered concepts in the error analysis data
  2. Uncovered concepts in the OMOP2OBO mapping data, but ineligible for mapping
  3. Uncovered concepts that were truly unable to be accounted for by existing data sources

For lists 1 and 2, we aimed to explain the uncovered concepts by categorizing them according to an explanation for their missingness (i.e., concept present in newer OMOP vocabulary or replaced concept). For all the lists, we also obtained prevalence information for each concept as the frequency of use within and across the Concept Prevalence data sites, which was used as a metric to measure the importance of each uncovered concept.
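
The three lists can be thought of as a partition of the uncovered concepts. The Python sketch below shows one way to build such a partition with set operations. The order of precedence (error analysis first, then excluded mappings) is an assumption, as are the function name `categorize_uncovered` and the toy concept ids.

```python
def categorize_uncovered(uncovered, error_analysis_ids, excluded_ids):
    """Split uncovered concept ids into three disjoint sets."""
    # 1. Explained by a newer vocabulary version.
    error_analysis = uncovered & error_analysis_ids
    # 2. Present in OMOP2OBO but filtered out as ineligible.
    excluded = (uncovered - error_analysis) & excluded_ids
    # 3. Everything that neither source can explain.
    truly_missing = uncovered - error_analysis - excluded
    return error_analysis, excluded, truly_missing


# Toy concept ids for illustration only.
uncovered = {11, 12, 13, 14, 15}
error_analysis, excluded, truly_missing = categorize_uncovered(
    uncovered, error_analysis_ids={11, 99}, excluded_ids={11, 12, 13}
)
print(sorted(error_analysis), sorted(excluded), sorted(truly_missing))
```

Because the three sets are disjoint and exhaustive, their sizes add up to the number of uncovered concepts.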



Results


Results are presented below by clinical domain. As shown in Figure 1, the OMOP vocabulary terms included in the OMOP2OBO mapping set provided exceptional coverage, which differed both by Concept Prevalence study site and clinical domain.

Figure 1: OMOP2OBO - Concept Prevalence Coverage

Figure 1 presents the coverage of the OMOP2OBO mappings using Concept Prevalence study data. The panels on the left show the distribution of the Overlap (i.e., OMOP concepts that exist in OMOP2OBO and one or more Concept Prevalence sites), Concept Prevalence Only, and OMOP2OBO Only sets. The panels on the right show the Error Analysis Concepts (i.e., concepts that can be accounted for in a newer OMOP CDM version), Excluded Set (i.e., purposefully unmapped or not yet mapped concepts), and Truly Missing (i.e., the concept’s missingness cannot easily be accounted for) sets. These distributions were created for condition concepts (A and D), drug ingredients (B and E), and measurements (C and F).


CONDITIONS


The OHDSI Concept Prevalence data contained 62,335 unique OMOP condition vocabulary concepts from 24 sites. After filtering the OMOP2OBO mappings to remove all entries where all ontologies were "NONE" or "NOT YET MAPPED" and all non-standard concepts, 92,367 concepts remained eligible for use in the coverage study. This means that all purposefully unmapped concepts (i.e., findings, injuries, complications, and carrier status) were kept within the data set as long as at least one of the other mapped ontologies for the given concept was not an unmapped concept of type "NOT YET MAPPED". These data were utilized for all condition coverage experiments.

The OMOP2OBO condition set contained 92,367 OMOP condition concept ids, which covered 92.51% (weighted coverage: 99.46%) of the 62,335 Concept Prevalence condition concepts. There were 34,704 OMOP2OBO concepts that were not included in the Concept Prevalence set and 4,672 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (A):

  • Overlap: 57,663 OMOP2OBO concepts (26,807 Concepts Used in Practice, 30,856 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 526.96 times (100.0-87,285,164.39).
  • OMOP2OBO Only: 34,704 OMOP2OBO concepts (2,272 Concepts Used in Practice, 32,432 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 131.65 times (100.0-39,975.0).
  • Concept Prevalence Only: 4,672 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 173.57 times (100.0-8,254,186.5).
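
The sketch below shows, in Python, how the three sets and the two coverage figures could be derived from a set of OMOP2OBO concept ids and a table of Concept Prevalence counts. The source does not define weighted coverage; here it is assumed to be the share of total concept frequency that falls on covered concepts. The function name `coverage_summary` and the toy data are hypothetical.

```python
def coverage_summary(omop2obo_ids, prevalence_counts):
    """Return the three concept sets plus unweighted and
    frequency-weighted coverage (both as percentages)."""
    prevalence_ids = set(prevalence_counts)
    overlap = omop2obo_ids & prevalence_ids
    omop2obo_only = omop2obo_ids - prevalence_ids
    prevalence_only = prevalence_ids - omop2obo_ids
    coverage = 100 * len(overlap) / len(prevalence_ids)
    # Assumed definition: covered frequency divided by total frequency.
    weighted = (100 * sum(prevalence_counts[c] for c in overlap)
                / sum(prevalence_counts.values()))
    return overlap, omop2obo_only, prevalence_only, coverage, weighted


# Toy example: concept id -> frequency in the Concept Prevalence data.
counts = {2: 100, 3: 300, 5: 100}
overlap, only_o2o, only_cp, coverage, weighted = coverage_summary({1, 2, 3, 4}, counts)
print(sorted(overlap), sorted(only_o2o), sorted(only_cp))
print(round(coverage, 2), round(weighted, 2))

# Unweighted condition coverage recomputed from the reported counts.
print(round(100 * 57663 / 62335, 2))
```

The last line reproduces the reported 92.51% from the 57,663 overlapping concepts and the 62,335 Concept Prevalence condition concepts.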

Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO condition occurrence concepts for each Concept Prevalence study site (Figure 2). Across the Concept Prevalence study sites, coverage ranged from 93.04-99.69%. A Chi-Square test of independence with Yates' correction was run and revealed a significant association between the database and coverage (χ²(23) = 7,559.11, p<0.0001). In order to better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 107 of the 276 database comparisons had significantly different OMOP2OBO coverage (ps<0.001).
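
For readers who want to see the mechanics of the test described above, the following is a minimal pure-Python sketch of a Chi-Square statistic on a site-by-coverage contingency table, with an optional Yates' continuity correction, plus the number of pairwise post-hoc comparisons. It does not compute p-values, the function name `chi_square_independence` and the toy table are hypothetical, and the 0.05 family-wise alpha is an assumption (the source does not state it).

```python
from math import comb


def chi_square_independence(table, yates=False):
    """Chi-Square statistic and degrees of freedom for a table whose
    rows are sites and whose columns are [covered, not covered]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink |O - E| by 0.5, never below 0.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Toy 2x2 comparison of two sites (covered, not covered).
stat, dof = chi_square_independence([[90, 10], [80, 20]], yates=True)
print(round(stat, 4), dof)

# Pairwise post-hoc comparisons and an assumed Bonferroni-adjusted alpha.
n_pairs = comb(24, 2)
print(n_pairs, 0.05 / n_pairs)
```

With 24 sites the omnibus test has 23 degrees of freedom and there are 276 pairwise comparisons, matching the figures reported above.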

Error Analysis
The results are visualized in Figure 1 (D). Of the 4,672 concepts not covered by OMOP2OBO, 367 could be accounted for by a newer version of the OMOP CDM (i.e., Error Analysis Concepts), 4,231 could be accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 74 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below.

  • Error Analysis Concepts: A total of 367 (7.86%) missing concepts were accounted for using the current version of the OMOP CDM using the OMOP concept_relationship table. These concepts occurred in an average of 2.64 Concept Prevalence study sites with a mean frequency of 27,412.262 (100-3,539,698.5). The 367 missing concepts could be traced to 1,423 source_concept_ids in the original OMOP2OBO map set using the following relationship_ids: Is a (n=1,225), Maps to (n=167), and Concept replaced by (n=31).

  • Excluded Concepts: A total of 4,231 (90.56%) OMOP concepts could be found in the set of data which were initially filtered from the original OMOP2OBO mapping set. These concepts occurred in an average of 1.65 Concept Prevalence study sites and had a mean frequency of 6,139.32 (100-8,254,186.5). These concepts were initially excluded for one of the following reasons:

    • 3,400 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP type "NOT YET MAPPED" and MONDO type "FINDING"
    • 796 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP and MONDO type "NOT YET MAPPED"
    • 35 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP and MONDO type "NONE"
  • Truly Missing Concepts: A total of 74 (1.58%) OMOP concepts were truly missing. These concepts occurred in an average of 2.74 Concept Prevalence study sites and had a mean frequency of 5,320.06 (100-100,483). The top five most frequently occurring missing concepts were (with average frequency across the 24 sites and number of sites with concept):

    1. increased fluid intake (n=100,483; 1 site)
    2. disease caused by 2019-nCoV (n=93,585; 1 site)
    3. polycystic ovary syndrome (n=62,900.33; 3 sites)
    4. saddle embolus of pulmonary artery with acute cor pulmonale (n=22,324.40; 10 sites)
    5. adjustment disorder with mixed anxiety and depressed mood (n=18,453; 1 site)

    Domain expert review of these concepts found that they were likely missing as a result of being infrequently diagnosed in pediatric populations.
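
The per-concept summaries reported above (number of sites, mean frequency, and a top-five ranking) can be sketched in Python as follows. How the study aggregated frequencies is not stated, so averaging each concept over the sites that report it is an assumption, and the function name `summarize_missing`, the site labels, and the counts are all hypothetical.

```python
from statistics import mean


def summarize_missing(site_counts, top_n=5):
    """site_counts maps concept name -> {site: count}. Returns the
    average number of sites per concept, the mean of the per-concept
    mean frequencies, and the `top_n` concepts by mean frequency."""
    per_concept = {
        name: (mean(by_site.values()), len(by_site))
        for name, by_site in site_counts.items()
    }
    avg_sites = mean(n_sites for _, n_sites in per_concept.values())
    mean_frequency = mean(freq for freq, _ in per_concept.values())
    ranked = sorted(per_concept.items(), key=lambda item: item[1][0], reverse=True)
    return avg_sites, mean_frequency, ranked[:top_n]


# Toy counts for three placeholder concepts.
site_counts = {
    "concept_a": {"site_1": 100, "site_2": 300},
    "concept_b": {"site_1": 1000},
    "concept_c": {"site_2": 100, "site_3": 100, "site_4": 400},
}
avg_sites, mean_frequency, top = summarize_missing(site_counts, top_n=2)
print(avg_sites, round(mean_frequency, 2), [name for name, _ in top])
```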

Database Indices - 1: Ajou University Database; 2: IQVIA US Ambulatory Electronic Medical Record; 3: IQVIA Longitudinal Patient Data Australia; 4: IQVIA Disease Analyzer France; 5: IQVIA Disease Analyzer Germany; 6: The Healthcare Cost and Utilization Project Nationwide Inpatient Sample; 7: IQVIA US Hospital Charge Data Master; 8: IBM MarketScan Commercial Database; 9: IBM MarketScan Multi-State Medicaid Database; 10: IBM MarketScan Medicare Supplemental Database; 11: Japan Medical Data Center database; 12: Medical Information Mart for Intensive Care III; 13: Korea National Health Insurance Service/National Sample Cohort; 14: Optum De-Identified Clinformatics Data-Mart-Database—Date of Death; 15: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 16: Optum De-identified Electronic Health Record Dataset; 17: IQVIA US LRxDx Open Claims; 18: Premier Healthcare Database; 19: University of Southern California PScanner; 20: Stanford Medicine Research Data Repository; 21: Tufts Medical Center Database; 22: University of Colorado Anschutz Medical Campus Health Group; 23: Australian Electronic Practice-based Research Network; 24: Columbia University Medical Center Database

Figure 2. OMOP2OBO Coverage of Condition Concepts by Concept Prevalence Site

(A) Across the Concept Prevalence study sites, coverage ranged from 93.04-99.69%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 107 of the 276 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.



DRUG EXPOSURE INGREDIENTS


The OHDSI Concept Prevalence data contained 4,588 unique OMOP vocabulary concepts from 18 sites. The OMOP vocabulary concepts from each of these sites were compared to the list of concepts from the OMOP2OBO mappings. After filtering the OMOP2OBO mappings to remove all entries where all ontologies were "NONE" or "NOT YET MAPPED" and all non-standard concepts, 8,615 concepts remained eligible for use in the coverage study. These data were utilized for all drug ingredient coverage experiments.

The OMOP2OBO drug ingredient set contained 8,615 OMOP drug ingredient concept ids, which covered 87.99% (weighted coverage: 99.92%) of the 4,588 Concept Prevalence drug ingredient concepts. There were 4,578 OMOP2OBO concepts that were not included in the Concept Prevalence set and 551 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (B):

  • Overlap: 4,037 OMOP2OBO concepts (1,639 Concepts Used in Practice, 2,398 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 8,071.59 times (100.0-125,634,570.39).
  • OMOP2OBO Only: 4,578 OMOP2OBO concepts (58 Concepts Used in Practice, 5,520 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 468.89 times (100.0-69,311.0).
  • Concept Prevalence Only: 551 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 801.2 times (100.0-1,795,364.83).

Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO drug exposure ingredient concepts for each Concept Prevalence study site (Figure 3). Across the Concept Prevalence study sites, coverage ranged from 91.23-98.35%. A Chi-Square test of independence with Yates' correction revealed a significant association between the database and coverage (χ²(17) = 195.640, p<0.0001). In order to better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 53 of the 153 database comparisons had significantly different OMOP2OBO coverage (ps<0.001).

Error Analysis
Results are visualized in Figure 1 (E). Of the 551 concepts not covered by OMOP2OBO, five could be accounted for by a newer version of the OMOP CDM (i.e., Error Analysis Concepts), 456 could be accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 90 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below.

  • Error Analysis Concepts: A total of five (0.91%) missing concepts were accounted for using the current version of the OMOP CDM using the OMOP concept_relationship table. These concepts occurred in an average of 8.4 Concept Prevalence study sites and had a mean frequency of 51,732.04 (100-221,229.71). The five missing concepts could be traced to six source_concept_ids in the original OMOP2OBO map set using the Maps to (n=6) relationship.

  • Excluded Concepts: A total of 456 (82.76%) OMOP concepts could be found in the set of data which were initially filtered from the original OMOP2OBO mapping set. These concepts occurred in an average of 3.88 Concept Prevalence study sites and had a mean frequency of 18,847.28 (100-1,077,258.9). These concepts were initially excluded for one of the following reasons:

    • 456 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with CHEBI, PRO, NCBITaxon, and VO type "NOT YET MAPPED"
  • Truly Missing Concepts: A total of 90 (16.33%) OMOP concepts were truly missing. These concepts occurred in an average of 2.66 Concept Prevalence study sites and had a mean frequency of 3,361.15 (100-175,551.29). The top five most frequently occurring missing concepts were (with average frequency across the 18 sites and number of sites with concept):

    1. hepatitis A virus strain CR 326F antigen, inactivated (n=175,551.29; 14 sites)
    2. erenumab (n=60,618; 10 sites)
    3. fremanezumab (n=15,579.60; 5 sites)
    4. galcanezumab (n=11,594.80; 5 sites)
    5. baloxavir marboxil (n=11,366.68; 3 sites)

    Domain expert review of these concepts found that they were likely missing as a result of hospital vendor differences or were new high-risk biologics whose safety and efficacy had not yet been tested or confirmed in pediatric populations.

Database Indices - 1: IQVIA US Ambulatory Electronic Medical Record; 2: IQVIA Longitudinal Patient Data Australia; 3: IQVIA Disease Analyzer Germany; 4: IQVIA US Hospital Charge Data Master; 5: IBM MarketScan Commercial Database; 6: IBM MarketScan Multi-State Medicaid Database; 7: IBM MarketScan Medicare Supplemental Database; 8: Japan Medical Data Center database; 9: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 10: Optum De-identified Electronic Health Record Dataset; 11: Optum De-identified Electronic Health Record Dataset; 12: Premier Healthcare Database; 13: University of Southern California PScanner; 14: Stanford Medicine Research Data Repository; 15: Tufts Medical Center Database; 16: University of Colorado Anschutz Medical Campus Health Group; 17: Australian Electronic Practice-based Research Network; 18: Columbia University Medical Center Database.

Figure 3. OMOP2OBO Coverage of Drug Exposure Ingredient Concepts by Concept Prevalence Site

(A) Across the Concept Prevalence study sites, coverage ranged from 91.23-98.35%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 53 of the 153 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.



MEASUREMENTS


The OHDSI Concept Prevalence data contained 23,513 unique OMOP vocabulary concepts from 18 sites. The OMOP vocabulary concepts from each of these sites were compared to the list of concepts from the OMOP2OBO mappings. After filtering the OMOP2OBO mappings to remove all entries where all ontologies were "NONE", "UNSPECIFIED SAMPLE" or "UNMAPPED TEST TYPE" and all non-standard concepts, 3,827 concepts (10,673 lab test results) remained eligible for use in the coverage study. These data were utilized for all measurement result coverage experiments.

The OMOP2OBO measurement result set contained 3,827 OMOP measurement concept ids (10,673 lab test results), which covered 11.14% (weighted coverage: 67.72%) of the 23,513 Concept Prevalence concepts. There were 1,207 OMOP2OBO concepts that were not included in the Concept Prevalence set and 20,893 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (C):

  • Overlap: 2,620 OMOP2OBO concepts (1,393 Concepts Used in Practice, 1,207 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 3,072.33 times (100.0-183,333,482.38).
  • OMOP2OBO Only: 1,207 OMOP2OBO concepts (42 Concepts Used in Practice, 1,164 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 346.92 times (100.0-,842,485.0).
  • Concept Prevalence Only: 20,893 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 669.55 times (100.0-1,219,846,862.0).

Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO measurement concepts for each Concept Prevalence study site (Figure 4). Across the Concept Prevalence study sites, coverage ranged from 4.22-75%. A Chi-Square test of independence with Yates' correction revealed a significant association between the database and coverage (p<0.0001). In order to better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 93 of the 153 database comparisons had significantly different OMOP2OBO coverage (ps<0.001).

Error Analysis
Results are visualized in Figure 1 (F). Of the 20,893 concepts not covered by OMOP2OBO, 13 could be accounted for by the current version of the OMOP CDM (i.e., Error Analysis Concepts), 158 were accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 20,722 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below:

  • Error Analysis Concepts: A total of 13 (0.06%) missing concepts could be accounted for by a newer version of the OMOP CDM by tracing their original concept id to their new concept id using the OMOP concept_relationship table. These concepts occurred in an average of 3.23 Concept Prevalence study sites and had a mean frequency of 9,836.25 (100-29,098.2). The 13 missing concepts could be traced to 13 source_concept_ids in the original OMOP2OBO map set using the following relationship_ids: Maps to (n=2) and Concept replaced by (n=11).
  • Excluded Concepts: A total of 158 (0.76%) OMOP concepts could be found in the set of data which was initially filtered from the original OMOP2OBO data source. These concepts occurred in an average of 5.18 Concept Prevalence study sites and had a mean frequency of 282,115.28 (100-14,317,951.9). These concepts were initially excluded for one of the following reasons:
    • 76 OMOP concepts had an "UNSPECIFIED SAMPLE"
    • 79 OMOP concepts had an "UNMAPPED TEST TYPE"
    • 3 OMOP concepts were unable to be mapped to an ontology
  • Truly Missing Concepts: A total of 20,722 (99.18%) missing concepts were truly missing and unable to be accounted for by a current data source. These concepts occurred in an average of 2.82 Concept Prevalence study sites and had a mean frequency of 218,874.03 (100-1,219,846,862). The top five most frequently occurring missing concepts were (with average frequency across the 18 sites and number of sites with concept):
    1. pulse intensity of unspecified artery palpation (n=1,219,846,862; 1 site)
    2. penicillin g potassium [mass] of dose (n=253,609,945; 1 site)
    3. sodium [moles/volume] in saliva (oral fluid) (n=246,641,211; 1 site)
    4. cotinine/creatinine [mass ratio] in urine (n=246,063,202; 1 site)
    5. chloride [moles/volume] in saliva (oral fluid) (n=234,931,483; 1 site)

    Domain expert review of these concepts confirmed that the missing concepts were likely due to inconsistencies in the use of LOINC. This finding is consistent with what has been observed in the literature (PMID:22306382).

Database Indices - 1: IQVIA US Ambulatory Electronic Medical Record; 2: IQVIA Longitudinal Patient Data Australia; 3: IQVIA Disease Analyzer France; 4: IQVIA Disease Analyzer Germany; 5: IBM MarketScan Commercial Database; 6: IBM MarketScan Medicare Supplemental Database; 7: Japan Medical Data Center database; 8: Medical Information Mart for Intensive Care III; 9: Korea National Health Insurance Service/National Sample Cohort; 10: Optum De-Identified Clinformatics Data-Mart-Database—Date of Death; 11: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 12: Optum De-identified Electronic Health Record Dataset; 13: Premier Healthcare Database; 14: University of Southern California PScanner; 15: Stanford Medicine Research Data Repository; 16: University of Colorado Anschutz Medical Campus Health Group; 17: Australian Electronic Practice-based Research Network; 18: Columbia University Medical Center Database.

Figure 4. OMOP2OBO Coverage of Measurement Concepts by Concept Prevalence Site

(A) Across the Concept Prevalence study sites, coverage ranged from 4.22-75%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 93 of the 153 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.