Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding how microbiology laboratory data is stored using the sample data from the Leicester OMOP extract #45

Open
2 of 3 tasks
AngharadGreen opened this issue Nov 19, 2024 · 9 comments
Assignees

Comments

@AngharadGreen
Copy link

AngharadGreen commented Nov 19, 2024

Here I want to:

  • Explore the different tables from the extract to identify which ones contain the relevant data
  • Match concept table to OMOP tables that contain the relevant microbiology data
  • Use this information to complete issue Map OMOP data to inpatient_microbiology table #29
@AngharadGreen
Copy link
Author

AngharadGreen commented Nov 19, 2024

Explore the different tables from the extract to identify which ones contain the relevant data and matching concept table to tables that contain the relevant microbiology data

First look at the following tables to identify microbiology data:

  • Concept
  • Condition_Occurence
  • Specimen
  • Observation
  • Measurement
  • Visit_Occurrence
  • Provider

I read in these four tables as data frames into R:

library(arrow)

concept_df <- read_parquet("concept.parquet")
condition_df <- read_parquet("CONDITION_OCCURRENCE.parquet")
specimen_df <- read_parquet("SPECIMEN.parquet")
observation_df <- read_parquet("OBSERVATION.parquet")
measurement_df <- read_parquet("MEASUREMENT.parquet")
visit_occurence_df <- read_parquet("VISIT_OCCURRENCE.parquet")
provider_df<- read_parquet("VISIT_OCCURRENCE.parquet")

Looking at structure of the concept table - when filtering the concept_name column based on the search term microbiology, this column contains useful information about corresponds to the result of a microbiology lab result and each concept_name has an associated concept_id and a concept_code. There is also a domain_id column that indicates the OMOP table this information would be linked to.

Currently filtering information from the OMOP data frames based on concept IDs related to Microbiology - using E.coli as an example to understand how microbiology lab test results are stored across the different OMOP tables

@AngharadGreen
Copy link
Author

AngharadGreen commented Nov 27, 2024

Installing Ramses using UCLH remote environment

  • Difficult to install Ramses using R-Studio in the DSD environment, due to the older versions of R and R-Studio that were available and not able to access the internet through R to install Ramses via Git-Hub with devtools.
  • Access to the web login version of R-studio was granted which runs R version 4.4.1.
  • Ramses was installed using devtools::install_github("ramses-antibiotics/ramses-package") from the main branch in GitHub, but the mock Ramses database could not be created. This version of Ramses was removed.
  • Followed this issue on Github to Clone the Ramses repo from the developer branch using: devtools::install_github("ramses-antibiotics/ramses-package", ref = "develop")
  • This resulted in an error during installation so installed the package locally from source using this link to download the install file.
  • I then uploaded this file to my web login R studio environment and installed as follows:
    > install.packages("~/OMOP_RAMSES/ramses-antibiotics-ramses-package-v0.7.1-9-g19bea96.tar.gz", repos = NULL, type="source")

@AngharadGreen
Copy link
Author

AngharadGreen commented Nov 27, 2024

Exploring the OMOP extract tables to identify how microbiology lab result data is stored using the web log in version of R-Studio

library(arrow) ## to read in the parquet files
library(dplyr) ## to use the select and unique fundtions

## Read in the OMOP extract tables as data frames
df_concept <- read_parquet("~/data/omop-metadata/data/concept.parquet")
df_measurement <- read_parquet("~/leicester/MEASUREMENT.parquet")
df_condition_occurrence <- read_parquet("~/leicester/CONDITION_OCCURRENCE.parquet")
df_specimen <- read_parquet("~/leicester/SPECIMEN.parquet")
df_visit_occurence <- read_parquet("~/leicester/VISIT_OCCURRENCE.parquet")
df_provider <- read_parquet("~/leicester/PROVIDER.parquet")
df_observation <- read_parquet("~/leicester/OBSERVATION.parquet") ## no data in the OBSERVATION.parquet table provided in extract will look at OBSERVATION_PERIOD.parquet
df_observation_period <- read_parquet("~/leicester/OBSERVATION_PERIOD.parquet")

### Check for microbiology related concept_ids within each table 

## Measurement
# Merge the concept table with the measurement table so only concept ids that are in measurement table are output
Merged_concept_measurement <- merge(x=df_concept, y=df_measurement, by.x = "concept_id", by.y = "measurement_concept_id")

# The resulting merged dataframe is large and the same concept_id is repeated with different measurement IDs, I want to just pull out the unique concept_Ids and concept_name columns from this merged DF to easily identify microbiology related concepts

# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_measurements <- select(Merged_concept_measurement, "concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")

# Pull out the unique concept_id and concept_name columns that appear in the measurements_df
Unique_concepts_in_measurements <- unique(Concepts_in_measurements)

## Condition_occurrence
# Merge the concept table with the condition_occurence table so only concept ids that are in the condition_occurence table are output
Merged_concept_condition <- merge(x=df_concept , y=df_condition_occurrence, by.x = "concept_id", by.y = "condition_concept_id")
# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_condition <- select(Merged_concept_condition, "concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")
# Pull out the unique concept_id and concept_name columns that appear in the condition_occurence_df
Unique_concepts_in_condition <- unique(Concepts_in_condition) # lots of different bacterial isolates can be found 


## Specimen
# Merge the concept table with the specimen table so only concept ids that are in the specimen table are output
Merged_concept_specimen <- merge(x=df_concept , y=df_specimen, by.x = "concept_id", by.y = "specimen_concept_id")
# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_specimen <- select(Merged_concept_specimen, "concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")
# Pull out the unique concept_id and concept_name columns that appear in the specimen_df
Unique_concepts_in_specimen <- unique(Concepts_in_specimen)

## Visit_Occurrence
# Merge the concept table with the visit_occurence table so only concept ids that are in the visit_occurence table are output
Merge_concept_visit <- merge(x=df_concept , y=df_visit_occurence, by.x = "concept_id", by.y = "visit_concept_id")
# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_visit <- select(Merge_concept_visit,"concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")
# Pull out the unique concept_id and concept_name columns that appear in the condition_occurence_df
Unique_concepts_in_visit <- unique(Concepts_in_visit) 

## Provider 
# Merge the concept table with the provider table so only concept ids that are in the provider table are output
Merged_concept_provider <- merge(x=df_concept , y=df_provider, by.x = "concept_id", by.y = "specialty_source_concept_id")
# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_provider <- select(Merged_concept_provider, "concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")
# Pull out the unique concept_id and concept_name columns that appear in the condition_occurence_df
Unique_concepts_in_provider <- unique(Concepts_in_provider) 

## Observation - no data in current OBSERVATION.parquet table from the extract, have asked for a table with data

## Observation_Period 
# Merge the concept table with the Observation_period table so only concept ids that are in the observation period table are output
Merged_concept_observation_period <- merge(x=df_concept , y=df_observation_period, by.x = "concept_id", by.y = "period_type_concept_id")
# Pull out concept_id and concept_name columns from the merged data frame
Concepts_in_observation_period <- select(Merged_concept_observation_period, "concept_id", "concept_name", "vocabulary_id", "concept_code", "concept_class_id", "domain_id")
# Pull out the unique concept_id and concept_name columns that appear in the condition_occurence_df
Unique_concepts_in_observation_period <- unique(Concepts_in_observation_period) 

  • In the concept table under concept_class_id is a category called organism where names of individual microorganisms are stored under concept_name with the vocabulary_id SNOMED and within the domain_id column they are stored under the observation table. Currently no data is stored in the observation table from the Leicester OMOP extract - this has been requested as this table will be important to pull out information related to different microorganisms identified from lab tests.
  • Measurement contains information related to Observable Entity, Procedure, Staging / Scales, Lab Test and Clinical Observation - the unique concept names from this table are not related to microbiology test results.
  • Condition_Occurrence does contain information related to the condition caused by a microorganism with the vocabulary_id SNOMED that is linked to a concept_code
  • Specimen contains information the type of specimen collected under concept_name with a SNOMED vocabolary_id
  • Visit_occurrence contains information about where the visit took place with options Emergency Room and Inpatient Visit, Inpatient Visit or Emergency Room Visit.
  • Provider contains information related to the profession of the medical person under concept_name with a SNOMED vocabulary_id
  • Observation_period contains information related to electronic health records (EHR)

@AngharadGreen
Copy link
Author

AngharadGreen commented Nov 27, 2024

Understanding how the information about identified microorganisms are stored within the OMOP extract tables and how this related to the Ramses Inpatient_microbiology tables

In Ramses the microbiology isolate and susceptibility information is stored in three separate data frames:

  1. specimens - contains one row per microbial sample sent to the laboratory
  2. isolates - contains zero, one, or multiple rows for every specimen – as many rows as the number of microorganisms isolated from the specimen, whatever the technique (eg mass spectrometry, PCR, bacterial culture)
  3. susceptibility - contains results of susceptibility testing: one row per substance tested.

Ramses has a code to obtain this information and produce these three data frames by using an intermediate a table called inpatient_microbiology. My aim is to re-create this table from the OMOP extract table so it can be input into Ramses using the existing code, the table has the following structure:
'data.frame': 38 obs. of 10 variables:
$ patient_id : chr "99999999999" "99999999999" "99999999999" "99999999999" ...
$ specimen_id : chr "CCC2" "CCC1" "BBB8" "BBB7" ...
$ status : chr NA NA NA NA ...
$ specimen_type_display: chr "MRSA Screen" "Blood Culture" "MRSA Screen" "Faeces" ...
$ specimen_datetime : POSIXct, format: "2017-02-14" "2017-02-13" ...
$ organism_display_name: chr "No growth" "No growth" "No growth" "No growth" ...
$ isolate_id : chr "2300967" "2365356" "2284807" "2392702" ...
$ agent_display_name : chr NA NA NA NA ...
$ rsi_code : chr NA NA NA NA ...
$ isolation_datetime : POSIXct, format: "2017-02-17" "2017-02-16" ...

I want to explore the concept_id table to understand how microbial isolate information is stored.

  • So far we've identified in the concept table under concept_class_id is a category called organism where names of individual microorganisms are stored under concept_name with the vocabulary_id SNOMED. The coencept_code column corresponds to the SNOMED codes.
  • A list of the SNOMED microorganism codes were found here - https://www.cdc.gov/nhsn/xls/master-organism-com-commensals-lists.xlsx, these were saved into a csv file and uploaded to the R environment within the web login R-Studio.
## Read in the concept OMOP tables as a data frame
df_concept <- read_parquet("~/data/omop-metadata/data/concept.parquet")

## Read in the SNOMED organisms codes as a data frame
SNOMED_Organisms_Code <-read.csv("~/OMOP_RAMSES/SNOMED_Organisms.csv", header = TRUE, sep = ",")   

## Filter the concept table to just show the information for the organisms identified via the SNOMED codes

library(tidyverse)

SNOMED_Organisms_in_concept<- filter(df_concept, vocabulary_id == 'SNOMED',
                     concept_code %in% SNOMED_Organisms_Code$SNOMED_Code)

View(SNOMED_Orgnisms_in_concept)

## Check if any other type of domaon appears in  the domain_id column
> table(SNOMED_Organisms_in_concept['domain_id'])
domain_id
Observation 
       2274 
  • This confirms that the organism name is stored with the OMOP observation table as the only domain_id that has been filtered is Observation.
  • This is important as the Ramses inpatient_microbiology table will only accpets microorganisms names in the SNOMED format to input into the AMR package for the organism_display_name column

Just to confirm this I also ran the code:

## Filter based on concept_class_id == 'Organism'
>Organism_from_concept_class_id_in_concept<- filter(df_concept, concept_class_id == 'Organism')

> str(Organism_from_concept_class_id_in_concept)

tibble [41,723 × 10] (S3: tbl_df/tbl/data.frame)
 $ concept_id      : int [1:41723] 40279766 40278707 40278715 40279267 40271947 40289635 40386796 40453827 40447657 40449283 ...
 $ concept_name    : chr [1:41723] "Dendroaspis viridis" "Porphyromonas assacharolytica" "Porphyromonas endodontalis" "Salmonella Gwoza" ...
 $ domain_id       : chr [1:41723] "Observation" "Observation" "Observation" "Observation" ...
 $ vocabulary_id   : chr [1:41723] "SNOMED" "SNOMED" "SNOMED" "SNOMED" ...
 $ concept_class_id: chr [1:41723] "Organism" "Organism" "Organism" "Organism" ...
 $ standard_concept: chr [1:41723] NA NA NA NA ...
 $ concept_code    : chr [1:41723] "112550001" "114155004" "114162008" "114297003" ...
 $ valid_start_date: Date[1:41723], format: "2002-01-31" "2002-01-31" ...
 $ valid_end_date  : Date[1:41723], format: "2002-01-31" "2002-01-31" ...
 $ invalid_reason  : chr [1:41723] "U" "U" "U" "U" ...

## Check to see if any other type of OMOP table appears in the column domain_id
> table(Organism_from_concept_class_id_in_concept['domain_id'])
domain_id
Observation 
      41723 ## This confirms only Observation appears here as there are 41723 rows and all contain Observation as the domain_id


Mapping the Ramses inpatient_microbiology table to the OMOP extract

Ramses OMOP
patient_id person_id (Observation table)
specimen_id specimen_id (Specimen table)
status All entries are NA so will ignore
specimen_type_display concept_name_specimen (Concept table)
specimen_datetime specimen_datetime (Specimen table)
organism_display_name concept_name_observation (Concept table)
isolate_id observation_id (Observation Table)
agent_display_name Still need to look into this
rsi_code Still need to look into this
isolation_datetime observation_datetime (Observastion table)

As a next step and while I wait for the OMOP extract observation table I will make a synthetic observation table with a few rows of microbiology isolate data and link it to the information in the other OMOP extract tables (Specimen, concept) and try to produce the in_patient_microbiology table to check that the Ramses validate code works with a table other than from the Ramses mock database, I will also look into how the rsi_code and agent_display_name can be obtained form the OMOP extract

@AngharadGreen
Copy link
Author

AngharadGreen commented Nov 28, 2024

Mapping OMOP to Ramses Inpatient_microbiology table using mock observation table

library(arrow) ## to read in the parquet files

##Read in OMOP tables as data frames
df_observation <- read.csv("Mock_Observation_Table.csv")
df_concept <- read_parquet("~/data/omop-metadata/data/concept.parquet")
df_specimen <- read_parquet("~/leicester/SPECIMEN.parquet")

## I have made a mock observation table that contains information matched from concept and specimen table from the OMOP extract.  
# First I am going to join this with the concept table based on concept ID
Merged_concept_observation <- merge(x=df_concept, y=df_observation, by.x = "concept_id", by.y = "observation_concept_id")
View(Merged_concept_observation)

## I then want to just filter the merged table to just show the information for the organisms identified via the SNOMED codes
# Read in the SNOMED organisms codes as a data frame
SNOMED_Organisms_Code <-read.csv("~/OMOP_RAMSES/SNOMED_Organisms.csv", header = TRUE, sep = ",")   

library(tidyverse)

Observation_SNOMED_Microorganisms <- filter(Merged_concept_observation, vocabulary_id == 'SNOMED',
                                     concept_code %in% SNOMED_Organisms_Code$SNOMED_Code)
View(Observation_SNOMED_Microorganisms) ## This has filtered out just the SNOMED microorganisms 

## I now want to merge the specimen table with the concept data frame via the concept_id
Merged_concept_specimen <- merge(x=df_concept , y=df_specimen, by.x = "concept_id", by.y = "specimen_concept_id")

## I now want to join the Merged_concept_specimen data frame with the Observation_SNOMED_Microorganisms dataframe on the person_id

Merged_DF <- merge(x=Merged_concept_specimen , y=Observation_SNOMED_Microorganisms, by.x = "person_id", by.y = "person_id")

View(Merged_DF)
  • When the Merged_DF is viewed, there are numerous entries for each of the person_id as different specimens have been collected from the same patient
  • I need to find a way of matching specimen_id to the observation table so the correct specimen information is pulled out to match the microbiology test result data.

@AngharadGreen
Copy link
Author

AngharadGreen commented Dec 2, 2024

Exploring the new OMOP extract that contains the observation table

  • Met with Alan Smith on 28 November and he sent over a new set of OMOP extract parquet files that contain the observation table
  • The new set of parquet files has been loaded into the R-Studio web log in portal
  • I will now explore the observation table to see if I can pull out bacterial isolate information
library(arrow) ## to read in the parquet files

##Read in OMOP tables as data frames
df_observation <- read_parquet("~/Ramses_extract_Nov2024/OBSERVATION.parquet")
df_concept <- read_parquet("~/data/omop-metadata/data/concept.parquet")
df_specimen <- read_parquet("~/Ramses_extract_Nov2024/SPECIMEN.parquet")

# First I am going to join the new observation table with the concept table based on concept ID
Merged_concept_observation <- merge(x=df_concept, y=df_observation, by.x = "concept_id", by.y = "observation_concept_id")
View(Merged_concept_observation)

## I then want to just filter the merged table to just show the information for the organisms identified via the SNOMED codes
# Read in the SNOMED organisms codes as a data frame
SNOMED_Organisms_Code <-read.csv("~/OMOP_RAMSES/SNOMED_Organisms.csv", header = TRUE, sep = ",")   

library(tidyverse)

Observation_SNOMED_Microorganisms <- filter(Merged_concept_observation, vocabulary_id == 'SNOMED',
                                     concept_code %in% SNOMED_Organisms_Code$SNOMED_Code)
View(Observation_SNOMED_Microorganisms) ## This has successfully filtered out just the SNOMED microorganisms that are within the new observation table

> colnames(Observation_SNOMED_Microorganisms)
 [1] "concept_id"                    "concept_name"                 
 [3] "domain_id"                     "vocabulary_id"                
 [5] "concept_class_id"              "standard_concept"             
 [7] "concept_code"                  "valid_start_date"             
 [9] "valid_end_date"                "invalid_reason"               
[11] "observation_id"                "person_id"                    
[13] "observation_date"              "observation_type_concept_id"  
[15] "observation_datetime"          "value_as_number"              
[17] "value_as_string"               "value_as_concept_id"          
[19] "qualifier_concept_id"          "unit_concept_id"              
[21] "provider_id"                   "visit_occurrence_id"          
[23] "visit_detail_id"               "observation_source_value"     
[25] "observation_source_concept_id" "unit_source_value"            
[27] "qualifier_source_value"    
  • Within this merged data frame, the column concept_name contains the information on the name of the bacteria detected from a microbiology laboratory test - this can be linked to the organism_display_name within the Ramses inpatient_microbiology table
  • I will now explore how to link the specimen table with the observation table so each individual specimen_id is matched to an individual observation_id so we can determine the specimen where each organism_display_name was identified and display this in the format of the inpatient_microbiology table

@AngharadGreen
Copy link
Author

AngharadGreen commented Dec 5, 2024

Linking the specimen_table with the observation_table based on the fact_relationship table

This forum discussion is also very helpful and focuses on the FACT_RELATIONSHIP OMOP table which contains records about the relationships between facts stored as records in any table of the CDM. - https://forums.ohdsi.org/t/link-between-specimen-and-measurement/4093, I will explore this table to see if it can link the specimen table with the observation table through the fact_relationship table

The FACT_RELATIONSHIP table contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain (table), or different domains.
Field Required Type Description
domain_concept_id_1 The concept representing the domain of fact one, from
which the corresponding table can be inferred.
fact_id_1 The unique identifier in the table corresponding to the domain
of fact one.
domain_concept_id_2 The concept representing the domain of fact two, from
which the corresponding table can be inferred.
fact_id_2 The unique identifier in the table corresponding to the domain
of fact two.
relationship_concept_id A foreign key to a Standard Concept ID of
relationship in the Standardized Vocabularies.

## Lets look at the structure of the fact_relationship table in the updated OMOP extract

df_fact_relationship <- read_parquet("~/Ramses_extract_Nov2024/FACT_RELATIONSHIP.parquet")

> colnames(df_fact_relationship)
[1] "domain_concept_id_1"     "domain_concept_id_2"     "fact_id_1"              
[4] "fact_id_2"               "relationship_concept_id"

## Lets pull out the id numbers within the column domain_concept_id
> unique(df_fact_relationship$domain_concept_id_1)
[1] 1147333 1147330 1147304 1147305 1147306

> unique(df_fact_relationship$domain_concept_id_2)
[1] 1147330 1147333 1147304 1147305 1147306

I now need to match the column domain_concept_id to the domains they relate to, this can be found here https://athena.ohdsi.org/search-terms/terms/1147340

domain_concept_id domain
1147333 condition_occurrence
1147330 measurement
1147304 observation
1147305 device_exposure
1147306 specimen

I will now pull out the information that is just related to the observation (1147304) and specimen (1147306) tables

## I first tried to filter out the rows where domain_concept_id_1 =1147304 and "domain_concept_id_2" = 1147306, but that produced a data frame with no data = 

> fact_relationship_observation_specimen <- df_fact_relationship[df_fact_relationship$domain_concept_id_1 == "1147304" & df_fact_relationship$domain_concept_id_2 == "1147306", ]

> str(fact_relationship_observation_specimen)
tibble [0 × 5] (S3: tbl_df/tbl/data.frame)
 $ domain_concept_id_1    : int(0) 
 $ domain_concept_id_2    : int(0) 
 $ fact_id_1              : int(0) 
 $ fact_id_2              : int(0) 
 $ relationship_concept_id: int(0) 

## So next I filtered just based on domain_concept_id_1 with the observation id and then checked the unique data in column 2 
> fact_relationship_observation <- df_fact_relationship[df_fact_relationship$domain_concept_id_1 == "1147304", ]
> unique(fact_relationship_observation$domain_concept_id_2)
[1] 1147333 ## This id matches to condition_occurrence

## I then filtered  based on domain_concept_id_1 with the specimen id and then checked the unique data in column 2 
> fact_relationship_specimen <- df_fact_relationship[df_fact_relationship$domain_concept_id_1 == "1147306", ]
> unique(fact_relationship_specimen$domain_concept_id_2)
[1] 1147330  ##This id relates to measurement

##Just to be sure I also filtered based on domain_concept_id_2 with the observation id and then checked the unique data in column 1 

> fact_relationship_observation_id2 <- df_fact_relationship[df_fact_relationship$domain_concept_id_2 == "1147304", ]## This id matches to condition_occurrence
> unique(fact_relationship_observation_id2$domain_concept_id_1)
[1] 1147333 ##This id relates to condition_occurrence

##Just to be sure I also filtered based on domain_concept_id_2 with the specimen id and then checked the unique data in column 1 
> fact_relationship_specimen_id2 <- df_fact_relationship[df_fact_relationship$domain_concept_id_2 == "1147306", ]
> unique(fact_relationship_specimen_id2$domain_concept_id_1)
[1] 1147330 ##This id relates to measurement

From this analysis, the fact_relationship_table observation can only be linked to condition_occurrence and specimen can only be linked to measurement

@AngharadGreen
Copy link
Author

AngharadGreen commented Dec 5, 2024

Linking the SPECIMEN and OBSERVATION tables
Lets explore the relationship between observation and specimen

library(arrow) ## to read in the parquet files

## Lets read in the required tables from the new OMOP extract data and look at the column names to find a link
df_observation <- read_parquet("~/Ramses_extract_Nov2024/OBSERVATION.parquet")
df_specimen <- read_parquet("~/Ramses_extract_Nov2024/SPECIMEN.parquet")
colnames(df_observation)
colnames(df_specimen)

These are the data frame column headings that match across the tables:
image

  • The columns that are shared include person_id, unit_concept_id and unit_source_value
  • When viewing the tables, unit_concept_id and unit_source_value contain NA values in both the observation and specimen tables.
  • I am going to try to merge the observation, specimen and concept table by pulling out just the person_id and observation_id that are filtered out based on microorganisms, I will then pull out just the specimen information for those person_id where a microorganism has been identified.
library(arrow) ## to read in the parquet files

## Read in the required tables 
df_observation <- read_parquet("~/Ramses_extract_Nov2024/OBSERVATION.parquet")
df_specimen <- read_parquet("~/Ramses_extract_Nov2024/SPECIMEN.parquet")
df_concept <- read_parquet("~/data/omop-metadata/data/concept.parquet")


## I now want to create the observation table showing just the microbial_isolates as described previously
Merged_concept_observation <- merge(x=df_observation, y=df_concept, by.x = "observation_concept_id", by.y = "concept_id", )

SNOMED_Organisms_Code <-read.csv("~/OMOP_RAMSES/SNOMED_Organisms.csv", header = TRUE, sep = ",")

library(tidyverse)

Observation_SNOMED_Microorganisms <- filter(Merged_concept_observation, vocabulary_id == 'SNOMED',
                                            concept_code %in% SNOMED_Organisms_Code$SNOMED_Code)

## As these data frames are large, I will remove any older ones from the R environment as I go along 
rm(Merged_concept_observation)

## I now want to pull out just the columns with the person_ids and observation_ids from Observation_SNOMED_Microorganisms

df_1 <- subset(Observation_SNOMED_Microorganisms, select = c("person_id", "observation_id"))

## Merge df_specimen with observation_Observation_SNOMED_Microorganisms_2 via person_id

df_2 <- merge(x=df_1, y=df_specimen, by.x = "person_id", by.y = "person_id" )

## As these data frames are large, I will remove any older ones from the R environment as I go along 
rm(df_1)


## Merge Observation_SNOMED_Microorganisms_3 with concept based on concept_id to retrieve the information about specimen type

df_3 <- merge(x=df_2, y=df_concept, by.x = "specimen_concept_id", by.y = "concept_id" )

rm(df_2)

## In this merge there are columns that just contain NA values so I want to remove these 

df_4 <- Filter(function(x)!all(is.na(x)), df_3) ## this reduces the columns from 25 to 18

rm(df_3)

# I only want the information relevant to specimen so I will check what the values are in the domain_id table
unique(df_4$domain_id) ##[1] "Metadata" "Specimen"

## I now want to remove any rows related to metadata in this data frame 

df_5 <- filter(df_4, domain_id == 'Specimen')

rm(df_4)

## I will now check if metadata occurs in the domain_id column of Observation_SNOMED_Microorganisms_6

unique(df_5$domain_id) ## [1] "Specimen

## I now want to rename the column called concept_name in this data frame to specimen_type_display as information in this column shows type of specimen collected
## I also want to rename the column concept_code to SNOMED_specimen_code as these are the SNOMED codes for specimen type
## I also want to rename valid_start_date	and valid_end_date to valid_start_date_specimen and valid_end_date_specimen so these can be distinguished in future merges

df_6 <- df_5%>% rename(specimen_type_display = concept_name, 
                       SNOMED_specimen_code = concept_code,
                       valid_start_date_specimen = valid_start_date, 
                       valid_end_date_specimen = valid_end_date)

rm(df_5)

## Before doing any further merging I want to tidy up the Observation_SNOMED_Microorganisms data frame

## Lets get rid of any columns that only contain NA values in Observation_SNOMED_Microorganisms

Observation_SNOMED_Microorganisms_1 <- Filter(function(x)!all(is.na(x)), Observation_SNOMED_Microorganisms) ## this reduces the columns from 27 to 20

rm(Observation_SNOMED_Microorganisms)

## lets rename the concept_name column in Observation_SNOMED_Microorganisms_1 to organism_display_name as it shows the information about the type of microbe identified

Observation_SNOMED_Microorganisms_2 <- Observation_SNOMED_Microorganisms_1%>% rename(organism_display_name = concept_name)

rm(Observation_SNOMED_Microorganisms_1)

## I only want to keep the relevant columns within the data frame 

str(Observation_SNOMED_Microorganisms_2)

Observation_SNOMED_Microorganisms_3 <- subset(Observation_SNOMED_Microorganisms_2, select = c("person_id","observation_date","observation_datetime","visit_occurrence_id","unit_source_value","organism_display_name","concept_code","valid_start_date","valid_end_date"))
                                      
rm(Observation_SNOMED_Microorganisms_2)

## I want to rename the column concept_code in Observation_SNOMED_Microorganisms_2 to SNOMED_microorganism_code as these are the SNOMED codes for microorganisms
## I also want to rename valid_start_date and valid_end_date to valid_start_date_observation and valid_end_date_observation so these can be distinguished in future merges

Observation_SNOMED_Microorganisms_4 <- Observation_SNOMED_Microorganisms_3%>% rename(SNOMED_microorganism_code = concept_code,
                                                                                     valid_start_date_observation = valid_start_date,
                                                                                     valid_end_date_observation = valid_end_date)
rm(Observation_SNOMED_Microorganisms_3)

## I will also tidy up df_6

str(df_6)

## I only want to keep the relevant columns in df_6

df_7 <- subset(df_6, select = c("specimen_concept_id","person_id","observation_id","specimen_date","specimen_id","specimen_type_concept_id","specimen_datetime","disease_status_concept_id","specimen_type_display","SNOMED_specimen_code","valid_start_date_specimen","valid_end_date_specimen"))

rm(df_6)

## I am now going to merge df_7 with Observation_SNOMED_Microorganisms_3 table based on person_id
Merged_df_1 <- merge(x=Observation_SNOMED_Microorganisms_4, y=df_7, by.x = "person_id", by.y = "person_id")

  • This has worked to produce a large data frame where we the organism_display_name column matches to a specimen_type_display along with individual observation_id and specimen_id for each row

@AngharadGreen
Copy link
Author

AngharadGreen commented Dec 12, 2024

To summarise this issue was used to explore how microbiology lab data was stored within the OMOP extract. Here is a summary of what was found:

  • In the concept table under concept_class_id is a category called organism where names of individual microorganisms are stored under concept_name with the vocabulary_id SNOMED and within the domain_id column they are stored under the observation table, the SNOMED organisms codes are also stored within this table.
  • The observation table stores information about the type of microorganisms that were identified - the observation_concept_id can be mapped back to the concept table to get the name of the microorganism
  • The observation table was then linked to the specimen table by first pulling out just the person_id and observation_id from the observation table that are filtered out based on microorganisms (with SNOMED vocabulary), I will then pull out just the specimen information for those person_id where a microorganism has been identified.
  • From this exploration the following mapping was identified between Ramses and OMOP for the inpatient_microbiology table:

Mapping the Ramses inpatient_microbiology table to the OMOP extract

Ramses OMOP
patient_id person_id (Observation table)
specimen_id specimen_id (Specimen table)
status All entries are NA in Ramses table so will not map this to OMOP
specimen_type_display concept_name_specimen (Concept table)
specimen_datetime specimen_datetime (Specimen table)
organism_display_name concept_name_observation (Concept table)
isolate_id observation_id (Observation Table)
agent_display_name this will be completed in #29
rsi_code this will be completed in #29
isolation_datetime observation_datetime (Observastion table)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant