
Cleaning of drug_exposure and concept tables extracted from bigQuery Database of Synthetic patient OMOP data #32

Closed
zsenousy opened this issue Oct 9, 2024 · 4 comments

zsenousy commented Oct 9, 2024

Issues were encountered while processing the data tables extracted from Synthetic Patient Data in OMOP:
1- Row indices were concatenated onto the values of the first column.
2- A parsing problem required further separation of text fields.

@zsenousy zsenousy self-assigned this Oct 9, 2024
@zsenousy zsenousy changed the title Cleaning process of drug_exposure and concept tables extracted from bigQuery Database of Synthetic patient OMOP data Cleaning of drug_exposure and concept tables extracted from bigQuery Database of Synthetic patient OMOP data Oct 9, 2024

zsenousy commented Oct 9, 2024

An R script has been developed that maps drug_exposure to RAMSES drug_prescriptions. Synthetic Patient Data in OMOP was used as the public OMOP dataset.

The following steps describe how to retrieve the chunks of the drug_exposure and concept tables required for mapping into RAMSES drug_prescriptions:

1- Load required libraries:

library(bigrquery)
library(DBI)
library(gargle)

2- Authenticate connection to BigQuery

sa_key_path <- "path/to/your/key_file"

bq_auth(path = sa_key_path)

project_id <- "your_project_id"

con <- dbConnect(bigquery(), project = "bigquery-public-data", dataset = "cms_synthetic_patient_data_omop", billing = project_id)

3- SQL queries for retrieving data from drug_exposure and concept tables:

sql_string1 <- "SELECT * FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` LIMIT 10000"
sql_string2 <- "SELECT * FROM `bigquery-public-data.cms_synthetic_patient_data_omop.concept` WHERE domain_id IN ('Drug', 'Route', 'Unit')"
result1 <- dbGetQuery(con, sql_string1)
result2 <- dbGetQuery(con, sql_string2)

4- Save data

# Note: write.table() keeps row names by default, which introduces the
# extra index column cleaned up in the steps below
write.table(result1, file = "path/to/your/drug_exposure_file.csv")
write.table(result2, file = "path/to/your/concept_file.csv")

Note: the concept table has been filtered to the domain_id values 'Drug', 'Route', and 'Unit'.

This is the process of extracting CSV files from the BigQuery database. The next comment shows the steps taken to clean the CSV files so that they are in the format expected by #25 (the R script mapping OMOP drug_exposure to RAMSES drug_prescriptions).
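For reference, the root cause of the index problem can be reproduced with a toy example: write.table() writes row names by default, so every data line gains a leading field that the header line does not have, which shifts the columns on re-read. The data frame below is hypothetical:

```r
# Hypothetical two-column frame written with write.table() defaults
df <- data.frame(a = c(10, 20), b = c("x", "y"))
tmp <- tempfile(fileext = ".csv")
write.table(df, file = tmp)                 # row.names = TRUE by default

lines <- readLines(tmp)
# The header line has 2 fields, but each data line has 3:
# the row names ("1", "2") are prepended to every data row.
length(strsplit(lines[1], " ")[[1]])        # 2 fields in the header
length(strsplit(lines[2], " ")[[1]])        # 3 fields in a data row
```

Passing row.names = FALSE to write.table() (or using readr::write_csv()) would avoid the extra column in future extracts.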


razekmh commented Oct 9, 2024

@zsenousy Would it be possible to start a branch for this based on the develop branch?


zsenousy commented Oct 9, 2024

Drug_exposure

Here is a detailed breakdown of each step to clean and save the data:

  1. Reading the data from CSV:
    The BigQuery data extract is loaded using the read_delim function from the readr package. The file is space-delimited, which is why the delimiter (" ") is defined explicitly in the call.

    drug_exposure <- read_delim("./path/to/your/file/uncleaned_drug_exposure.csv", delim = " ")
    • Outcome: The dataset contains 10,000 rows and 23 columns.
    • Important Note: read_delim() warns about possible parsing issues and suggests calling problems() to inspect them.
  2. Checking the structure of the data (str function):
    The str(drug_exposure) function is used to inspect the structure of the dataset, including column types. This helps identify issues such as columns being incorrectly parsed or having unexpected data types.

    str(drug_exposure)
  3. Removing the first column:
    The first column (often the index) is removed using drug_exposure[, -1]. This operation drops the first column of the data frame.

    drug_exposure <- drug_exposure[, -1]
    • Outcome: The data frame now has 22 columns instead of 23, and the first column is removed.
  4. Renaming columns:
    The colnames() function is used to rename the columns, using a predefined list of new names. This reassigns each name to its correct column, compensating for the shift introduced by the index column when the CSV file was read in step 1.

    colnames(drug_exposure) <- c(
      "drug_type_concept_id", "stop_reason", "refills", "quantity", 
      "days_supply", "sig", "route_concept_id", "lot_number", 
      "provider_id", "visit_occurrence_id", "visit_detail_id", 
      "drug_source_value", "drug_source_concept_id", "route_source_value", 
      "dose_unit_source_value", "drug_exposure_id", "person_id", 
      "drug_concept_id", "drug_exposure_start_date", 
      "drug_exposure_start_datetime", "drug_exposure_end_date", 
      "drug_exposure_end_datetime", "verbatim_end_date"
    )
    • Outcome: The columns are renamed. After renaming, you can inspect the structure again with str(drug_exposure) to verify the new column names.
  5. Adding an NA column for verbatim_end_date:
    Removing the first column and reassigning the column names in steps 3 and 4 left no column carrying the verbatim_end_date values. This step re-adds a verbatim_end_date column filled with NA values; the original column held incorrect or missing data, so NA is an acceptable replacement.

    drug_exposure$verbatim_end_date <- NA
    • Outcome: The verbatim_end_date column now contains only NA values.
  6. Fixing the drug_exposure_end_datetime column:
    The drug_exposure_end_datetime column has values like "NA NA", which should be replaced with actual NA values. This is done by identifying these instances and replacing them with NA.

    drug_exposure$drug_exposure_end_datetime[drug_exposure$drug_exposure_end_datetime == "NA NA"] <- NA
    • Outcome: The problematic "NA NA" strings are replaced with actual NA values in the drug_exposure_end_datetime column.
  7. Saving the cleaned data to CSV:
    Finally, after the data has been cleaned and modified, it is saved as a CSV file. The write_csv function from the readr package is used to save the data frame to the specified file path.

    write_csv(drug_exposure, "./path/to/your/file/cleaned_drug_exposure.csv")

Now, the dataset is clean and saved in a format that can be used for further analysis.
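The drop/rename/fix sequence in steps 3 to 6 can be sketched on a toy data frame (the column names and values below are hypothetical stand-ins for the mis-parsed extract):

```r
# Toy data frame standing in for the mis-parsed extract (values hypothetical)
df <- data.frame(
  idx    = 1:2,                                   # spurious index column
  end_dt = c("NA NA", "2010-05-01 00:00:00"),
  stringsAsFactors = FALSE
)

df <- df[, -1, drop = FALSE]                      # step 3: drop the index column
colnames(df) <- c("drug_exposure_end_datetime")   # step 4: reassign the name
df$verbatim_end_date <- NA                        # step 5: add the all-NA column
df$drug_exposure_end_datetime[
  df$drug_exposure_end_datetime == "NA NA"] <- NA # step 6: fix "NA NA" strings
```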

@zsenousy

Concept

Here is a detailed breakdown of each step to clean and save the data:

  1. Reading the Data:
    The read_delim() function is used to read the CSV file, specifying a space delimiter (" ") and explicit quote handling. read_delim() again warns about parsing issues, which can be reviewed with problems(); here they are ignored.

    concept <- read_delim("./path/to/your/file/uncleaned_concept_new.csv", delim = " ", quote = '"', col_names = TRUE, trim_ws = TRUE)
  2. Mutating 'invalid_reason' Column:
    The mutate() function from the dplyr package is used to modify the invalid_reason column. It first removes quotes using gsub(), then extracts the date (before the first space) and the code (after the first space), creating new columns invalid_reason_before and invalid_reason_after. Finally, the original invalid_reason column is removed using select(-invalid_reason).

    concept <- concept %>%
        mutate(
            invalid_reason = gsub("\"", "", invalid_reason),  # Remove quotes
            invalid_reason_before = sub(" .*", "", invalid_reason),  # Extract date before space
            invalid_reason_after = sub(".* ", "", invalid_reason)  # Extract code after the space
        ) %>%
        select(-invalid_reason)  # Remove original invalid_reason column
  3. Checking Structure:
    The str() function is used to examine the structure of the modified dataset, ensuring the new columns are properly added and the original incorrectly parsed invalid_reason column is removed.

    str(concept)
  4. Dropping the concept_id Column:
    The concept_id column is removed from the dataset using select(). It holds index values rather than the real concept ids, so it is dropped.

    concept <- concept %>% select(-concept_id)
  5. Renaming Columns:
    The colnames() function reassigns the correct column names, in order, after the index column was removed in step 4.

    colnames(concept) <- c("concept_id", "concept_name", "domain_id", "vocabulary_id", 
                           "concept_class_id", "standard_concept", "concept_code", 
                           "valid_start_date", "valid_end_date", "invalid_reason")
  6. Converting Dates:
    The valid_end_date column is converted to the Date type using as.Date() with the format "%Y-%m-%d". This ensures the dates are handled correctly in further operations.

    concept$valid_end_date <- as.Date(concept$valid_end_date, format = "%Y-%m-%d")
  7. Writing the Cleaned Data:
    The write_csv() function is used to save the cleaned concept dataset to a new CSV file, cleaned_concept.csv.

    write_csv(concept, "./path/to/your/file/cleaned_concept.csv")
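The regex logic in step 2 can be checked in isolation on a single sample value, here with base R rather than dplyr (the cell value below is hypothetical):

```r
# Hypothetical raw cell: a quoted date followed by a code
x <- "\"2019-12-31 D\""

x      <- gsub("\"", "", x)   # remove quotes -> "2019-12-31 D"
before <- sub(" .*", "", x)   # keep text before the first space -> "2019-12-31"
after  <- sub(".* ", "", x)   # keep text after the last space   -> "D"
```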
