OMOP ES files as inputs to PIXL CLI #159

stefpiatek · 2023-12-01T17:35:54Z

Definition of Done / Acceptance Criteria

OMOP ES parquet files parsed
- patient identifiers, imaging accession numbers, study date and OMOP ES's image identifier (procedure_id) added to rabbitmq messages
OMOP log file (json) parsed
- project name added to rabbitmq messages (settings.cdm_source_name in the json)
- timestamp of omop es run added to rabbitmq messages (datetime in the json)
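
As a rough illustration of the log-file criteria above, here is a minimal sketch of pulling the two fields out of the JSON. The function name `read_extract_log` is hypothetical, and the sketch assumes `datetime` is stored as an ISO 8601 string, which the issue does not confirm.

```python
import json
from datetime import datetime
from pathlib import Path


def read_extract_log(log_path: Path) -> tuple[str, datetime]:
    """Return (project name, OMOP ES run timestamp) from the JSON log."""
    log = json.loads(log_path.read_text())
    # Project name lives under settings.cdm_source_name, as described above
    project_name = log["settings"]["cdm_source_name"]
    # Assumption: the timestamp is an ISO 8601 string
    extract_time = datetime.fromisoformat(log["datetime"])
    return project_name, extract_time
```

Both values could then be attached to every message built from the same export.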

Testing

Convert the current CSV tests to use parquet files. It may be easier for reviewing to create a helper function that reads in the test CSV files and writes the required parquet files to a tmpdir. That way we can keep plain text as inputs, so it's easier to compare diffs in test inputs.

Documentation

Update README for inputs to CLI

Dependencies

Details and Comments

Current state

Currently there is a csv file input which defines the MRN, accession number and study datetime.

Info

We're expecting the same filename structure for tables -> parquet. So maybe take a directory as an input?
OMOP ES will be doing all the cohort definition work, with its output as a parquet
parquet has the benefit of being binary, fast, and has typed columns
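
If a directory is taken as the input, a small up-front check could report which files are missing. This is only a sketch: `missing_inputs` is a hypothetical name, the file names are taken from the dummy files shared later in this thread, and the position of the JSON log at the top level of the directory is an assumption.

```python
from pathlib import Path

# Assumed layout, based on the dummy files shared in this thread
EXPECTED_FILES = (
    Path("public") / "PROCEDURE_OCCURRENCE.parquet",
    Path("private") / "PERSON_LINKS.parquet",
    Path("private") / "PROCEDURE_OCCURRENCE_LINKS.parquet",
    Path("extract_summary.json"),  # assumption: log sits at the top level
)


def missing_inputs(extract_dir: Path) -> list[Path]:
    """Return the expected relative paths that are absent from extract_dir."""
    return [rel for rel in EXPECTED_FILES if not (extract_dir / rel).is_file()]
```

An empty list means the directory has everything the sketch expects; otherwise the CLI could exit early and list what is missing.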


stefpiatek · 2023-12-11T10:31:43Z

Files

We have dummy files. I think the procedure concept use might need to be updated, but this gives us the structure to get started.

private.zip
public.zip
extract_summary.json

Started poking around with the files after extracting the data from the parquet files

from pathlib import Path

import pandas as pd

public_dir = Path("public")
private_dir = Path("private")


if __name__ == "__main__":
    # MRN is in people.PrimaryMrn
    people = pd.read_parquet(private_dir / "PERSON_LINKS.parquet")
    # accession number is in accessions.AccessionNumber
    accessions = pd.read_parquet(private_dir / "PROCEDURE_OCCURRENCE_LINKS.parquet")
    # study date is in procedure.procedure_date
    procedure = pd.read_parquet(public_dir / "PROCEDURE_OCCURRENCE.parquet")
    # Join the data together. merge() matches on column values, whereas
    # DataFrame.join(on=...) would match against the other frame's index.
    # Assumes the links table carries a procedure_occurrence_id column.
    people_procedures = people.merge(procedure, on="person_id", suffixes=("_people", ""))
    joined = people_procedures.merge(
        accessions, on="procedure_occurrence_id", suffixes=("", "_links")
    )
    # TODO filter by procedure concept to match the imaging type, could hardcode for now
    print(joined[["person_id", "PrimaryMrn", "AccessionNumber", "procedure_date"]])

jeremyestein · 2023-12-11T16:29:00Z

What is wrong with the current CSV method? If this is a performance concern, have any measurements been made to confirm this?
Also, why are there multiple parquet files (that look like a 1:1 dump of OMOP tables) instead of say, a single parquet file that comes from an OMOP query, that is similar in format to the current CSV format?

ruaridhg · 2023-12-11T16:38:48Z

Also, what was the difference between the private and public parquet files? And which parquet file are we using as an input for the PIXL cli and/or are we combining data from both files as the input?

stefpiatek · 2023-12-11T17:00:03Z

> What is wrong with the current CSV method? If this is a performance concern, have any measurements been made to confirm this?
> Also, why are there multiple parquet files (that look like a 1:1 dump of OMOP tables) instead of say, a single parquet file that comes from an OMOP query, that is similar in format to the current CSV format?

Nah not performance. OMOP ES is now defining the cohort definition, we want to use their output as the input to the tool so the workflow is simplified. They are indeed dumps of OMOP tables, we're going to publish public parquet files to the DSH.

> Also, what was the difference between the private and public parquet files? And which parquet file are we using as an input for the PIXL cli and/or are we combining data from both files as the input?

public: no identifiable data, we will export this to the DSH without modifying the files
private: has the links of the sequential ids to real identifiers so that we can link with real data
yep we're combining both to be able to link the data with clinical data

jeremyestein · 2023-12-13T16:05:08Z

We have decided we're not doing filtering right now, but this is relevant for when we do:

@ruaridhg and I looked at the example parquet files and found the following two OMOP codes:
“CT of chest” https://athena.ohdsi.org/search-terms/terms/4058335
“CT of thorax with contrast” https://athena.ohdsi.org/search-terms/terms/4327032
Firstly, are we interested in X-rays or CTs? I thought it was the former.
Also, given the way the OMOP ontology works, there isn’t going to be a single code for most things. You have to traverse the “is-a” relationships to find what you want. I don’t think this will add that much complexity to the code though - we don’t have to query a live OMOP database; we could just do it once and hard code those values. Might get a bit unwieldy if we expand beyond chest x-rays.

dcartner · 2023-12-14T17:34:52Z

4163872 & 42538241 are the codes for the X-rays we're using at UCLH, as translated into OMOP by Leilei
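
For when filtering is picked up, the hard-coding approach discussed above could be sketched as follows. The function name `only_imaging` is hypothetical, and the sketch assumes the export keeps the standard OMOP `procedure_concept_id` column in PROCEDURE_OCCURRENCE.

```python
import pandas as pd

# Concept IDs quoted in this thread for the X-rays used at UCLH
IMAGING_CONCEPT_IDS = {4163872, 42538241}


def only_imaging(procedures: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose procedure_concept_id is in the hard-coded set."""
    # Assumption: the export keeps the OMOP procedure_concept_id column
    is_imaging = procedures["procedure_concept_id"].isin(IMAGING_CONCEPT_IDS)
    return procedures[is_imaging]
```

If the set grows beyond a handful of codes, it may be worth generating it once from the "is-a" hierarchy rather than maintaining it by hand.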

stefpiatek · 2023-12-14T17:35:26Z

Thanks @dcartner. I wonder if it's reasonable to ask for the OMOP ES log to define which OMOP IDs are imaging for each export. That way we can process this generically.

dcartner · 2023-12-14T18:26:05Z

Good question, I'm not very familiar with the log file at the moment but will get back to you

stefpiatek added this to the 100-days milestone Dec 1, 2023

Chloestockford83 assigned ruaridhg and milanmlft Dec 11, 2023

stefpiatek assigned jeremyestein and unassigned milanmlft Dec 11, 2023

ruaridhg mentioned this issue Dec 15, 2023

Parquet input to OS #186

Merged

stefpiatek assigned milanmlft and unassigned ruaridhg Dec 18, 2023

milanmlft mentioned this issue Dec 19, 2023

Refactor message serialisation and deserialisation #197

Merged

milanmlft closed this as completed in #186 Dec 21, 2023

stefpiatek mentioned this issue Dec 22, 2023

Copy public extracts upon export of data #201

Merged

stefpiatek mentioned this issue Jan 30, 2024

Ensure only imaging studies are processed from OMOP ES parquet files #212

Closed
