OMOP ES files as inputs to PIXL CLI #159
Comments
Files
We have dummy files. I think the procedure concept used might need to be updated, but this gives us the structure to get started: private.zip
I started poking around with the files after extracting the data from the parquet files:

from pathlib import Path

import pandas as pd

public_dir = Path("public")
private_dir = Path("private")

if __name__ == "__main__":
    # MRN is in people.PrimaryMrn
    people = pd.read_parquet(private_dir / "PERSON_LINKS.parquet")
    # accession number is in accessions.AccessionNumber
    accessions = pd.read_parquet(private_dir / "PROCEDURE_OCCURRENCE_LINKS.parquet")
    # study date is in procedure.procedure_date
    procedure = pd.read_parquet(public_dir / "PROCEDURE_OCCURRENCE.parquet")
    # joining data together on the shared key columns
    # (DataFrame.join(on=...) would match against the other frame's index, not its column)
    people_procedures = people.merge(procedure, on="person_id", suffixes=("_people", ""))
    joined = people_procedures.merge(accessions, on="procedure_occurrence_id", suffixes=("", "_links"))
    # TODO filter by procedure concept to match the imaging type, could hardcode for now
    print(joined[["person_id", "PrimaryMrn", "AccessionNumber", "procedure_date"]])
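The join logic can be checked without the parquet files. The following is a minimal sketch on in-memory frames: the row values are invented for illustration, only the column names come from the snippet above, and `build_cohort` is a hypothetical helper name. It uses `DataFrame.merge`, which matches on column values, and assumes one row per `person_id` in the person links and one accession per procedure occurrence.

```python
import pandas as pd

# Invented rows for illustration; only the column names come from the thread.
people = pd.DataFrame({"person_id": [10, 11], "PrimaryMrn": ["MRN-A", "MRN-B"]})
procedure = pd.DataFrame(
    {
        "procedure_occurrence_id": [100, 101, 102],
        "person_id": [10, 11, 11],
        "procedure_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    }
)
accessions = pd.DataFrame(
    {
        "procedure_occurrence_id": [100, 101, 102],
        "AccessionNumber": ["ACC-1", "ACC-2", "ACC-3"],
    }
)


def build_cohort(people, procedure, accessions):
    """Join the three tables on their shared key columns (hypothetical helper)."""
    # validate= raises MergeError if the assumed cardinality does not hold.
    people_procedures = people.merge(procedure, on="person_id", validate="one_to_many")
    joined = people_procedures.merge(
        accessions, on="procedure_occurrence_id", validate="one_to_one"
    )
    return joined[["person_id", "PrimaryMrn", "AccessionNumber", "procedure_date"]]


cohort = build_cohort(people, procedure, accessions)
print(cohort)
```

The `validate` arguments are optional; they are included here so that a duplicated key in the real exports would fail loudly instead of silently multiplying rows.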
What is wrong with the current CSV method? If this is a performance concern, have any measurements been made to confirm this?
Also, what is the difference between the private and public parquet files? Which parquet file are we using as the input for the PIXL CLI, or are we combining data from both files?
Nah, not performance. OMOP ES is now defining the cohort definition, and we want to use their output as the input to the tool so the workflow is simplified. They are indeed dumps of OMOP tables, we're going to publish
We have decided we're not doing filtering right now, but this is relevant for when we do: @ruaridhg and I looked at the example parquet files and found the following two OMOP codes:
Thanks @dcartner, I wonder if it's reasonable to ask for the OMOP ES log to define which OMOP IDs are imaging for each export. That way we can process this generically.
Good question. I'm not very familiar with the log file at the moment but will get back to you.
Definition of Done / Acceptance Criteria
[...] (procedure_id) added to rabbitmq messages
[...] (settings.cdm_source_name in the json)
[...] (datetime in the json)
Testing
Convert the current CSV tests to use parquet files. It may be easier for reviewing to create a helper function that reads in the test CSV files and writes the required parquet files to a tmpdir. That way we can keep plain text as inputs, so it's easier to compare diffs in test inputs.
Documentation
Dependencies
Details and Comments
Current state
Currently there is a CSV file input which defines the MRN, accession number and study datetime.
Info