DIG-1045: First pass at MoH model #13

daisieh · 2023-06-09T05:55:01Z

Based on the katsu openapi schema's DonorWithClinicalData schema, we can now generate a template file, update it with mappings, and run CSVConvert on it.

From this repo, you can only really test template generation:

python generate_schema.py --url https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml

Detailed testing can be done in the subsequent clinical_ETL_data pull request.
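
For readers who want a feel for what template generation involves, here is a minimal sketch of flattening a nested schema into dotted template field names. It is not the code in generate_schema.py: the `flatten_schema` helper and the toy schema dictionary are assumptions made for illustration, and the real katsu OpenAPI schema is considerably more involved.

```python
def flatten_schema(schema, prefix=""):
    """Return dotted field names for a nested toy schema.

    Assumed toy format (hypothetical, not the katsu schema layout):
    a dict value is a nested object, a one-element list is an array of
    objects, and anything else is a leaf field.
    """
    fields = []
    for name, value in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, list):
            # Arrays get a positional placeholder segment before their children.
            fields.extend(flatten_schema(value[0], f"{path}.0"))
        elif isinstance(value, dict):
            fields.extend(flatten_schema(value, path))
        else:
            fields.append(path)
    return fields


# Toy schema for illustration only.
toy_schema = {
    "submitter_donor_id": "string",
    "primary_diagnoses": [
        {
            "submitter_primary_diagnosis_id": "string",
            "specimens": [{"submitter_specimen_id": "string"}],
        }
    ],
}

for field in flatten_schema(toy_schema):
    print(field)
```

Each printed line corresponds to one row that a mapping template could carry.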

kcranston · 2023-06-12T17:58:30Z

This is working for me. Yay! I think we need some more documentation, particularly on how to use the new template (explaining the formatting, specifying the index field, etc.). See the related comment on the ETL_code PR.

daisieh · 2023-06-12T18:14:24Z

I do have one concern about the way we're doing this now: by default, CSVConvert will try to find values for all of the template fields, even if a mapping hasn't been explicitly specified. Is this a good idea? It might make it very hard to figure out which field is the one causing an error if the user hasn't specified something.
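
To make the concern concrete, here is a rough sketch of the kind of default behaviour being discussed. It is an assumption about the general idea, not CSVConvert's actual logic: `resolve_fields`, the sample row, and the field names are all hypothetical. Fields with an explicit mapping use it; every other template field falls back to an exact match on the input column name, and the sketch keeps track of which fields were filled that way.

```python
def resolve_fields(template_fields, explicit_mappings, row):
    """Resolve a value for every template field from one input row.

    Returns (values, defaulted): defaulted lists the fields that had no
    explicit mapping and were filled by an exact column-name match.
    """
    values, defaulted = {}, []
    for field in template_fields:
        if field in explicit_mappings:
            values[field] = explicit_mappings[field](row)
            continue
        column = field.split(".")[-1]  # last segment of the dotted path
        if column in row:
            values[field] = row[column]
            defaulted.append(field)
    return values, defaulted


# Hypothetical input row and mappings, for illustration only.
row = {"submitter_donor_id": "DONOR_1", "dob": "1970-01"}
mappings = {"date_of_birth": lambda r: r["dob"]}
fields = ["submitter_donor_id", "date_of_birth", "cause_of_death"]

values, defaulted = resolve_fields(fields, mappings, row)
print(values)
print(defaulted)
```

Reporting the defaulted list is one way a user could see which fields were never explicitly mapped when tracking down an error.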

daisieh · 2023-06-12T18:16:20Z

Also: should the running of ingest_redcap_data just happen as part of running CSVConvert, or is it better to do that as a separate script?

kcranston · 2023-06-12T18:18:09Z

Looking at the new template and code (and the example in ETL_data), here are some questions that we should probably address in the documentation:

- What do the lines that start with ## mean, e.g. ##primary_diagnoses or ##primary_diagnoses.0?
- We need to explain the new indexed_on mapping function.
- What do the 0s mean in the fields, e.g. primary_diagnoses.0.specimens.0.submitter_specimen_id?
- We should explain the nesting.
- What happens if no mapping function is supplied?
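
As one possible way to picture the nesting, the sketch below turns dotted field names into a nested record, treating numeric segments as positions in a list. This is an illustrative reading of the field names, not a description of what CSVConvert does internally; `set_nested` and the specimen values are hypothetical.

```python
def set_nested(record, path, value):
    """Store value in record at a dotted path; numeric segments index lists."""
    parts = path.split(".")
    node = record
    for part, next_part in zip(parts, parts[1:]):
        # The container to create depends on whether the next segment is an index.
        child = [] if next_part.isdigit() else {}
        if part.isdigit():
            index = int(part)
            while len(node) <= index:
                node.append(None)
            if node[index] is None:
                node[index] = child
            node = node[index]
        else:
            node = node.setdefault(part, child)
    last = parts[-1]
    if last.isdigit():
        index = int(last)
        while len(node) <= index:
            node.append(None)
        node[index] = value
    else:
        node[last] = value
    return record


record = {}
set_nested(record, "primary_diagnoses.0.specimens.0.submitter_specimen_id", "SPEC_1")
set_nested(record, "primary_diagnoses.0.specimens.1.submitter_specimen_id", "SPEC_2")
print(record)
```

Under this reading, each numeric segment selects one item in a list of nested objects, so a donor can carry several diagnoses, each with several specimens.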

kcranston · 2023-06-12T18:19:43Z

> Also: should the running of ingest_redcap_data just happen as part of running CSVConvert, or is it better to do that as a separate script?

I think separate (different electronic data capture systems will all need a custom massage script). We should specify the required input format for CSVConvert, though.

daisieh · 2023-06-12T18:21:48Z

We could specify the massage script as part of the manifest, though: I like being able to connect all of the scripts that were run in a single file package so that the provenance of how we got our ingest data is very clear.

daisieh · 2023-06-12T18:23:21Z

Also I think we should move all of the documentation that is currently in the data repo README to this one, since it's not a guarantee that users will have access to that private repo.

kcranston · 2023-06-12T18:35:10Z

> I do have one concern about the way we're doing this now: by default, CSVConvert will try to find values for all of the template fields, even if a mapping hasn't been explicitly specified. Is this a good idea? It might make it very hard to figure out which field is the one causing an error if the user hasn't specified something.

In general, I think this is the behaviour we want. In order to make it less confusing / easier to debug, we probably want a combination of the following:

- a verbose setting that logs everything
- a report at the end with a summary of warnings, errors, coverage, etc.
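
A minimal sketch of how those two pieces could fit together, using Python's standard logging module. The `ConversionReport` class, the `configure_logging` helper, and the field names are hypothetical and only illustrate the shape of the idea; nothing here reflects existing CSVConvert code.

```python
import logging
from collections import Counter

logger = logging.getLogger("conversion_sketch")


def configure_logging(verbose):
    """Log everything when verbose is set, otherwise only warnings and errors."""
    logging.basicConfig(level=logging.DEBUG if verbose else logging.WARNING)


class ConversionReport:
    """Collects warnings, errors, and field coverage during a conversion run."""

    def __init__(self, template_fields):
        self.template_fields = set(template_fields)
        self.filled = set()
        self.counts = Counter()

    def record_value(self, field):
        self.filled.add(field)
        logger.debug("filled %s", field)

    def warn(self, field, message):
        self.counts["warnings"] += 1
        logger.warning("%s: %s", field, message)

    def error(self, field, message):
        self.counts["errors"] += 1
        logger.error("%s: %s", field, message)

    def summary(self):
        covered = len(self.filled & self.template_fields)
        return {
            "warnings": self.counts["warnings"],
            "errors": self.counts["errors"],
            "coverage": f"{covered}/{len(self.template_fields)}",
        }


configure_logging(verbose=True)
report = ConversionReport(["submitter_donor_id", "date_of_birth", "cause_of_death"])
report.record_value("submitter_donor_id")
report.warn("date_of_birth", "no value found")
print(report.summary())
```

Keeping the counts in one object means the end-of-run summary can be printed or written to a file regardless of the verbosity setting.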

daisieh · 2023-06-12T18:44:51Z

I would like to dispense with the ## header lines, which I had described as informational in the readme: "Entries that begin with ## are informational: they can be overwritten or deleted completely from the mapping file." Were these helpful at all?

In addition, is the primary_diagnoses.0.treatments.0.chemotherapies.0 style of description in the mapping template confusing for users? Would it be more helpful to replace those 0 values with, perhaps, the word INDEX, so it'd be like primary_diagnoses.INDEX.treatments.INDEX.chemotherapies.INDEX?

kcranston · 2023-06-12T19:38:53Z

I am going to put on my documentation hat and take a stab at more updates to the readme before approving.

kcranston · 2023-06-12T19:39:39Z

> I would like to dispense with the ## header lines, which I had described as informational in the readme: "Entries that begin with ## are informational: they can be overwritten or deleted completely from the mapping file." Were these helpful at all?

I think more confusing than helpful

> In addition, is the primary_diagnoses.0.treatments.0.chemotherapies.0 style of description in the mapping template confusing for users? Would it be more helpful to replace those 0 values with, perhaps, the word INDEX, so it'd be like primary_diagnoses.INDEX.treatments.INDEX.chemotherapies.INDEX?

I like this suggestion
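
The proposed renaming is simple to express. The sketch below is one way to do it; `use_index_placeholder` is a hypothetical helper name, not a function from this repo.

```python
def use_index_placeholder(field):
    """Replace purely numeric path segments with the word INDEX."""
    return ".".join(
        "INDEX" if part.isdigit() else part for part in field.split(".")
    )


print(use_index_placeholder("primary_diagnoses.0.treatments.0.chemotherapies.0"))
```

Splitting on the dots, rather than doing a plain text replacement of "0", avoids touching field names that merely contain a digit.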

kcranston · 2023-06-13T17:52:54Z

Pushed a first pass at updates to the readme (note that I am pulling out the mapping function documentation into a separate file)

kcranston added 30 commits February 1, 2023 09:26

add code for generating template from moh schema

b51c9ba

remove csv from ignore so we can add sample templates

b4a667d

sample moh template

f2b7226

hard code top level moh schemas

a49ffd4

more detail in readme

ec24362

remove katsu requirement

3c4ca8a

use manifest for manifest file

65daf7c

drop empty columns, clean up manufest import

1caebf8

separate pre-processing of redcap export

db5c952

save tmpfiles in parent of input_path

2e67070

save to parent of input_path, clean up prints

5beadf8

fix copy bug, delete empty rows method

a7444ad

separate scaffold creation from manifest read

bfe43c1

add key to map_row method so we know what field

9ceed92

check for exact match on field name

ea56ca6

use single_val mapping on exact match fields

4de15c0

do not output index col

de24724

add moh schema class

172c039

rename schema ids

283295c

update readme

1e5f1e1

add requests dependency

05ab56b

reuse skip schema list

2eb0b9d

output filename for schema

1bccd1b

break into smaller functions, lots of prints

ad78005

rename data variable

7a201d2

more informative variable name

cbf5306

update skipped schemas

f92029e

only print if verbose

84afcfc

update requirements

85b27e9

class for moh schema and mappings

817c487

daisieh requested a review from yavyx June 12, 2023 18:41

daisieh added 2 commits June 12, 2023 12:17

Update README.md

3e51fd7

Merge branch 'daisieh/test' of https://github.com/CanDIG/clinical_ETL_code into daisieh/test

0d98df3

kcranston and others added 5 commits June 13, 2023 12:41

move usage to the top, add section for input file format

5c6d4b1

update gitignore

b33d58b

describe input data, separate mapping documentation

b7269f4

rename .0 with .INDEX

2580644

remove leading descriptive lines

d42117f

daisieh added 3 commits June 13, 2023 10:54

clean up merge conflict

9961d0d

missed the INDEX fields

081b9ee

remove extra print

a1f1691

daisieh force-pushed the daisieh/test branch from d38c170 to a1f1691 Compare June 14, 2023 01:17

kcranston added 3 commits June 14, 2023 12:44

more docs for mapping functions

50ac7b0

finalize docs

33917b7

add sample manifest file

bd7fb3e

kcranston approved these changes Jun 14, 2023

daisieh merged commit bbd19fc into main Jun 14, 2023

daisieh deleted the daisieh/test branch June 14, 2023 20:12
