Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DIG-1045: First pass at MoH model #13

Merged
merged 62 commits into from
Jun 14, 2023
Merged

DIG-1045: First pass at MoH model #13

merged 62 commits into from
Jun 14, 2023

Conversation

daisieh
Copy link
Member

@daisieh daisieh commented Jun 9, 2023

Based on the katsu openapi schema's DonorWithClinicalData schema, we can now generate a template file, update it with mappings, and run CSVConvert on it.

From this repo, you can only really test template generation:

python generate_schema.py --url https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml

Detailed testing can be done in the subsequent clinical_ETL_data pull request.

@kcranston
Copy link
Member

This is working for me. Yay! I think we need some more documentation, particularly on how to use the new template (explaining the formatting, specifying the index field, etc). See related comment on the ETL_code PR

@daisieh
Copy link
Member Author

daisieh commented Jun 12, 2023

I do have one concern about the way we're doing this now: by default, CSVConvert will try to find values for all of the template fields, even if a mapping hasn't been explicitly specified. Is this a good idea? It might make it very hard to figure out which field is the one causing an error if the user hasn't specified something.

@daisieh
Copy link
Member Author

daisieh commented Jun 12, 2023

Also: should the running of ingest_redcap_data just happen as part of running CSVConvert, or is it better to do that as a separate script?

@kcranston
Copy link
Member

Looking at the new template and code (and the example in ETL_data), here are some questions that we should probably address in the documentation:

  • what do the lines that start with ## mean, e.g. ##primary_diagnoses, or ##primary_diagnoses.0,
  • we need to explain the new indexed_on mapping function
  • what do the 0 mean in the fields, e.g. primary_diagnoses.0.specimens.0.submitter_specimen_id,
  • we should explain the nesting
  • what happens if no mapping function is supplied?

@kcranston
Copy link
Member

Also: should the running of ingest_redcap_data just happen as part of running CSVConvert, or is it better to do that as a separate script?

I think separate (different electronic data capture systems will all need a custom massage script). We should specify the required input format for CSVConvert, though.

@daisieh
Copy link
Member Author

daisieh commented Jun 12, 2023

We could specify the massage script as part of the manifest, though: I like being able to connect all of the scripts that were run in a single file package so that the provenance of how we got our ingest data is very clear.

@daisieh
Copy link
Member Author

daisieh commented Jun 12, 2023

Also I think we should move all of the documentation that is currently in the data repo README to this one, since it's not a guarantee that users will have access to that private repo.

@kcranston
Copy link
Member

I do have one concern about the way we're doing this now: by default, CSVConvert will try to find values for all of the template fields, even if a mapping hasn't been explicitly specified. Is this a good idea? It might make it very hard to figure out which field is the one causing an error if the user hasn't specified something.

In general, I think this is the behaviour we want. In order to make it less confusing / easier to debug, we probably want a combination of the following:

  • a verbose setting that logs everything
  • a report at the end with a summary of warnings, errors, coverage, etc

@daisieh daisieh requested a review from yavyx June 12, 2023 18:41
@daisieh
Copy link
Member Author

daisieh commented Jun 12, 2023

I would like to dispense with the ## header lines, which I had described as informational in the readme: Entries that begin with ## are informational: they can be overwritten or deleted completely from the mapping file. Were these helpful at all?

In addition, is the primary_diagnoses.0.treatments.0.chemotherapies.0 style of description in the mapping template confusing for users? Would it be more helpful to replace those 0 values with, perhaps, the word INDEX, so it'd be like primary_diagnoses.INDEX.treatments.INDEX.chemotherapies.INDEX?

@kcranston
Copy link
Member

I am going to put on my documentation hat and take a stab at more updates to the readme before approving.

@kcranston
Copy link
Member

I would like to dispense with the ## header lines, which I had described as informational in the readme: Entries that begin with ## are informational: they can be overwritten or deleted completely from the mapping file. Were these helpful at all?

I think more confusing than helpful

In addition, is the primary_diagnoses.0.treatments.0.chemotherapies.0 style of description in the mapping template confusing for users? Would it be more helpful to replace those 0 values with, perhaps, the word INDEX, so it'd be like primary_diagnoses.INDEX.treatments.INDEX.chemotherapies.INDEX?

I like this suggestion

@kcranston
Copy link
Member

Pushed a first pass at updates to the readme (note that I am pulling out the mapping function documentation into a separate file)

@daisieh daisieh merged commit bbd19fc into main Jun 14, 2023
@daisieh daisieh deleted the daisieh/test branch June 14, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants