A question about #75

Closed
rjiang9 opened this issue Aug 29, 2024 · 16 comments · Fixed by #78

@rjiang9

rjiang9 commented Aug 29, 2024

Hi folks,

In the REDCap sample inputs folder, sample_inputs/redcap_example/manifest.yml has only the schema line and does not designate a schema_class:

schema: https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml

In the generic folder, manifest.yml has:

schema: https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml
# class of schema for validation:
schema_class: MoHSchemaV3

In the ETL_code root, there are moh_v3_template.csv and moh_v2_template.csv, generated using different schemas.

I am working on a REDCap data mapping with the most recent CanDIG and the develop branch of ETL_code. Which schema should I use? I am also using the template from the redcap folder; is there anything I need to pay attention to?

Thanks a lot,
Ray

@rjiang9
Author

rjiang9 commented Aug 29, 2024

Primary_site is under Donor in the sample redcap template, but in the v3 template it is under primary_diagnoses.

@rjiang9
Author

rjiang9 commented Aug 29, 2024

When I use schema_class: MoHSchemaV2, I get:

(candigetl) ➜  CANDIG python clinical_ETL_code/src/clinical_etl/CSVConvert.py --input carolyn-mappings/Singleton.csv --manifest carolyn-mappings/manifest.yml
Starting conversion...


 ==== Print module and schema_class and schema ...
<module 'clinical_etl.mohschemav2' from '/Users/ray.jiang/miniforge3/envs/candigetl/lib/python3.12/site-packages/clinical_etl/mohschemav2.py'>

MoHSchemaV2

https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml
 ==== Print end.

Traceback (most recent call last):
  File "/Users/ray.jiang/Documents/CANDIG/clinical_ETL_code/src/clinical_etl/CSVConvert.py", line 827, in <module>
    main()
  File "/Users/ray.jiang/Documents/CANDIG/clinical_ETL_code/src/clinical_etl/CSVConvert.py", line 816, in main
    packets, errors = csv_convert(input_path, manifest_file, minify=args.minify, index_output=args.index,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ray.jiang/Documents/CANDIG/clinical_ETL_code/src/clinical_etl/CSVConvert.py", line 655, in csv_convert
    manifest = load_manifest(manifest_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ray.jiang/Documents/CANDIG/clinical_ETL_code/src/clinical_etl/CSVConvert.py", line 610, in load_manifest
    schema = getattr(schema_mod, schema_class)(manifest["schema"])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ray.jiang/miniforge3/envs/candigetl/lib/python3.12/site-packages/clinical_etl/schema.py", line 115, in __init__
    self.template = self.add_default_mappings(raw_template)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ray.jiang/miniforge3/envs/candigetl/lib/python3.12/site-packages/clinical_etl/schema.py", line 246, in add_default_mappings
    index_value = self.validation_schema[temp]["id"]
                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: 'systemic_therapies'

But systemic_therapies looks new to schema V3.
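
The traceback is consistent with a version mismatch: the class named in schema_class seems to carry its own lookup table, and an object name read from the schema at the URL is missing from that table. A minimal sketch of that failure mode follows. The dictionaries and function names are hypothetical stand-ins for illustration, not the real clinical_etl internals.

```python
# Hypothetical stand-in for a lookup table tied to the v2 schema class.
V2_VALIDATION_SCHEMA = {
    "donors": {"id": "submitter_donor_id"},
    "chemotherapies": {"id": None},
}

# Hypothetical object names as they might be read from a v3 schema file.
V3_SCHEMA_OBJECTS = ["donors", "systemic_therapies"]


def missing_objects(validation_schema, schema_objects):
    """Return schema object names that the validation table does not know."""
    return [name for name in schema_objects if name not in validation_schema]


def index_fields(validation_schema, schema_objects):
    """Mimic the failing lookup: raises KeyError on the first unknown name."""
    return {name: validation_schema[name]["id"] for name in schema_objects}
```

Calling missing_objects before the lookup would report the unknown name instead of failing with a KeyError, which makes a v2/v3 mismatch easier to spot.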

@mshadbolt
Contributor

Hi Ray! The MoHCCN recently transitioned to the v3 model, which is now available on their website: https://www.marathonofhopecancercentres.ca/researcher-hub/policies-and-guidelines

As you have noticed, this included a few major changes such as the addition of systemic therapies, removal of chemotherapy/immunotherapy/hormone therapy objects, and moving primary site to primary diagnosis.

The sample redcap template is from a v2 model export; we have yet to do a v3 model export at this stage. Sorry for the confusion there.

If you are running the latest develop stack, you would need to have a clinical ingest json that is valid against the v3 data model schema. Was the redcap data you are working with curated to the v2 or v3 version of the data model? Happy to take a look at your csv template and manifest file to see if I can spot anything that might be causing issues.

We are planning on making a stable release including these latest data model updates in about a month or so. There are also a few more minor changes coming in data model v3.1, then we are hoping the data model stays stable for a while...

@rjiang9
Author

rjiang9 commented Aug 29, 2024

Hi Marion,

First of all, thank you so much for getting back to me. I appreciate it.

> If you are running the latest develop stack, you would need to have a clinical ingest json that is valid against the v3 data model schema.

We are running CanDIG v4.1.0.

> Was the redcap data you are working with curated to the v2 or v3 version of the data model?

I took the template from sample_inputs/redcap_example/redcap2moh.csv and did the mappings with customized mapping functions in redcap.py.

> Happy to take a look at your csv template and manifest file to see if I can spot anything that might be causing issues.

I will attach the template and manifest file to this thread. I appreciate your help.

> We are planning on making a stable release including these latest data model updates in about a month or so. There are also a few more minor changes coming in data model v3.1, then we are hoping the data model stays stable for a while...

Thank you and the team for all this work.

redcap2moh.csv

@rjiang9
Author

rjiang9 commented Aug 29, 2024

Here is the content of manifest.yml:

description:  The mappings of REDCap data to MoHpackets format for katsu
mapping: redcap2moh.csv
identifier: submitter_donor_id
schema: https://raw.githubusercontent.com/CanDIG/katsu/develop/chord_metadata_service/mohpackets/docs/schema.yml
schema_class: MoHSchemaV2
reference_date: earliest_date(Singleton.date_of_diagnosis)
date_format: YMD
functions:
    - redcap

@rjiang9
Author

rjiang9 commented Aug 29, 2024

redcap.txt

@mshadbolt
Contributor

Hi Ray, thanks for sharing the files.

If you want to ingest the data into the stack running v4.1.0, the data will need to be compatible with data model v2. So the schema in the manifest will need to be the one on the stable branch of katsu. Can you try adjusting your manifest to:

description:  The mappings of REDCap data to MoHpackets format for katsu
mapping: redcap2moh.csv
identifier: submitter_donor_id
schema: https://raw.githubusercontent.com/CanDIG/katsu/stable/chord_metadata_service/mohpackets/docs/schema.yml
schema_class: MoHSchemaV2
reference_date: earliest_date(Singleton.date_of_diagnosis)
date_format: YMD
functions:
    - redcap

Let me know if this works!
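
The pairing suggested above (stable branch schema with MoHSchemaV2, develop branch schema with MoHSchemaV3) can be checked before running the conversion. The sketch below is only an illustration: check_manifest is a hypothetical helper, not part of clinical_etl, and the branch-to-version rules are assumptions taken from this thread at the time of writing. It expects the manifest already parsed into a dict.

```python
def check_manifest(manifest):
    """Return warnings for schema/schema_class pairings discussed in this thread.

    `manifest` is the manifest.yml content already parsed into a dict.
    The rules are assumptions drawn from this thread, not clinical_etl behaviour.
    """
    warnings = []
    schema_url = manifest.get("schema", "")
    schema_class = manifest.get("schema_class")

    if schema_class == "MoHSchemaV2" and "/develop/" in schema_url:
        warnings.append("MoHSchemaV2 is paired with the develop (v3) schema")
    elif schema_class == "MoHSchemaV3" and "/stable/" in schema_url:
        warnings.append("MoHSchemaV3 is paired with the stable (v2) schema")
    return warnings
```

With the manifest posted earlier (develop URL plus MoHSchemaV2) this returns one warning; with the adjusted manifest it returns an empty list.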

@rjiang9
Author

rjiang9 commented Aug 30, 2024

Hi Marion, I will try that out and report back.

Thank you very much,
Ray

@rjiang9
Author

rjiang9 commented Sep 4, 2024

Hi Marion,

I am trying out the sample_inputs/redcap_example but have a couple of questions:

  1. In the [redcap2moh.csv](https://github.com/CanDIG/clinical_ETL_code/blob/develop/sample_inputs/redcap_example/redcap2moh.csv), what is the Singleton in front of the SOURCE field in the right column? Is it actually the redcap_repeat_instrument row value in raw_redcap.csv? Please see the redcap2moh.csv screenshot for what I mean.

  2. For the redcap exported file, I just need to put it in a folder, and when I run the CSVConvert command, I give only the directory name instead of the full path/file_name. Is that right? (csvs/ is where I put the data file in my case below.)

python clinical_ETL_code/src/clinical_etl/CSVConvert.py --input csvs --manifest manifest.yml

Thanks a lot,
Ray

image

@rjiang9
Author

rjiang9 commented Sep 4, 2024

PS: when I was trying to run the CSVConvert on example mappings, I got

image

I assume those names were just typos and all should be raw_redcap. Is this correct?

@mshadbolt
Contributor

Hi Ray,

  1. The 'Singleton' would refer to a csv with that filename in the source csvs. You would need to edit all these source csv names in the redcap2moh.csv to match the csv in which each field is found in your data. Do you have multiple csvs in your csvs directory?
  2. For the --input, the path would be relative to where you are running the script I believe, so this would work if you are running the script from the same location as where your csvs directory and manifest.yml file are.
  3. the P.S.: If you are getting this error it would indicate that you don't have the csvs listed within your csvs input directory. Is that the case?
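
A quick way to see which source csvs a mapping expects, and which of them are absent from the input directory, is sketched below. It rests on an assumption about the mapping file format: that the right-hand column refers to source data as Name.field inside function arguments, as in earliest_date(Singleton.date_of_diagnosis) from the manifest. The helper names are hypothetical and not part of clinical_etl.

```python
import re
from pathlib import Path

# Matches "Name.field" when it appears as a function argument,
# e.g. "(Singleton.record_id" or ", Singleton.record_id".
SOURCE_REF = re.compile(r"[(,]\s*([A-Za-z_]\w*)\.[A-Za-z_]\w*")


def referenced_sources(mapping_text):
    """Collect source csv names used in the right-hand column of a mapping file."""
    names = set()
    for line in mapping_text.splitlines():
        # Skip comment lines and lines with no right-hand column.
        if line.lstrip().startswith("#") or "," not in line:
            continue
        _, expression = line.split(",", 1)
        names.update(SOURCE_REF.findall(expression))
    return names


def missing_sources(mapping_text, input_dir):
    """Return referenced names that have no matching <name>.csv in input_dir."""
    present = {path.stem for path in Path(input_dir).glob("*.csv")}
    return sorted(referenced_sources(mapping_text) - present)
```

For example, missing_sources(Path("redcap2moh.csv").read_text(), "csvs") would list the names that the mapping refers to but that have no csv in csvs/.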

When we worked with a redcap export, we needed to do some preprocessing of the redcap csv to split it up into the different csvs that correspond to the various schemas before running it through clinical_etl. Perhaps this is a missing step for your data currently?
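
As a rough idea of what such a preprocessing step could look like, here is a minimal sketch that splits one export into one csv per instrument. It is not the script CanDIG used. It assumes the export has a redcap_repeat_instrument column and that rows with a blank value there belong in Singleton.csv; both the function names and that rule are assumptions for illustration.

```python
import csv
from collections import defaultdict
from pathlib import Path

INSTRUMENT_COLUMN = "redcap_repeat_instrument"  # column named in this thread
DEFAULT_NAME = "Singleton"  # assumption: rows with no repeat instrument go here


def group_rows(rows):
    """Group row dicts by instrument; blank instrument values use DEFAULT_NAME."""
    groups = defaultdict(list)
    for row in rows:
        name = (row.get(INSTRUMENT_COLUMN) or "").strip() or DEFAULT_NAME
        groups[name].append(row)
    return dict(groups)


def split_export(export_path, out_dir):
    """Write one <instrument>.csv per group, keeping only non-empty columns."""
    with open(export_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        groups = group_rows(reader)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in groups.items():
        # Drop columns that are empty for every row of this instrument.
        used = [f for f in fieldnames if any(row.get(f) for row in rows)]
        with open(out_dir / f"{name}.csv", "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=used, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
    return sorted(groups)
```

Running split_export("raw_redcap.csv", "csvs") would then produce a csvs/ directory that can be passed to --input, provided the file names match the source names used in the mapping file.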

@rjiang9
Author

rjiang9 commented Sep 4, 2024

Thank you for clearing them up for me, Marion.

> When we worked with a redcap export, we needed to do some preprocessing of the redcap csv to split it up into the different csvs that correspond to the various schemas before running it through clinical_etl. Perhaps this is a missing step for your data currently?

This is what I missed. I thought I could just run the redcap sample mappings against the raw_redcap.csv (the single large exported csv file) included in the repo. I did not realize that the file needs to be split by schema.

Do you happen to have the split csv files produced from that raw_redcap.csv, so that I could take a look?

> 1. ... Do you have multiple csvs in your csvs directory?

At this moment, I just have one single exported csv file. As you mentioned, I need to split it up into different csv files to correspond to the various schemas.

> 2. For the --input, the path would be relative to where you are running the script I believe, so this would work if you are running the script from the same location as where your csvs directory and manifest.yml file are.

Got it.

> 3. the P.S.: If you are getting this error it would indicate that you don't have the csvs listed within your csvs input directory. Is that the case?

Here I was trying to run the sample. I don't have those csvs - just that raw_redcap.csv in the repo.

@mshadbolt
Contributor

Hi Ray,

Ok! I think I understand. I can work on sharing the python script that will split the file into csvs in the same folder.

These files are a bit out of date since the redcap export format we were working with changed. I am not sure whether or not it will be relevant for you when you export from your own redcap database.

For now, are you just trying to see how things work using this as an example or are you trying to use the same methods to convert your own real data? Does your own data follow a similar format to the raw_redcap.csv or is it different?

We provided these files as something that worked for us previously, but I am not sure how much they need to be customised for different redcap databases, or which redcap export options would affect how they run. It would be great to understand your experience so far.

@rjiang9
Author

rjiang9 commented Sep 5, 2024

Hi Marion,

The REDCap data file I have is very similar in format to the raw_redcap.csv. If you can share the split csv files and the python script that produces them, it would help me a lot to see how things work and how you do the split.

All the best,
Ray

@mshadbolt
Contributor

mshadbolt commented Sep 5, 2024

Hi Ray, I have made a PR that adds the splitting script. It will be on the develop branch when it gets approved and merged but in the meantime you can also grab the script from here:

https://github.com/CanDIG/clinical_ETL_code/blob/mshadbolt/add-redcap-csv-split-script/sample_inputs/redcap_example/split_redcap_data.py

Hope it works for your export!

@rjiang9
Author

rjiang9 commented Sep 5, 2024

This is great. Thank you so much Marion. Big help!

@rjiang9 rjiang9 closed this as completed Sep 5, 2024