Add CCDH Pilot examples #22

Merged 54 commits on Jan 5, 2022
54 commits
bae5e99
Added demonstrator files from data-model-harmonization.
gaurav Oct 5, 2021
6e86287
First stab at trying to read d1 data as Python Data Classes.
gaurav Oct 5, 2021
8035ace
Commented out parts of d1 examples so that they pass testing.
gaurav Oct 5, 2021
590b633
Commented out non-functional pieces of ccdh-pilot.
gaurav Oct 5, 2021
ee27d09
Commented out non-functional code.
gaurav Oct 5, 2021
41bc9dd
Cleaned up test_all.py.
gaurav Oct 5, 2021
2ba85e6
Added JSON Schema validation for specimens, commented out some lines.
gaurav Oct 5, 2021
36a912a
Passed all JSON Schema validation.
gaurav Oct 5, 2021
a309df9
Added test_import.py to transform some GDC data.
gaurav Oct 9, 2021
4a8d419
Added GDC-to-CRDC-H transformed data.
gaurav Oct 9, 2021
1f2aec5
Added `age_at_diagnosis`.
gaurav Oct 9, 2021
70c7055
Renamed `test_all` to `test_validate_all`; removed `days` as text.
gaurav Oct 9, 2021
ff3722b
Added `condition`.
gaurav Oct 9, 2021
af90616
Added `primary_site`.
gaurav Oct 9, 2021
6414084
Added ICD-10 codes.
gaurav Oct 9, 2021
3e9f074
Added code for constructing CancerStageObservations.
gaurav Oct 9, 2021
c27bf32
Added AJCC cancer staging, but commented it out.
gaurav Oct 9, 2021
f1f9e57
Added diagnosis_date.
gaurav Oct 9, 2021
9b0aab1
Added basic related specimen.
gaurav Oct 9, 2021
559e50f
Added related specimen IDs.
gaurav Oct 9, 2021
0113cfc
Apparently we can get identifiers working if we add a system?!
gaurav Oct 9, 2021
b927c8c
Added a Diagnosis.subject (from GDC:Diagnosis.submitter_id)
gaurav Oct 9, 2021
15df4ff
Added some additional specimen fields.
gaurav Oct 9, 2021
4d458bd
Added creation times.
gaurav Oct 9, 2021
f695782
Added additional Specimen.creation_activity fields.
gaurav Oct 9, 2021
525acbf
Added time-between-excision-and-freezing.
gaurav Oct 9, 2021
decbe75
Added processing activity.
gaurav Oct 10, 2021
52d93d2
Added sample_submitter_id.
gaurav Oct 10, 2021
e68a536
Added submitter IDs.
gaurav Oct 10, 2021
f4b41d9
Added a link as per Python style requirements.
gaurav Oct 10, 2021
0a48864
Renamed test_input to clearly indicate that it is the GDC import only.
gaurav Oct 10, 2021
df39453
Renamed test_import_gdc.py to reflect that it's a transform.
gaurav Oct 10, 2021
5497766
Deleted redundant test_build.py test.
gaurav Oct 10, 2021
58dd11f
Optimized imports.
gaurav Oct 10, 2021
9b06ecd
Added and disambiguated submitter IDs and case IDs.
gaurav Oct 10, 2021
131b07b
Added case and submitter identifiers for diagnoses.
gaurav Oct 10, 2021
c03b292
Updated some comments.
gaurav Oct 10, 2021
1f0c961
Added diagnosis date from `year of diagnosis`.
gaurav Oct 10, 2021
297d71d
Added issue for approx date bug.
gaurav Oct 10, 2021
41a5bea
First stab at transforming PDC.
gaurav Oct 10, 2021
f8d8ab2
Tweaked outputs.
gaurav Oct 10, 2021
c928a1c
Updated PDC to pass testing.
gaurav Oct 11, 2021
d84c070
Centralized some shared code.
gaurav Oct 11, 2021
5758c49
Added a note about why we are truncating value_decimal.
gaurav Oct 11, 2021
94d5a35
Deliberately created a validation error for demo.
gaurav Oct 11, 2021
1103ce0
Reverting to truncating integers.
gaurav Oct 11, 2021
8719410
Added morphology.
gaurav Oct 12, 2021
c497f4e
Added morphology to PDC head and mouth example.
gaurav Nov 18, 2021
3a21bc9
Updated crdch-model to 1.2.
gaurav Jan 5, 2022
0b8b5fe
Updated expected JSONLD/TTL files.
gaurav Jan 5, 2022
5167062
Merge branch 'main' into add-pilot-examples
gaurav Jan 5, 2022
337e734
Reformatted source files with `black`.
gaurav Jan 5, 2022
fd588f1
Renamed GitHub workflow pytest.yaml to be accurate.
gaurav Jan 5, 2022
7114014
Added some README files to document content in these folders.
gaurav Jan 5, 2022
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yaml
@@ -1,4 +1,4 @@
-name: Test with pytest and flake8
+name: Test with pytest and black

on: [push]

5 changes: 5 additions & 0 deletions ccdh-pilot/README.md
@@ -0,0 +1,5 @@
This folder contains content created for the CCDH Pilot in September 2021.

It includes the two demonstrators developed for this Pilot as well as converters
that translate imported node data into the CRDC-H model as YAML, then validate
those transformed files using JSON Schema as well as LinkML Python data classes.
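
The transform-then-validate flow described in this README can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the converters added in this pull request: the input field names (`sample_id`, `sample_type`), the `transform_sample` and `check_specimen` helpers, and the `example:` identifier prefix are all hypothetical, and the hand-written structural check merely stands in for real JSON Schema and LinkML data class validation.

```python
# Hypothetical sketch: map a source-node record onto a CRDC-H-shaped Specimen
# dict, then run a hand-written structural check. The actual pipeline validates
# with JSON Schema and LinkML Python data classes instead.

ICDC_SYSTEM = "http://crdc.nci.nih.gov/icdc"  # placeholder URL used in the examples


def transform_sample(source: dict) -> dict:
    """Convert a (hypothetical) source record into a Specimen-like dict."""
    return {
        "id": f"example:{source['sample_id']}",
        "specimen_type": {
            "coding": [
                {
                    "code": source["sample_type"],
                    "label": source["sample_type"],
                    "system": ICDC_SYSTEM,
                    "tag": ["original"],
                }
            ]
        },
    }


def check_specimen(specimen: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    if not isinstance(specimen.get("id"), str):
        problems.append("id must be a string")
    codings = specimen.get("specimen_type", {}).get("coding", [])
    if not codings:
        problems.append("specimen_type needs at least one coding")
    for i, coding in enumerate(codings):
        for field in ("code", "system"):
            if field not in coding:
                problems.append(f"coding[{i}] is missing {field}")
    return problems


specimen = transform_sample({"sample_id": "30502", "sample_type": "Sample"})
problems = check_specimen(specimen)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a transformed file at once.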
2 changes: 2 additions & 0 deletions ccdh-pilot/demonstrator-1/README.md
@@ -0,0 +1,2 @@
Files in this directory were copied here from
https://github.com/cancerDHC/data-model-harmonization/tree/d83c2853ea6cdd8dfbb9204c2cd4c335969660c6/data-examples/f2f-2021-09-data-examples
342 changes: 342 additions & 0 deletions ccdh-pilot/demonstrator-1/d1_harmonized_gdc_specimen_cc.yaml

Large diffs are not rendered by default.

192 changes: 192 additions & 0 deletions ccdh-pilot/demonstrator-1/d1_harmonized_icdc_specimen_cc.yaml
@@ -0,0 +1,192 @@
icdc_specimen:
  Title: ICDC Specimen Example - Sept 2021 F2F Demonstrator 1
  Schema: ccdh.0713cc.Specimen
  Description:
    - "CRDC-H-compliant representation of the 'aggregate' ICDC source data example described here: https://docs.google.com/spreadsheets/d/14lWLDD7iyJG0G57m6BgWYIeqrxK04P3qQRwxziqEz1A/edit#gid=0"
    - "**FOR EASIEST VIEWING TURN WORD WRAP OFF SO THAT COMMENTS DON'T RUN ONTO A NEW LINE**"
    - "In this example enumerated values are captured using a CodeableConcept object which bundles one or more Coding objects, each holding a single concept code and supporting metadata."

  Example: # type = Specimen
    id: "f2f:052c671d-118a-11e9-afb9-0a9c39d33490" # Local uuid for the Specimen, as assigned by the system in which this record lives (here, the CCDH F2F Demonstrator, hence the 'f2f' prefix).
    # identifier: # type = Identifier (Identifier is a complex data type, but we are only showing a single field from this type)
    #   - value: "ncats-cop:NCATS-COP01-CCB050227 0103" # ICDC.Sample.sample_id (specified here as a CURIE, where the prefix indicates the system that assigned it)
    #   - value: "icdc-sample:30502" # ICDC.Sample._id (specified here as a CURIE, where the prefix indicates the system that assigned it)
    #   - value: "biosample:SAMEA1652206" # Proposed globally unique identifier format, per recommendation of the CRDC-DST (specified here as a CURIE, where the prefix indicates the system that assigned it)

    specimen_type: # type = CodeableConcept
      coding: # type = Coding
        - code: C84517 # The first Coding holds the harmonized code for this concept - the code for an NCIT term.
          label: Fresh Specimen # The preferred label of the term from NCIT.
          system: http://ncithesaurus.nci.nih.gov # A URL for the NCIT system (this is likely not the url we would ultimately use)
          tag: # A tag indicating this to be the harmonized code for this concept
            - harmonized
        - code: Sample # The second Coding holds the original code used by the source node. Here, this is the 'type' of the entity in ICDC (Sample).
          label: Sample # The human-readable label (implicitly the same as the code, which is human-readable)
          system: http://crdc.nci.nih.gov/icdc # A URL for the ICDC system (we made this up - t.b.d. what an official URL would look like)
          tag: # A 'tag' indicating this to be the code for this concept in the original source data.
            - original
    source_subject: # type = Subject
      id: "f2f:01b2691b-63d8-11e8-bcf1-0a2705229b82" # Local uuid assigned by this system for the Subject. The Subject instance is represented separately later in this document.
    source_material_type: # type = CodeableConcept (see comments on the 'specimen_type' field above for a detailed explanation of this data object's content)
      coding: # type = Coding
        - code: C12801 # The first Coding holds the harmonized code for this concept - the code for an NCIT term.
          label: Tissue # The preferred label of the term from NCIT.
          system: http://ncithesaurus.nci.nih.gov # A URL for the NCIT system (this is likely not the url we would ultimately use)
          tag: # A tag indicating this to be the harmonized code for this concept
            - harmonized
        - code: Tissue # The second Coding holds the original code used by the source node. ICDC uses readable strings for their codes.
          label: Tissue # The human-readable label (implicitly the same as the code, which is human-readable in ICDC)
          system: http://crdc.nci.nih.gov/icdc # A URL for the ICDC system (we made this up - t.b.d. what an official URL would look like)
          tag: # A 'tag' indicating this to be the code for this concept in the original source data.
            - original
    general_tissue_pathology: # type = CodeableConcept
      coding: # type = Coding
        - code: C14143
          label: Malignant
          system: http://ncithesaurus.nci.nih.gov
          tag:
            - harmonized
        - code: Malignant
          label: Malignant
          system: http://crdc.nci.nih.gov/icdc
          tag:
            - original
    specific_tissue_pathology: # type = CodeableConcept
      coding: # type = Coding
        - code: C9145
          label: Osteosarcoma
          system: http://ncithesaurus.nci.nih.gov
          tag:
            - harmonized
        - code: Osteosarcoma
          label: Osteosarcoma
          system: http://crdc.nci.nih.gov/icdc
          tag:
            - original
    tumor_status_at_collection: # type = CodeableConcept
      coding: # type = Coding
        - code: C3261
          label: Metastatic Neoplasm
          system: http://ncithesaurus.nci.nih.gov
          tag:
            - harmonized
        - code: Metastatic
          label: Metastatic
          system: http://crdc.nci.nih.gov/icdc
          tag:
            - original
    creation_activity: # type = SpecimenCreationActivity. This object encapsulates data related to how a specimen was created. It is inlined because this is not a stand-alone, identified entity.
      activity_type: # type = CodeableConcept. This field indicates that the activity instance here represents the initial collection from a source, rather than derivation from an existing specimen (e.g. via portioning or aliquoting).
        coding: # type = Coding
          - code: C93435
            label: Performed Specimen Collection
            system: http://ncithesaurus.nci.nih.gov
            tag:
              - harmonized
      date_ended: # type = TimePoint
        date_time: "2010-07-23"
      collection_site: # type = BodySite
        site: # type = CodeableConcept
          coding: # type = Coding
            - code: C12468
              label: Lung
              system: http://ncithesaurus.nci.nih.gov
              tag:
                - harmonized
            - code: Lung
              label: Lung
              system: http://crdc.nci.nih.gov/icdc
              tag:
                - original
    processing_activity: # type = SpecimenProcessingActivity. This object encapsulates data related to how a specimen was processed. It is inlined because this is not a stand-alone, identified entity.
      - activity_type: # type = CodeableConcept
          coding: # type = Coding
            - code: CC63521
              label: Quick Freeze
              system: http://ncithesaurus.nci.nih.gov
              tag:
                - harmonized
            - code: Snap Frozen
              label: Snap Frozen
              system: http://crdc.nci.nih.gov/icdc
              tag:
                - original

---
icdc_subject:
  Title: ICDC Subject Example - Sept 2021 F2F Demonstrator 1
  Schema: ccdh.0713.Subject
  Description:
    - "This is a minimal example showing only id and identifiers. For a richer Subject record, see the examples in Demonstrator 2."
  Example: # type: Subject
    id: "f2f:17029fd8-cd05-4089-8da3-52795823a647" # Local uuid assigned by this system for the Subject
    # identifier: # type: Identifier
    #   - value: "icdc-case:30196" # ICDC.Cases.case_id
    #   - value: "crdc:su0000003" # Proposed globally unique identifier per recommendation of the CRDC-DST

---
icdc_diagnosis:
  Title: ICDC Diagnosis Example - Sept 2021 F2F Demonstrator 1
  Schema: "ccdh.0713cc.Diagnosis"
  Description:
    - "CRDC-H-compliant representation of the synthetic source data example described here: https://docs.google.com/spreadsheets/d/14lWLDD7iyJG0G57m6BgWYIeqrxK04P3qQRwxziqEz1A/edit#gid=0"
    - "For easiest viewing, **TURN WORD WRAP OFF** so that comments don't run onto a new line."
  Example: # type = Diagnosis
    id: "f2f:de3468e83-4c86-4258-98d1-a986445cce73" # Local uuid assigned by this system for the Diagnosis
    # identifier: # type = Identifier
    #   - value: "icdc-diagnosis:30279" # ICDC.Diagnosis._id
    subject: # type = Subject
      id: "f2f:01b2691b-63d8-11e8-bcf1-0a2705229b82"
    condition: # type = CodeableConcept
      coding: # type = Coding
        - code: C9145
          label: Osteosarcoma
          system: http://ncithesaurus.nci.nih.gov
          tag:
            - harmonized
        - code: Osteosarcoma (OS)
          label: Osteosarcoma (OS)
          system: http://crdc.nci.nih.gov/icdc
          tag:
            - original
    # primary_site: # type = BodySite
    #   - site: # type = CodeableConcept
    #       coding: # type = Coding
    #         - code: C12366
    #           label: Bone
    #           system: http://ncithesaurus.nci.nih.gov
    #           tag:
    #             - harmonized
    #         - code: Bone
    #           label: Bone
    #           system: http://crdc.nci.nih.gov/icdc
    #           tag:
    #             - original
    # stage: # type = CancerStageObservationSet
    #   - observations: # type = CancerStageObservation - one of many individual Stage determinations (e.g. T, N, M, Overall) about the same tumor that can comprise a single CancerStageObservationSet (here just an 'Overall' assessment, but may also include T, N, and M assessments).
    #       - observation_type: # type = CodeableConcept (This Codeable Concept does not contain a second Coding for the original source code because the staging level semantic is captured in an explicit property in the source PDC model, not a key-value pair as shown here)
    #           coding: # type = Coding
    #             - code: C25605
    #               label: Overall
    #               system: http://ncithesaurus.nci.nih.gov
    #               tag:
    #                 - harmonized
    #         value_codeable_concept: # type = CodeableConcept
    #           coding: # type = Coding
    #             - code: C27971
    #               label: Stage IV
    #               system: http://ncithesaurus.nci.nih.gov
    #               tag:
    #                 - harmonized
    #             - code: IV
    #               label: IV
    #               system: http://crdc.nci.nih.gov/icdc
    #               tag:
    #                 - original
    diagnosis_date: # type = TimePoint
      date_time: "2013-06-14"
    related_specimen: # type = Specimen
      - id: "f2f:052c671d-118a-11e9-afb9-0a9c39d33490" # A reference to a specimen that was used to generate this Diagnosis (the Specimen instance above).
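
The harmonized/original Coding pairs used throughout this file can be read back with a small lookup helper. The sketch below assumes the record has already been parsed into plain Python dicts (for example by a YAML loader); `find_coding` is a hypothetical helper name, not part of the CRDC-H model or its generated classes.

```python
from typing import Optional

# specimen_type from the ICDC Specimen example, written out as a plain dict.
specimen_type = {
    "coding": [
        {
            "code": "C84517",
            "label": "Fresh Specimen",
            "system": "http://ncithesaurus.nci.nih.gov",
            "tag": ["harmonized"],
        },
        {
            "code": "Sample",
            "label": "Sample",
            "system": "http://crdc.nci.nih.gov/icdc",
            "tag": ["original"],
        },
    ]
}


def find_coding(concept: dict, tag: str) -> Optional[dict]:
    """Return the first Coding in a CodeableConcept that carries the given tag."""
    for coding in concept.get("coding", []):
        if tag in coding.get("tag", []):
            return coding
    return None


harmonized = find_coding(specimen_type, "harmonized")
original = find_coding(specimen_type, "original")
print(harmonized["code"], "<-", original["code"])  # prints: C84517 <- Sample
```

Looking codings up by tag rather than by list position avoids relying on the harmonized Coding always being listed first.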

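
Because the three records in this file refer to each other by `id`, one quick consistency check is to confirm that every reference resolves to a record defined in the same file. The sketch below is an illustration under assumptions: the `unresolved` helper is hypothetical, and the ids are copied from the examples as they appear here.

```python
# ids of the Specimen and Subject records defined in this file.
specimen_id = "f2f:052c671d-118a-11e9-afb9-0a9c39d33490"
subject_id = "f2f:17029fd8-cd05-4089-8da3-52795823a647"

# The references made by the Diagnosis record.
diagnosis = {
    "subject": {"id": "f2f:01b2691b-63d8-11e8-bcf1-0a2705229b82"},
    "related_specimen": [{"id": "f2f:052c671d-118a-11e9-afb9-0a9c39d33490"}],
}


def unresolved(references, known_ids):
    """Return the referenced ids that are not defined among known_ids."""
    return [ref for ref in references if ref not in known_ids]


known = {specimen_id, subject_id}
references = [diagnosis["subject"]["id"]]
references += [s["id"] for s in diagnosis["related_specimen"]]
missing = unresolved(references, known)
```

With the ids as shown, the specimen reference resolves, while the subject id used by the Specimen and Diagnosis records differs from the id given in the `icdc_subject` example; that may be intentional in a demonstrator file.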

