msk-mind · darinmoore · Aug 29, 2024 · Sep 4, 2024
diff --git a/clinical-data-mining/ddp_id_mapping.md b/clinical-data-mining/ddp_id_mapping.md
@@ -4,23 +4,21 @@
 <b>Table Type:</b> `Live` <br/>
 <b>Late updated:</b> `2024-05-17` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/ddp_id_mapping.sql))</b> 
 
 `CDM NLP Processes` <br/>
-|_ ["phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22id-mapping%22%2C%22ddp_id_mapping_pathology.tsv%22%5D) <br/>
+|_ `"phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"` <br/>
 
 <b>Summary Statistics:</b>
 
 Total number of rows: 199,989 <br/>
 Total number of unique patients: 101,377 <br/>
 Total number of unique IMPACT sample_ids: 199,986 <br/>
 
-
-1. [Description](#description)
-2. [Assumptions](#assumptions)
-3. [Vocabulary and Encoding](#vocabulary)
-3. [Rules](#rules)
-
+1. [Description](#description)
+2. [Assumptions](#assumptions)
+3. [Vocabulary \& Encoding](#vocabulary--encoding)
+4. [Notes](#notes)
 
 ## Description <a name="description"></a>
 
@@ -42,7 +40,7 @@ Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadshee
 | `SAMPLE_ID` | Identifies an IMPACT sample  | ID | string |
 
 
-## Rules <a name="rules"></a>
+## Notes <a name="notes"></a>
 
 1. MRNs must be zero padded to eight digits. (They are compared as strings, not integers.)
 2. A single MRN can have multiple IMPACT samples associated with it.
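
Note 1 above (zero padding, string comparison) can be illustrated with a minimal Python sketch. The helper name `normalize_mrn` is hypothetical and is not part of this repository; it only shows why an unpadded value fails a string comparison against a padded one.

```python
def normalize_mrn(mrn, width=8):
    """Return the MRN as a zero-padded string of `width` digits.

    Hypothetical helper for illustration; `width=8` follows the note
    that MRNs must be zero padded to eight digits.
    """
    text = str(mrn).strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"MRN must contain only digits, got {mrn!r}")
    if len(text) > width:
        raise ValueError(f"MRN {mrn!r} is longer than {width} digits")
    return text.zfill(width)


# Compared as strings, the unpadded and padded forms differ until normalized.
print("12345" == "00012345")                 # False
print(normalize_mrn("12345") == "00012345")  # True
```

Normalizing both sides before a join avoids silent mismatches, since `"12345"` and `"00012345"` are different strings even though they are the same integer.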

diff --git a/clinical-data-mining/demographics.md b/clinical-data-mining/demographics.md
@@ -1 +1,51 @@
+# Demographics
 
+<b>Path:</b> `"phi_data_lake"."cdm-data".demographics."ddp_demographics.tsv"` <br/>
+<b>Table Type:</b> `Live` <br/>
+<b>Last updated:</b> `2024-05-17` <br/>
+
+<b>Lineage: ([SQL](sql/demographics.sql))</b> 
+
+`CDM NLP Processes` <br/>
+|_ `"phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"` <br/>
+
+<b>Summary Statistics:</b>
+
+Total number of rows: 121,855 <br/>
+Total number of unique patients: 121,855 <br/>
+
+1. [Description](#description)
+2. [Assumptions](#assumptions)
+3. [Vocabulary \& Encoding](#vocabulary)
+4. [Notes](#notes)
+
+## Description <a name="description"></a>
+
+Provides a mapping between each MRN and the patient's demographics.
+
+## Assumptions <a name="assumptions"></a>
+
+No known assumptions.
+
+
+## Vocabulary & Encoding <a name="vocabulary"></a>
+
+Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadsheets/d/1po0GdSwqmmXibz4e-7YvTPUbXpi0WYv3c2ImdHXxyuc/edit#gid=187767892)
+
+| **Field name** | **Description** | **Field Type** | **Encoding** |
+|---|---|---|---|
+| `MRN` | Medical Record Number, a unique identifier per patient  | ID | string |
+| `PT_BIRTH_DTE` | Date of patient's birth | date | string |
+| `PT_DEATH_DTE` | Date of patient's death  | date | string |
+| `MRN_CREATE_DTE` | Date MRN was assigned to patient(?)  | date | string |
+| `GENDER` | Gender of the patient | `MALE` or `FEMALE` | string |
+| `MARITAL STATUS` | Marital status of the patient | `SINGLE`, `MARRIED`, `DIVORCED`, or `WIDOWED` | string |
+| `RELIGION` | Religion of the patient |  | string |
+| `RACE` | Race of the patient |  | string |
+| `ETHNICITY` | Ethnicity of the patient |  | string |
+| `CURRENT_AGE_DEID` | Age of the patient | age (in years) | string |
+
+
+## Notes <a name="notes"></a>
+
+1. If a date does not exist (for example, the death date of a patient who is alive), the field contains an empty string.
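
The empty-date convention in the note above can be handled as in the following minimal sketch. The helper name `parse_optional_date` is hypothetical, and the ISO `YYYY-MM-DD` format is an assumption: the datasheet states only that date fields are encoded as strings.

```python
from datetime import date


def parse_optional_date(value):
    """Parse a date field, returning None when the field is empty.

    Assumes ISO formatting (YYYY-MM-DD); the actual format used in
    ddp_demographics.tsv is not stated in the datasheet.
    """
    text = value.strip()
    if text == "":
        return None  # the date does not exist, e.g. no death date
    return date.fromisoformat(text)


# Hypothetical row for illustration only.
row = {"MRN": "00012345", "PT_BIRTH_DTE": "1960-02-29", "PT_DEATH_DTE": ""}
death_date = parse_optional_date(row["PT_DEATH_DTE"])
print(death_date is None)  # True
```

Returning `None` rather than a sentinel date keeps missing values from being mistaken for real ones in later comparisons.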
diff --git a/clinical-data-mining/pathology_diagnoses.md b/clinical-data-mining/pathology_diagnoses.md
@@ -4,10 +4,10 @@
 <b>Table Type:</b> `Live` <br/>
 <b>Late updated:</b> `2024-07-10` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/pathology_diagnoses.sql))</b> 
 
 `CDM NLP Processes` <br/>
-|_ ["phi_data_lake"."cdm-data".pathology."table_pathology_surgical_samples_parsed_specimen.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22pathology%22%2C%22table_pathology_surgical_samples_parsed_specimen.tsv%22%5D) <br/>
+|_ `"phi_data_lake"."cdm-data".pathology."table_pathology_surgical_samples_parsed_specimen.tsv"` <br/>
 
 <b>Summary Statistics:</b>
 
@@ -18,8 +18,8 @@ Total number of unique parts: 832,946 <br/>
 
 1. [Description](#description)
 2. [Assumptions](#assumptions)
-3. [Vocabulary and Encoding](#vocabulary)
-3. [Rules](#rules)
+3. [Vocabulary \& Encoding](#vocabulary--encoding)
+4. [Notes](#notes)
 
 
 ## Description <a name="description"></a>
@@ -48,7 +48,7 @@ Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadshee
 
 
 
-## Rules <a name="rules"></a>
+## Notes <a name="notes"></a>
 
 1. MRNs are not zero padded, so they should not be matched to MRNs in other tables.
 2. A single MRN can have multiple IMPACT samples associated with it.

diff --git a/clinical-data-mining/pathology_reports.md b/clinical-data-mining/pathology_reports.md
@@ -4,22 +4,21 @@
 <b>Table Type:</b> `Live` <br/>
 <b>Late updated:</b> `2024-05-17` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/pathology_reports.sql))</b> 
 
 `CDM NLP Processes` <br/>
-|_ ["phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22pathology%22%2C%22table_pathology_impact_sample_summary_dop_anno.tsv%22%5D) <br/>
+|_ `"phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"` <br/>
 
 <b>Summary Statistics:</b>
 
 Total number of rows: 200,451 <br/>
 Total number of unique patients: 101,605 <br/>
 Total number of unique IMPACT sample_ids: 200,448 <br/>
 
-
 1. [Description](#description)
 2. [Assumptions](#assumptions)
-3. [Vocabulary and Encoding](#vocabulary)
-3. [Notes](#notes)
+3. [Vocabulary \& Encoding](#vocabulary--encoding)
+4. [Notes](#notes)
 
 
 ## Description <a name="description"></a>

diff --git a/clinical-data-mining/sql/ddp_id_mapping.sql b/clinical-data-mining/sql/ddp_id_mapping.sql
@@ -0,0 +1 @@
+SELECT * FROM "ddp_id_mapping_pathology.tsv"
diff --git a/clinical-data-mining/sql/demographics.sql b/clinical-data-mining/sql/demographics.sql
@@ -0,0 +1 @@
+SELECT * FROM "ddp_demographics.tsv"
diff --git a/clinical-data-mining/sql/pathology_diagnoses.sql b/clinical-data-mining/sql/pathology_diagnoses.sql
@@ -0,0 +1 @@
+SELECT * FROM "table_pathology_surgical_samples_parsed_specimen.tsv"
diff --git a/clinical-data-mining/sql/pathology_reports.sql b/clinical-data-mining/sql/pathology_reports.sql
@@ -0,0 +1 @@
+SELECT * FROM "table_pathology_impact_sample_summary_dop_anno.tsv"
diff --git a/hobbit/hobbit-casebreakdown-cleaned.md b/hobbit/hobbit-casebreakdown-cleaned.md
@@ -4,11 +4,11 @@
 <b>Table Type:</b> Live <br/>
 <b>Late updated:</b> 2024-05-17 <br/>
 
-<b>Lineage:</b>
+<b>Lineage: ([SQL](sql/hobbit-casebreakdown-cleaned.sql))</b>
 
 HoBBit SQL Server <br/>
 |_ ["hobbit-poc"."case_breakdown"](hobbit-casebreakdown.md) <br/>
-&nbsp;&nbsp;&nbsp;&nbsp;|_ ["hobbit-poc"."case_breakdown_cleaned"](https://tlvidreamcord1:9047/new_query?context=%22pathology-data-mining%22&queryPath=%5B%22pathology-data-mining%22%2C%22impact_slide%22%2C%22case_breakdown_cleaned%22%5D) <br/>
+&nbsp;&nbsp;&nbsp;&nbsp;|_ `"pathology-data-mining"."impact_slide"."case_breakdown_cleaned"` <br/>
 
 <b>Summary Statistics:</b>
 
@@ -18,10 +18,10 @@ Total number of unique slides: 6,235,731 <br/>
 
 Last updated July 1, 2024.  (New slides are typically added weekly.)
 
-1. [Description](#description)
+1. [Description ](#description)
 2. [Assumptions](#assumptions)
-3. [Vocabulary and Encoding](#vocabulary)
-3. [Rules](#rules)
+3. [Vocabulary \& Encoding](#vocabulary--encoding)
+4. [Notes](#notes)
 
 ## Description <a name="description"></a>
 
@@ -42,6 +42,6 @@ For example, it is assumed there cannot be a slide with two different stain grou
 
 See the datasheet for the parent [hobbit-case-breakdown](hobbit-casebreakdown.md) dataset. 
 
-## Rules
+## Notes <a name="notes"></a>
 
 See the datasheet for the parent [hobbit-case-breakdown](hobbit-casebreakdown.md) dataset. 
diff --git a/hobbit/hobbit-casebreakdown.md b/hobbit/hobbit-casebreakdown.md
@@ -5,10 +5,10 @@
 <b>Table Type:</b> `Live` <br/>
 <b>Late updated:</b> `2024-05-17` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/hobbit-casebreakdown.sql))</b> 
 
 `HoBBit SQL Server` <br/>
-|_ ["hobbit-poc"."case_breakdown"](https://tlvidreamcord1:9047/new_query?context=%22hobbit-poc%22&queryPath=%5B%22hobbit-poc%22%2C%22case_breakdown%22%5D) <br/>
+|_ `"hobbit-poc"."case_breakdown"` <br/>
 
 <b>Summary Statistics:</b>
 
@@ -21,7 +21,7 @@ Total number of unique slides: 6,192,174 <br/>
 1. [Description](#description)
 2. [Assumptions](#assumptions)
 3. [Vocabulary and Encoding](#vocabulary)
-3. [Rules](#rules)
+4. [Notes](#notes)
 
 
 ## Description <a name="description"></a>
@@ -88,7 +88,7 @@ The columns below are relevant to clinical operations and may not be useful for
 | status_id | | ID | string | |
 | captured_datatime | date and time when the image was captured by the scanner | date & time | datetime | |
 
-# Rules <a name="rules"></a>
+## Notes <a name="notes"></a>
 
 1. Not all slides created at MSK are scanned and represented in this dataset.
 2. Not all slides in this dataset can be used for research. About 1% of the slides cannot be de-identified and therefore cannot be used for research.

diff --git a/hobbit/sql/hobbit-casebreakdown-cleaned.sql b/hobbit/sql/hobbit-casebreakdown-cleaned.sql
@@ -0,0 +1,4 @@
+-- Assumes that rows sharing an image_id but differing in any other column are unreliable; those rows, and their image_ids, are discarded.
+-- For example, two rows may differ only in stain_type.
+select * from (select DISTINCT * from "hobbit-poc"."case_breakdown") where image_id not in 
+ (select image_id from (select DISTINCT * from "hobbit-poc"."case_breakdown") GROUP BY image_id having count(image_id) <> 1)
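
The filtering logic in the query above can be sketched in plain Python for readers less familiar with the nested `DISTINCT` / `GROUP BY ... HAVING` pattern. This is an illustration only, not the production implementation: the function name `clean_case_breakdown` and the sample rows are hypothetical, and every row is assumed to have a non-null, hashable `image_id`.

```python
from collections import Counter


def clean_case_breakdown(rows):
    """Mimic the cleaning query on a list of dict rows.

    Step 1 drops exact duplicate rows (SELECT DISTINCT *).
    Step 2 drops every image_id that still appears more than once,
    i.e. rows that share an image_id but differ in another column.
    """
    distinct_rows = []
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent row signature
        if key not in seen:
            seen.add(key)
            distinct_rows.append(row)

    counts = Counter(row["image_id"] for row in distinct_rows)
    return [row for row in distinct_rows if counts[row["image_id"]] == 1]


sample = [
    {"image_id": 1, "stain_type": "H&E"},
    {"image_id": 1, "stain_type": "H&E"},  # exact duplicate: kept once
    {"image_id": 2, "stain_type": "H&E"},
    {"image_id": 2, "stain_type": "IHC"},  # conflicting rows: both dropped
    {"image_id": 3, "stain_type": "IHC"},
]
cleaned = clean_case_breakdown(sample)
print([row["image_id"] for row in cleaned])  # [1, 3]
```

Exact duplicates survive as a single row, while an `image_id` with conflicting rows is removed entirely rather than resolved in favor of one row.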
diff --git a/hobbit/sql/hobbit-casebreakdown.sql b/hobbit/sql/hobbit-casebreakdown.sql
@@ -0,0 +1 @@
+select cases.* from hobbit_prod.DMSKPWAP.tmp.case_breakdown cases
diff --git a/pathology-data-mining/master_slide_inventory.md b/pathology-data-mining/master_slide_inventory.md
@@ -1,63 +1,35 @@
 # Master Slide Inventory
 
-Last updated 2024-07-08
+<b>Path:</b>  <br/> 
+<b>Table Type:</b> `Live` <br/> 
+<b>Last updated:</b> `2024-07-08` <br/> 
+
+<b>Lineage: ([SQL](sql/master_slide_inventory.sql))</b>
+
+
+<b>Summary Statistics:</b>
+
+Total number of rows: 469,703 <br/>
+Total number of unique slides: 461,184 <br/>
+Total number of unique patients: 56,858 <br/>
+Total number of samples: 65,892 <br/>
 
 1. [Description](#description)
-2. [Assumptions](#assumptions)
-3. [Vocabulary and Encoding](#vocabulary)
-3. [Rules](#rules)
+2. [Vocabulary and Encoding](#vocabulary)
+3. [Notes](#notes)
 
 ## Description <a name="description"></a>
 
-### Motivation
-
 This dataset lists the WSI data that we have on our local storage systems.  (As of June
 2024, that generally means `/gpfs/mskmind_emc/data_large/`)  Each row of this table represents a
 single slide and includes data such as the slide's id, project, magnification, cancer
 type, and its storage location.
 
-### How was this data put together?
-
-### How should this data be used?
-
-### Access
-This dataset is available in Dremio at
-`"pathology-data-mining"."master_slide_inventory.md"`
-
-### How often is this data updated
-This table is updated manually when new slides are received from the pathology department.
-Currently, that's about once a week.
-
-
-## Assumptions <a name="assumptions"></a>
-
-
 ## Vocabulary & Encoding <a name="vocabulary"></a>
 
+See the datasheet for the parent [hobbit-case-breakdown](../hobbit/hobbit-casebreakdown.md) dataset. 
 
-## Rules <a name="rules"></a>
-
-
-## Statistics
-
-There are a total of 469,703 rows, corresponding to data from 65,892 samples from 56,858 patients. In total, there are 461,184 slides. 
-
-```
--- Row count
-select count(*)  FROM "pathology-data-mining"."impact_slide"."impact_slide"
-
--- sample Count
-select count(DISTINCT(SAMPLE_ID))  FROM "pathology-data-mining"."impact_slide"."impact_slide"
-
--- patient Count
-select count(DISTINCT(PATIENT_ID))  FROM "pathology-data-mining"."impact_slide"."impact_slide"
-
--- slide Count
-select count(DISTINCT(IMAGE_ID))  FROM "pathology-data-mining"."impact_slide"."impact_slide"
-
-
-```
-
-
+## Notes <a name="notes"></a>
 
 
+See the datasheet for the parent [hobbit-case-breakdown](../hobbit/hobbit-casebreakdown.md) dataset. 
diff --git a/pathology-data-mining/ocra/ocra_master_table.md b/pathology-data-mining/ocra/ocra_master_table.md
@@ -4,7 +4,7 @@
 <b>Table Type:</b> `contains live datasets in lineage` <br/>
 <b>Last updated:</b> `2024-07-06` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/ocra_master_table.sql))</b> 
 
 ["pathology-data-mining".impact_slide.case_breakdown_cleaned](https://github.com/msk-mind/datasheets-for-datasets/blob/main/hobbit/hobbit-casebreakdown-cleaned.md) (as t1) <br/>
 ["phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"](https://github.com/msk-mind/datasheets-for-datasets/blob/main/clinical-data-mining/pathology_reports.md) (as t2) <br/>

diff --git a/pathology-data-mining/ocra/rachel_grisham_brca_cohort.md b/pathology-data-mining/ocra/rachel_grisham_brca_cohort.md
@@ -4,10 +4,10 @@
 <b>Table Type:</b> `Static` <br/>
 <b>Last updated:</b> `2024-07-06` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/rachel_grisham_brca_cohort.sql))</b> 
 
 Dr. Rachel Grisham <br/>
-|_ [phi_data_lake.ocra."HRD_Shah_cohort.csv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22ocra%22%2C%22HRD_Shah_cohort.csv%22%5D) <br/>
+|_ `phi_data_lake.ocra."HRD_Shah_cohort.csv"` <br/>
 
 <b>Summary Statistics:</b>
 
@@ -18,7 +18,7 @@ Total number of unique patients: 105 <br/>
 1. [Description](#description)
 2. [Assumptions](#assumptions)
 3. [Vocabulary and Encoding](#vocabulary)
-3. [Notes](#notes)
+4. [Notes](#notes)
 
 
 ## Description <a name="description"></a>

diff --git a/pathology-data-mining/ocra/rachel_grisham_cohort.md b/pathology-data-mining/ocra/rachel_grisham_cohort.md
@@ -4,10 +4,10 @@
 <b>Table Type:</b> `Static` <br/>
 <b>Late updated:</b> `2024-05-17` <br/>
 
-<b>Lineage:</b> 
+<b>Lineage: ([SQL](sql/rachel_grisham_cohort.sql))</b> 
 
 Dr. Rachel Grisham <br/>
-|_ [OCRA."HRD_RG_data"](https://tlvidreamcord1:9047/new_query?context=%22OCRA%22&queryPath=%5B%22OCRA%22%2C%22HRD_RG_data%22%5D) <br/>
+|_ `OCRA."HRD_RG_data"` <br/>
 
 <b>Summary Statistics:</b>
 
@@ -18,7 +18,7 @@ Total number of unique patients: 426 <br/>
 1. [Description](#description)
 2. [Assumptions](#assumptions)
 3. [Vocabulary and Encoding](#vocabulary)
-3. [Notes](#notes)
+4. [Notes](#notes)
 
 
 ## Description <a name="description"></a>