style changes / additional info for existing datasheets #2

Open: wants to merge 2 commits into base: main
16 changes: 7 additions & 9 deletions clinical-data-mining/ddp_id_mapping.md
@@ -4,23 +4,21 @@
<b>Table Type:</b> `Live` <br/>
<b>Last updated:</b> `2024-05-17` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/ddp_id_mapping.sql))</b>

`CDM NLP Processes` <br/>
|_ ["phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22id-mapping%22%2C%22ddp_id_mapping_pathology.tsv%22%5D) <br/>
|_ `"phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"` <br/>

<b>Summary Statistics:</b>

Total number of rows: 199,989 <br/>
Total number of unique patients: 101,377 <br/>
Total number of unique IMPACT sample_ids: 199,986 <br/>


1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Rules](#rules)

1. [Description ](#description)
2. [Assumptions ](#assumptions)
3. [Vocabulary \& Encoding ](#vocabulary--encoding)
4. [Notes ](#notes)

## Description <a name="description"></a>

@@ -42,7 +40,7 @@ Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadshee
| `SAMPLE_ID` | Identifies an IMPACT sample | ID | string |


## Rules <a name="rules"></a>
## Notes <a name="notes"></a>

1. MRNs must be zero padded to eight digits. (They are compared as strings, not integers.)
2. A single MRN can have multiple IMPACT samples associated with it.
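Note 1 above can be illustrated with a minimal sketch. The helper name `normalize_mrn` is hypothetical and not part of the datasheet or the repository; it assumes MRNs contain only ASCII digits and never exceed eight characters.

```python
def normalize_mrn(mrn) -> str:
    """Return an MRN as an eight-character, zero-padded string.

    Hypothetical helper: assumes MRNs are purely numeric and at most
    eight digits long, as the note above implies.
    """
    text = str(mrn).strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"MRN must contain only digits: {mrn!r}")
    if len(text) > 8:
        raise ValueError(f"MRN is longer than eight digits: {mrn!r}")
    # zfill left-pads with zeros; the result stays a string, never an int
    return text.zfill(8)


# Compared as strings, an unpadded MRN does not match its padded form
print("12345" == "00012345")
print(normalize_mrn(12345) == "00012345")
```

Normalizing both sides of a join this way before comparing avoids silent mismatches between padded and unpadded values.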
50 changes: 50 additions & 0 deletions clinical-data-mining/demographics.md
@@ -1 +1,51 @@
# Demographics

<b>Path:</b> `"phi_data_lake"."cdm-data".demographics."ddp_demographics.tsv"` <br/>
<b>Table Type:</b> `Live` <br/>
<b>Last updated:</b> `2024-05-17` <br/>

<b>Lineage: ([SQL](sql/demographics.sql))</b>

`CDM NLP Processes` <br/>
|_ `"phi_data_lake"."cdm-data"."id-mapping"."ddp_id_mapping_pathology.tsv"` <br/>

<b>Summary Statistics:</b>

Total number of rows: 121,855 <br/>
Total number of unique patients: 121,855 <br/>

1. [Description ](#description)
2. [Assumptions ](#assumptions)
3. [Vocabulary \& Encoding ](#vocabulary--encoding)
4. [Notes ](#notes)

## Description <a name="description"></a>

Provides a mapping between MRN and the patient's demographics.

## Assumptions <a name="assumptions"></a>

No known assumptions.


## Vocabulary & Encoding <a name="vocabulary"></a>

Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadsheets/d/1po0GdSwqmmXibz4e-7YvTPUbXpi0WYv3c2ImdHXxyuc/edit#gid=187767892)

| **Field name** | **Description** | **Field Type** | **Encoding** |
|---|---|---|---|
| `MRN` | Medical Record Number, a unique identifier per patient | ID | string |
| `PT_BIRTH_DTE` | Date of patient's birth | date | string |
| `PT_DEATH_DTE` | Date of patient's death | date | string |
| `MRN_CREATE_DTE` | Date MRN was assigned to patient(?) | date | string |
| `GENDER` | Gender of the patient | `MALE` or `FEMALE` | string |
| `MARITAL STATUS` | Marital status of the patient | `SINGLE`, `MARRIED`, `DIVORCED`, or `WIDOWED` | string |
| `RELIGIION` | Religion of the patient | | string |
| `RACE` | Race of the patient | | string |
| `ETHNICITY` | Ethnicity of the patient | | string |
| `CURRENT_AGE_DEID` | Age of the patient | age (in years) | string |


## Notes <a name="notes"></a>

1. If a date does not exist (for example, the death date of a patient who is alive), the field contains an empty string.
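A minimal sketch of how a consumer might handle the empty date fields described in the note above. The function name `parse_optional_date` is hypothetical, and the ISO `YYYY-MM-DD` date format is an assumption; the datasheet only says dates are stored as strings.

```python
from datetime import date
from typing import Optional


def parse_optional_date(raw: str) -> Optional[date]:
    """Parse a date field, returning None when the field is empty.

    Assumption: non-empty values use the ISO format YYYY-MM-DD.
    """
    text = raw.strip()
    if not text:
        # Empty text means the date does not exist (e.g. patient is alive)
        return None
    return date.fromisoformat(text)
```

Returning `None` rather than a sentinel date keeps missing values distinguishable from real ones in downstream comparisons.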
10 changes: 5 additions & 5 deletions clinical-data-mining/pathology_diagnoses.md
@@ -4,10 +4,10 @@
<b>Table Type:</b> `Live` <br/>
<b>Last updated:</b> `2024-07-10` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/pathology_diagnoses.sql))</b>

`CDM NLP Processes` <br/>
|_ ["phi_data_lake"."cdm-data".pathology."table_pathology_surgical_samples_parsed_specimen.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22pathology%22%2C%22table_pathology_surgical_samples_parsed_specimen.tsv%22%5D) <br/>
|_ `"phi_data_lake"."cdm-data".pathology."table_pathology_surgical_samples_parsed_specimen.tsv"` <br/>

<b>Summary Statistics:</b>

@@ -18,8 +18,8 @@ Total number of unique parts: 832,946 <br/>

1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Rules](#rules)
3. [Vocabulary \& Encoding](#vocabulary--encoding)
4. [Notes](#notes)


## Description <a name="description"></a>
@@ -48,7 +48,7 @@ Reference CDSI documentation - [CDM Codebook](https://docs.google.com/spreadshee



## Rules <a name="rules"></a>
## Notes <a name="notes"></a>

1. MRNs are not zero padded, so they should not be matched to MRNs in other tables.
2. A single MRN can have multiple IMPACT samples associated with it.
9 changes: 4 additions & 5 deletions clinical-data-mining/pathology_reports.md
@@ -4,22 +4,21 @@
<b>Table Type:</b> `Live` <br/>
<b>Last updated:</b> `2024-05-17` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/pathology_reports.sql))</b>

`CDM NLP Processes` <br/>
|_ ["phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22cdm-data%22%2C%22pathology%22%2C%22table_pathology_impact_sample_summary_dop_anno.tsv%22%5D) <br/>
|_ `"phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"` <br/>

<b>Summary Statistics:</b>

Total number of rows: 200,451 <br/>
Total number of unique patients: 101,605 <br/>
Total number of unique IMPACT sample_ids: 200,448 <br/>


1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Notes](#notes)
3. [Vocabulary \& Encoding](#vocabulary--encoding)
4. [Notes](#notes)


## Description <a name="description"></a>
1 change: 1 addition & 0 deletions clinical-data-mining/sql/ddp_id_mapping.sql
@@ -0,0 +1 @@
SELECT * FROM "ddp_id_mapping_pathology.tsv"
1 change: 1 addition & 0 deletions clinical-data-mining/sql/demographics.sql
@@ -0,0 +1 @@
SELECT * FROM "ddp_demographics.tsv"
1 change: 1 addition & 0 deletions clinical-data-mining/sql/pathology_diagnoses.sql
@@ -0,0 +1 @@
SELECT * FROM "table_pathology_surgical_samples_parsed_specimen.tsv"
1 change: 1 addition & 0 deletions clinical-data-mining/sql/pathology_reports.sql
@@ -0,0 +1 @@
SELECT * FROM "table_pathology_impact_sample_summary_dop_anno.tsv"
12 changes: 6 additions & 6 deletions hobbit/hobbit-casebreakdown-cleaned.md
@@ -4,11 +4,11 @@
<b>Table Type:</b> Live <br/>
<b>Last updated:</b> 2024-05-17 <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/hobbit-casebreakdown-cleaned.sql))</b>

HoBBit SQL Server <br/>
|_ ["hobbit-poc"."case_breakdown"](hobbit-casebreakdown.md) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;|_ ["hobbit-poc"."case_breakdown_cleaned"](https://tlvidreamcord1:9047/new_query?context=%22pathology-data-mining%22&queryPath=%5B%22pathology-data-mining%22%2C%22impact_slide%22%2C%22case_breakdown_cleaned%22%5D) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;|_ `"pathology-data-mining"."impact_slide"."case_breakdown_cleaned"` <br/>

<b>Summary Statistics:</b>

@@ -18,10 +18,10 @@ Total number of unique slides: 6,235,731 <br/>

Last updated July 1, 2024. (New slides are typically added weekly.)

1. [Description](#description)
1. [Description ](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Rules](#rules)
3. [Vocabulary \& Encoding](#vocabulary--encoding)
4. [Notes](#notes)

## Description <a name="description"></a>

@@ -42,6 +42,6 @@ For example, it is assumed there cannot be a slide with two different stain grou

See the datasheet for the parent [hobbit-case-breakdown](hobbit-casebreakdown.md) dataset.

## Rules
## Notes <a name="notes"></a>

See the datasheet for the parent [hobbit-case-breakdown](hobbit-casebreakdown.md) dataset.
8 changes: 4 additions & 4 deletions hobbit/hobbit-casebreakdown.md
@@ -5,10 +5,10 @@
<b>Table Type:</b> `Live` <br/>
<b>Last updated:</b> `2024-05-17` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/hobbit-casebreakdown.sql))</b>

`HoBBit SQL Server` <br/>
|_ ["hobbit-poc"."case_breakdown"](https://tlvidreamcord1:9047/new_query?context=%22hobbit-poc%22&queryPath=%5B%22hobbit-poc%22%2C%22case_breakdown%22%5D) <br/>
|_ `"hobbit-poc"."case_breakdown"` <br/>

<b>Summary Statistics:</b>

@@ -21,7 +21,7 @@ Total number of unique slides: 6,192,174 <br/>
1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Rules](#rules)
3. [Notes](#notes)


## Description <a name="description"></a>
@@ -88,7 +88,7 @@ The columns below are relevant to clinical operations and may not be useful for
| status_id | | ID | string | |
| captured_datatime | date and time when the image was captured by the scanner | date & time | datetime | |

# Rules <a name="rules"></a>
# Notes <a name="notes"></a>

1. Not all slides created at MSK are scanned and represented in this dataset.
2. Not all slides in this dataset can be used for research. About 1% of the slides cannot be de-identified and therefore cannot be used for research.
4 changes: 4 additions & 0 deletions hobbit/sql/hobbit-casebreakdown-cleaned.sql
@@ -0,0 +1,4 @@
-- This assumes that if two rows have the same image_id but different values in any other column, those rows and those image_ids are garbage and are being discarded
-- For example there can be two rows with the only difference being stain_type
select * from (select DISTINCT * from "hobbit-poc"."case_breakdown") where image_id not in
(select image_id from (select DISTINCT * from "hobbit-poc"."case_breakdown") GROUP BY image_id having count(image_id) <> 1)
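The filtering logic in the query above can be sketched in Python for readers who want to check it on a small sample. This is an illustration under stated assumptions, not the project's implementation: the function name `clean_case_breakdown` is hypothetical, rows are modeled as dictionaries with hashable values, and `image_id` is assumed never to be NULL (SQL `NOT IN` behaves differently when the subquery returns NULLs).

```python
from collections import Counter


def clean_case_breakdown(rows):
    """Mimic the cleaning query on a list of row dictionaries.

    1. Drop exact duplicate rows (SELECT DISTINCT *).
    2. Discard every image_id that still has more than one row,
       i.e. rows that share an image_id but differ in another column.
    """
    # Each row becomes a hashable tuple; dict.fromkeys keeps first-seen order
    distinct = list(dict.fromkeys(tuple(sorted(row.items())) for row in rows))
    # Count how many distinct rows remain per image_id
    counts = Counter(dict(row)["image_id"] for row in distinct)
    # Keep only image_ids that map to exactly one distinct row
    return [dict(row) for row in distinct if counts[dict(row)["image_id"]] == 1]


sample = [
    {"image_id": 1, "stain_type": "H&E"},
    {"image_id": 1, "stain_type": "H&E"},  # exact duplicate, kept once
    {"image_id": 2, "stain_type": "H&E"},
    {"image_id": 2, "stain_type": "IHC"},  # conflicting rows, both discarded
    {"image_id": 3, "stain_type": "IHC"},
]
cleaned = clean_case_breakdown(sample)
```

In this sample, `image_id` 1 survives as a single row, `image_id` 2 is removed entirely, and `image_id` 3 is untouched.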
1 change: 1 addition & 0 deletions hobbit/sql/hobbit-casebreakdown.sql
@@ -0,0 +1 @@
select cases.* from hobbit_prod.DMSKPWAP.tmp.case_breakdown cases
64 changes: 18 additions & 46 deletions pathology-data-mining/master_slide_inventory.md
@@ -1,63 +1,35 @@
# Master Slide Inventory

Last updated 2024-07-08
<b> Path:</b> <br/>
<b>Table Type:</b> `Live` <br/>
<b>Last updated 2024-07-08</b> <br/>

<b>Lineage: ([SQL](sql/master_slide_inventory.sql))</b>


<b>Summary Statistics:</b>

Total number of rows: 469,703 <br/>
Total number of unique slides: 461,184 <br/>
Total number of unique patients: 56,858 <br/>
Total number of samples: 65,892 <br/>

1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Rules](#rules)
2. [Vocabulary and Encoding](#vocabulary)
3. [Notes](#notes)

## Description <a name="description"></a>

### Motivation

This dataset lists the WSI data that we have on our local storage systems. (As of June
2024, that generally means `/gpfs/mskmind_emc/data_large/`) Each row of this table represents a
single slide and includes data such as the slide's id, project, magnification, cancer
type, and its storage location.

### How was this data put together?

### How should this data be used?

### Access
This dataset is available in Dremio at
`"pathology-data-mining"."master_slide_inventory.md"`

### How often is this data updated
This table is updated manually when new slides are received from the pathology department.
Currently, that's about once a week.


## Assumptions <a name="assumptions"></a>


## Vocabulary & Encoding <a name="vocabulary"></a>

See the datasheet for the parent [hobbit-case-breakdown](../hobbit/hobbit-casebreakdown.md) dataset.

## Rules <a name="rules"></a>


## Statistics

There are a total of 469,703 rows, corresponding to data from 65,892 samples from 56,858 patients. In total, there are 461,184 slides.

```
-- Row count
select count(*) FROM "pathology-data-mining"."impact_slide"."impact_slide"

-- sample Count
select count(DISTINCT(SAMPLE_ID)) FROM "pathology-data-mining"."impact_slide"."impact_slide"

-- patient Count
select count(DISTINCT(PATIENT_ID)) FROM "pathology-data-mining"."impact_slide"."impact_slide"

-- slide Count
select count(DISTINCT(IMAGE_ID)) FROM "pathology-data-mining"."impact_slide"."impact_slide"


```


## Notes <a name="notes"></a>


See the datasheet for the parent [hobbit-case-breakdown](../hobbit/hobbit-casebreakdown.md) dataset.
2 changes: 1 addition & 1 deletion pathology-data-mining/ocra/ocra_master_table.md
@@ -4,7 +4,7 @@
<b>Table Type:</b> `contains live datasets in lineage` <br/>
<b>Last updated:</b> `2024-07-06` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/ocra_master_table.sql))</b>

["pathology-data-mining".impact_slide.case_breakdown_cleaned](https://github.com/msk-mind/datasheets-for-datasets/blob/main/hobbit/hobbit-casebreakdown-cleaned.md) (as t1) <br/>
["phi_data_lake"."cdm-data".pathology."table_pathology_impact_sample_summary_dop_anno.tsv"](https://github.com/msk-mind/datasheets-for-datasets/blob/main/clinical-data-mining/pathology_reports.md) (as t2) <br/>
6 changes: 3 additions & 3 deletions pathology-data-mining/ocra/rachel_grisham_brca_cohort.md
@@ -4,10 +4,10 @@
<b>Table Type:</b> `Static` <br/>
<b>Last updated:</b> `2024-07-06` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/rachel_grisham_brca_cohort.sql))</b>

Dr. Rachel Grisham <br/>
|_ [phi_data_lake.ocra."HRD_Shah_cohort.csv"](https://tlvidreamcord1:9047/new_query?context=%22phi_data_lake%22&queryPath=%5B%22phi_data_lake%22%2C%22ocra%22%2C%22HRD_Shah_cohort.csv%22%5D) <br/>
|_ `phi_data_lake.ocra."HRD_Shah_cohort.csv"` <br/>

<b>Summary Statistics:</b>

@@ -18,7 +18,7 @@ Total number of unique patients: 105 <br/>
1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Notes](#notes)
4. [Notes](#notes)


## Description <a name="description"></a>
6 changes: 3 additions & 3 deletions pathology-data-mining/ocra/rachel_grisham_cohort.md
@@ -4,10 +4,10 @@
<b>Table Type:</b> `Static` <br/>
<b>Last updated:</b> `2024-05-17` <br/>

<b>Lineage:</b>
<b>Lineage: ([SQL](sql/rachel_grisham_cohort.sql))</b>

Dr. Rachel Grisham <br/>
|_ [OCRA."HRD_RG_data"](https://tlvidreamcord1:9047/new_query?context=%22OCRA%22&queryPath=%5B%22OCRA%22%2C%22HRD_RG_data%22%5D) <br/>
|_ `OCRA."HRD_RG_data"` <br/>

<b>Summary Statistics:</b>

@@ -18,7 +18,7 @@ Total number of unique patients: 426 <br/>
1. [Description](#description)
2. [Assumptions](#assumptions)
3. [Vocabulary and Encoding](#vocabulary)
3. [Notes](#notes)
4. [Notes](#notes)


## Description <a name="description"></a>