DATASETS

1. Competition datasets

Put the competition dataset in {RAW_DATA_DIR}/rsna-breast-cancer-detection/ (default: datasets/raw/rsna-breast-cancer-detection/).

rsna-breast-cancer-detection
├── sample_submission.csv 
├── test.csv 
├── test_images
├── train.csv
└── train_images
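
As a quick sanity check of the layout above, here is a minimal Python sketch that reports which of the listed entries are missing. The helper name `missing_entries` is hypothetical and not part of this repo; the entry names are copied from the tree above.

```python
from pathlib import Path

# Entries expected directly under rsna-breast-cancer-detection/,
# copied from the tree above.
EXPECTED = [
    "sample_submission.csv",
    "test.csv",
    "test_images",
    "train.csv",
    "train_images",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected entries (files or directories) not found under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_entries("datasets/raw/rsna-breast-cancer-detection")` should return an empty list once the dataset is in place, assuming the default location is used.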

2. External datasets

Five external datasets are used. Summary:

| Dataset | num_patients* | num_samples* | num_pos_samples* |
| --- | --- | --- | --- |
| VinDr-Mammo | 5000 | 20000 | 226 (1.13%) |
| MiniDDSM | 1952 | 7808 | 1480 (18.95%) |
| CMMD | 1775 | 5202 | 2632 (50.6%) |
| CDD-CESM | 326 | 1003 | 331 (33%) |
| BMCD | 82 | 328 | 22 (6.71%) |
| All | 9135 | 34341 | 4691 (13.66%) |

\* These numbers describe the processed data I used for this competition and may not reflect the characteristics of the original datasets.
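
The "All" row can be re-derived from the per-dataset rows. The sketch below is only an arithmetic check on the figures in the summary; the names `COUNTS`, `totals`, and `positive_rate` are illustrative and do not exist in this repo.

```python
# (num_patients, num_samples, num_pos_samples) copied from the summary above.
COUNTS = {
    "VinDr-Mammo": (5000, 20000, 226),
    "MiniDDSM": (1952, 7808, 1480),
    "CMMD": (1775, 5202, 2632),
    "CDD-CESM": (326, 1003, 331),
    "BMCD": (82, 328, 22),
}


def totals(counts):
    """Sum patients, samples, and positive samples over all datasets."""
    patients = sum(p for p, _, _ in counts.values())
    samples = sum(s for _, s, _ in counts.values())
    positives = sum(pos for _, _, pos in counts.values())
    return patients, samples, positives


def positive_rate(samples, positives):
    """Share of positive samples, in percent."""
    return 100.0 * positives / samples


if __name__ == "__main__":
    patients, samples, positives = totals(COUNTS)
    print(patients, samples, positives)
    print(f"{positive_rate(samples, positives):.2f} %")
```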

Follow each dataset's link above to download it, then put the datasets in {RAW_DATA_DIR}/ (default: datasets/raw/).

Notes:

  • To download datasets hosted on https://wiki.cancerimagingarchive.net/, such as CMMD and CDD-CESM, you can install and use the NBIA Data Retriever Command Line
  • The original xlsx metadata can be tricky to read with Python code, so I put CSV label files for BMCD and CMMD in assets/data/. Their content matches the original files; nothing hacky was done to them.

Restructure the raw datasets so that they look like this:

$ tree -L 2 datasets/raw

datasets/raw
├── bmcd
│   ├── Dataset
│   ├── Description.xlsx
│   ├── README.txt 
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── cddcesm
│   ├── Low energy images of CDD-CESM
│   ├── Medical reports for cases .zip
│   ├── Radiology manual annotations.xlsx
│   ├── Radiology_hand_drawn_segmentations_v2.csv
│   ├── Subtracted images of CDD-CESM
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── cmmd
│   ├── CMMD
│   ├── CMMD_clinicaldata_revision.xlsx
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── miniddsm
│   ├── Data-MoreThanTwoMasks
│   ├── MINI-DDSM-Complete-JPEG-8
│   ├── MINI-DDSM-Complete-PNG-16
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── rsna-breast-cancer-detection
│   ├── sample_submission.csv
│   ├── stage1_images
│   ├── test.csv
│   ├── test_images
│   ├── train.csv
│   └── train_images
└── vindr
    ├── LICENSE.txt
    ├── SHA256SUMS.txt
    ├── breast-level_annotations.csv
    ├── finding_annotations.csv
    ├── images
    ├── metadata.csv
    ├── pwd_directory_structure.txt
    └── pwd_structure.txt

For each external dataset, pre-generated pwd_directory_structure.txt and pwd_structure.txt files are included in this repo. Compare your local copy against these files to make sure everything is correctly structured.

  • pwd_directory_structure.txt: was generated by find -L . -type d > pwd_directory_structure.txt
  • pwd_structure.txt: was generated by find -L . > pwd_structure.txt
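
The comparison against pwd_structure.txt can be automated. The following is a minimal sketch, not a script shipped with this repo: it assumes the listing file sits at the root of each dataset directory (as in the tree above) and contains one `./relative/path` per line, which is what `find -L .` prints. The function names `list_paths` and `missing_paths` are hypothetical.

```python
import os
from pathlib import Path


def list_paths(root):
    """Return paths under root in the './a/b' form printed by `find -L .`."""
    root = Path(root)
    paths = {"."}
    # followlinks=True mirrors the -L flag (follow symbolic links).
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel = Path(dirpath).relative_to(root)
        for name in dirnames + filenames:
            paths.add("./" + (rel / name).as_posix())
    return paths


def missing_paths(root, listing_name="pwd_structure.txt"):
    """Return paths named in the listing file that do not exist under root."""
    root = Path(root)
    listing = (root / listing_name).read_text(encoding="utf-8")
    expected = {line for line in listing.splitlines() if line}
    return sorted(expected - list_paths(root))
```

For example, `missing_paths("datasets/raw/vindr")` should return an empty list when the local copy contains every path in the listing. Extra local files are ignored; only paths named in the listing are checked.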

Finally, copy the BMCD and CMMD label files from assets/:

cp assets/data/bmcd_raw_label.csv datasets/raw/bmcd/label.csv
cp assets/data/cmmd_raw_label.csv datasets/raw/cmmd/label.csv