DATASETS

1. Competition datasets

Put the competition dataset in {RAW_DATA_DIR}/rsna-breast-cancer-detection/, which defaults to datasets/raw/rsna-breast-cancer-detection/, with the following layout:

rsna-breast-cancer-detection
├── sample_submission.csv 
├── test.csv 
├── test_images
├── train.csv
└── train_images
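
A quick way to confirm this layout is to check that each entry exists. The following is only a minimal sketch: `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names introduced for illustration, not part of this repo, and the path assumes the default location given above.

```python
from pathlib import Path

# Entries copied from the tree above.
EXPECTED_ENTRIES = [
    "sample_submission.csv",
    "test.csv",
    "test_images",
    "train.csv",
    "train_images",
]


def missing_entries(dataset_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Assumed default location; adjust if RAW_DATA_DIR points elsewhere.
    missing = missing_entries("datasets/raw/rsna-breast-cancer-detection")
    print("missing:", missing if missing else "none")
```

An empty result means every entry from the tree was found. The check only tests for existence, not for the contents of the image folders.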

2. External datasets

Five external datasets are used. Summary:

Dataset	num_patients*	num_samples*	num_pos_samples*
VinDr-Mammo	5000	20000	226 (1.13 %)
MiniDDSM	1952	7808	1480 (18.95 %)
CMMD	1775	5202	2632 (50.60 %)
CDD-CESM	326	1003	331 (33.00 %)
BMCD	82	328	22 (6.71 %)
All	9135	34341	4691 (13.66 %)

* The numbers may not reflect the characteristics of the original datasets; they describe the processed data I used for this competition.
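
The "All" row and the positive rates follow directly from the per-dataset counts. As a sanity check, the arithmetic can be sketched as below; the counts are copied from the table, while `TABLE`, `positive_rate` and `totals` are hypothetical names used only for this illustration.

```python
# (patients, samples, positive samples) per dataset, copied from the table above.
TABLE = {
    "VinDr-Mammo": (5000, 20000, 226),
    "MiniDDSM": (1952, 7808, 1480),
    "CMMD": (1775, 5202, 2632),
    "CDD-CESM": (326, 1003, 331),
    "BMCD": (82, 328, 22),
}


def positive_rate(samples, positives):
    """Percentage of positive samples, rounded to two decimals."""
    return round(100.0 * positives / samples, 2)


def totals(table):
    """Column-wise sums: (patients, samples, positive samples)."""
    patients = sum(row[0] for row in table.values())
    samples = sum(row[1] for row in table.values())
    positives = sum(row[2] for row in table.values())
    return patients, samples, positives


if __name__ == "__main__":
    for name, (_, samples, positives) in TABLE.items():
        print(f"{name}: {positive_rate(samples, positives):.2f} %")
    all_patients, all_samples, all_positives = totals(TABLE)
    print(f"All: {all_patients} patients, {all_samples} samples, "
          f"{all_positives} positives "
          f"({positive_rate(all_samples, all_positives):.2f} %)")
```

The overall positive rate is computed on the pooled samples (4691 / 34341), not as an average of the per-dataset rates.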

Follow each dataset's link above to download it, then put the datasets in {RAW_DATA_DIR}/, which defaults to datasets/raw/.

Notes:

To download datasets hosted on https://wiki.cancerimagingarchive.net/, such as CMMD and CDD-CESM, you can install and use the NBIA Data Retriever command-line interface.
Because the original xlsx metadata can be tricky to read with Python code, I put the BMCD and CMMD csv label files in assets/data/. Their content matches the original files; nothing hacky was done.

Restructure the raw datasets so that they look like this:

$ tree -L 2 datasets/raw

datasets/raw
├── bmcd
│   ├── Dataset
│   ├── Description.xlsx
│   ├── README.txt 
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── cddcesm
│   ├── Low energy images of CDD-CESM
│   ├── Medical reports for cases .zip
│   ├── Radiology manual annotations.xlsx
│   ├── Radiology_hand_drawn_segmentations_v2.csv
│   ├── Subtracted images of CDD-CESM
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── cmmd
│   ├── CMMD
│   ├── CMMD_clinicaldata_revision.xlsx
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── miniddsm
│   ├── Data-MoreThanTwoMasks
│   ├── MINI-DDSM-Complete-JPEG-8
│   ├── MINI-DDSM-Complete-PNG-16
│   ├── pwd_directory_structure.txt
│   └── pwd_structure.txt
├── rsna-breast-cancer-detection
│   ├── sample_submission.csv
│   ├── stage1_images
│   ├── test.csv
│   ├── test_images
│   ├── train.csv
│   └── train_images
└── vindr
    ├── LICENSE.txt
    ├── SHA256SUMS.txt
    ├── breast-level_annotations.csv
    ├── finding_annotations.csv
    ├── images
    ├── metadata.csv
    ├── pwd_directory_structure.txt
    └── pwd_structure.txt

For each external dataset, pre-generated pwd_directory_structure.txt and pwd_structure.txt files are included in this repo. Check these files to ensure everything is correctly structured.

pwd_directory_structure.txt: generated by find -L . -type d > pwd_directory_structure.txt
pwd_structure.txt: generated by find -L . > pwd_structure.txt
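
Comparing a local copy against these listings by eye is tedious for large datasets. The sketch below reports every path from a listing that is absent on disk. It assumes the listing sits in the dataset folder (as in the tree above) and that its paths are relative to that folder, which is what running `find -L .` from inside it would produce. `missing_paths` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path


def missing_paths(dataset_dir, listing_name="pwd_structure.txt"):
    """Return the paths from the listing that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    lines = (root / listing_name).read_text(encoding="utf-8").splitlines()
    # Each non-empty line is a path such as "./images/..." relative to root.
    return [line for line in lines if line.strip() and not (root / line).exists()]


if __name__ == "__main__":
    # Folder names taken from the tree above; the base path is the assumed default.
    for name in ["bmcd", "cddcesm", "cmmd", "miniddsm", "vindr"]:
        dataset_dir = Path("datasets/raw") / name
        if (dataset_dir / "pwd_structure.txt").is_file():
            print(name, "missing paths:", len(missing_paths(dataset_dir)))
        else:
            print(name, "has no pwd_structure.txt, skipped")
```

This only detects paths that are listed but absent. Extra local files are not reported, and passing `listing_name="pwd_directory_structure.txt"` restricts the check to directories.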

Finally, copy the label files from assets/data/:

cp assets/data/bmcd_raw_label.csv datasets/raw/bmcd/label.csv
cp assets/data/cmmd_raw_label.csv datasets/raw/cmmd/label.csv
