Put the competition dataset in {RAW_DATA_DIR}/rsna-breast-cancer-detection/ (default: datasets/raw/rsna-breast-cancer-detection/):
rsna-breast-cancer-detection
├── sample_submission.csv
├── test.csv
├── test_images
├── train.csv
└── train_images
Five external datasets are used. Summary:
Dataset | num_patients* | num_samples* | num_pos_samples* |
---|---|---|---|
VinDr-Mammo | 5000 | 20000 | 226 (1.13 %) |
MiniDDSM | 1952 | 7808 | 1480 (18.95 %) |
CMMD | 1775 | 5202 | 2632 (50.6 %) |
CDD-CESM | 326 | 1003 | 331 (33 %) |
BMCD | 82 | 328 | 22 (6.71 %) |
All | 9135 | 34341 | 4691 (13.66 %) |
* These numbers describe the processed data I used for this competition and may not reflect the characteristics of the original datasets.
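The percentages in the table are num_pos_samples divided by num_samples. As a quick sanity check, a minimal Python sketch can recompute them, together with the "All" row, from the per-dataset counts. The names `COUNTS`, `TOTALS`, and `positive_rate` are illustrative and not part of this repo.

```python
# Per-dataset counts copied from the summary table:
# name -> (num_patients, num_samples, num_pos_samples)
COUNTS = {
    "VinDr-Mammo": (5000, 20000, 226),
    "MiniDDSM": (1952, 7808, 1480),
    "CMMD": (1775, 5202, 2632),
    "CDD-CESM": (326, 1003, 331),
    "BMCD": (82, 328, 22),
}


def positive_rate(num_pos: int, num_samples: int) -> float:
    """Return the share of positive samples as a percentage."""
    return 100.0 * num_pos / num_samples


# Sum each column to rebuild the "All" row of the table.
TOTALS = tuple(sum(column) for column in zip(*COUNTS.values()))

if __name__ == "__main__":
    for name, (_, samples, pos) in COUNTS.items():
        print(f"{name}: {positive_rate(pos, samples):.2f} %")
    print(f"All: {TOTALS} -> {positive_rate(TOTALS[2], TOTALS[1]):.2f} %")
```

The large spread in positive rates (about 1 % for VinDr-Mammo versus about 50 % for CMMD) is worth keeping in mind when mixing these datasets.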
Follow each dataset's link above to download it, then place the datasets in {RAW_DATA_DIR}/ (default: datasets/raw/).
Notes:
- To download datasets hosted on https://wiki.cancerimagingarchive.net/, such as CMMD and CDD-CESM, you can install and use the NBIA Data Retriever command-line interface.
- The original xlsx metadata can be tricky to read with Python code, so I put the BMCD and CMMD csv label files in assets/data/. Their content matches the original files, with no hacks involved.
Restructure the raw datasets so that they look like this:
$ tree -L 2 datasets/raw
datasets/raw
├── bmcd
│ ├── Dataset
│ ├── Description.xlsx
│ ├── README.txt
│ ├── pwd_directory_structure.txt
│ └── pwd_structure.txt
├── cddcesm
│ ├── Low energy images of CDD-CESM
│ ├── Medical reports for cases .zip
│ ├── Radiology manual annotations.xlsx
│ ├── Radiology_hand_drawn_segmentations_v2.csv
│ ├── Subtracted images of CDD-CESM
│ ├── pwd_directory_structure.txt
│ └── pwd_structure.txt
├── cmmd
│ ├── CMMD
│ ├── CMMD_clinicaldata_revision.xlsx
│ ├── pwd_directory_structure.txt
│ └── pwd_structure.txt
├── miniddsm
│ ├── Data-MoreThanTwoMasks
│ ├── MINI-DDSM-Complete-JPEG-8
│ ├── MINI-DDSM-Complete-PNG-16
│ ├── pwd_directory_structure.txt
│ └── pwd_structure.txt
├── rsna-breast-cancer-detection
│ ├── sample_submission.csv
│ ├── stage1_images
│ ├── test.csv
│ ├── test_images
│ ├── train.csv
│ └── train_images
└── vindr
  ├── LICENSE.txt
  ├── SHA256SUMS.txt
  ├── breast-level_annotations.csv
  ├── finding_annotations.csv
  ├── images
  ├── metadata.csv
  ├── pwd_directory_structure.txt
  └── pwd_structure.txt
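A rough way to check the layout before moving on is to test that a few key entries from the listing exist. The sketch below is only an illustration: `EXPECTED` and `missing_entries` are hypothetical names, the entries are a hand-picked subset of the tree shown here, and the default path assumes RAW_DATA_DIR is left at datasets/raw.

```python
from pathlib import Path
from typing import List

# A hand-picked subset of the top-level entries from the tree listing.
# Extend it if you want a stricter check.
EXPECTED = {
    "bmcd": ["Dataset", "Description.xlsx"],
    "cddcesm": [
        "Low energy images of CDD-CESM",
        "Subtracted images of CDD-CESM",
        "Radiology manual annotations.xlsx",
    ],
    "cmmd": ["CMMD", "CMMD_clinicaldata_revision.xlsx"],
    "miniddsm": ["MINI-DDSM-Complete-JPEG-8", "MINI-DDSM-Complete-PNG-16"],
    "rsna-breast-cancer-detection": [
        "train.csv",
        "train_images",
        "test.csv",
        "test_images",
    ],
    "vindr": [
        "images",
        "breast-level_annotations.csv",
        "finding_annotations.csv",
        "metadata.csv",
    ],
}


def missing_entries(raw_dir: str = "datasets/raw") -> List[str]:
    """Return the expected entries that do not exist under raw_dir."""
    root = Path(raw_dir)
    return [
        f"{dataset}/{entry}"
        for dataset, entries in EXPECTED.items()
        for entry in entries
        if not (root / dataset / entry).exists()
    ]


if __name__ == "__main__":
    missing = missing_entries()
    print("Layout OK" if not missing else "Missing:\n" + "\n".join(missing))
```

This only checks names one level below each dataset directory; it does not look inside the image folders.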
For each external dataset, pre-generated pwd_directory_structure.txt and pwd_structure.txt files are included in this repo. Take a look at these files to make sure everything is correctly structured.
- pwd_directory_structure.txt: generated by find -L . -type d > pwd_directory_structure.txt
- pwd_structure.txt: generated by find -L . > pwd_structure.txt
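Instead of comparing the listings by eye, the comparison could be scripted. The following is a minimal sketch, assuming each pwd_structure.txt sits in its dataset directory as in the tree listing and contains one path per line in the `./relative/path` form that `find -L .` prints. The helpers `list_like_find` and `compare_with_listing` are hypothetical names, not part of this repo.

```python
import os
from pathlib import Path
from typing import Set, Tuple

# The listing files themselves are ignored on both sides of the comparison.
LISTING_FILES = {"./pwd_structure.txt", "./pwd_directory_structure.txt"}


def list_like_find(root: str) -> Set[str]:
    """Collect paths in the form `find -L .` prints when run inside root."""
    found = {"."}
    # followlinks=True mirrors the -L flag (follow symbolic links).
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel_dir = os.path.relpath(dirpath, root)
        for name in dirnames + filenames:
            found.add("./" + Path(rel_dir, name).as_posix())
    return found


def compare_with_listing(
    dataset_dir: str, listing_name: str = "pwd_structure.txt"
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, extra) paths relative to the pre-generated listing."""
    text = Path(dataset_dir, listing_name).read_text(encoding="utf-8")
    expected = {line for line in text.splitlines() if line} - LISTING_FILES
    actual = list_like_find(dataset_dir) - LISTING_FILES
    return expected - actual, actual - expected


if __name__ == "__main__":
    for name in ["bmcd", "cddcesm", "cmmd", "miniddsm", "vindr"]:
        dataset_dir = os.path.join("datasets/raw", name)
        if not os.path.isfile(os.path.join(dataset_dir, "pwd_structure.txt")):
            print(f"{name}: no pwd_structure.txt found, skipping")
            continue
        missing, extra = compare_with_listing(dataset_dir)
        print(f"{name}: {len(missing)} missing, {len(extra)} extra")
```

Paths reported as missing are the ones to investigate. Extra paths can be harmless, for example a label.csv copied in afterwards.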
Finally, copy the metadata files from assets/:
cp assets/data/bmcd_raw_label.csv datasets/raw/bmcd/label.csv
cp assets/data/cmmd_raw_label.csv datasets/raw/cmmd/label.csv