This folder contains examples of the processed datasets and the proposed prompts (train_prompts_all.json).

We resized each image so that its shorter side measures 512 pixels while preserving the aspect ratio.

Pre-training datasets

MIMIC-CXR

Download mimic-cxr-jpg from PhysioNet.
Extract the "findings" and "impression" sections as a list of texts.
Split into train/val/test sets with mimic-cxr-2.0.0-split.csv.gz.
To use multiple views, get the view position from mimic-cxr-2.0.0-metadata.csv.gz.
Augment the text with text_augmentation/back_translation.py.
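
The section-extraction step above could be sketched as follows. This is a minimal illustration, not the preprocessing actually used here: it assumes that section headers appear in capitals at the start of a line followed by a colon (for example "FINDINGS:"), and the names SECTION_HEADER and extract_sections are hypothetical.

```python
import re

# Assumed header pattern: a capitalised section name at the start of a line,
# followed by a colon. Real reports vary, so this may need adjusting.
SECTION_HEADER = re.compile(r"^\s*([A-Z][A-Z /]+):", re.MULTILINE)


def extract_sections(report, wanted=("FINDINGS", "IMPRESSION")):
    """Return the text of each wanted section that is present and non-empty."""
    matches = list(SECTION_HEADER.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        name = match.group(1).strip()
        # A section body runs until the next header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[name] = " ".join(body.split())  # collapse line breaks and spaces
    return [sections[name] for name in wanted if sections.get(name)]


example = """
                                 FINAL REPORT
 INDICATION:  Cough.

 FINDINGS:  The lungs are clear.  No pleural effusion
 or pneumothorax.

 IMPRESSION:  No acute cardiopulmonary process.
"""
texts = extract_sections(example)
```

Reports that lack one of the two sections simply yield a shorter list.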

The final csv file contains the following columns.

index	image	view	AP	PA	Lateral	text	text_augment
0	List of image_path	List of views	List of AP image_path	List of PA image_path	List of Lateral image_path	List of [findings, impression]	Result of backtranslation of text
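
One way to fill the view columns above is to group images by study and sort their paths by view position. The sketch below assumes one output row per study, a hypothetical "path" field for each image, and a mapping in which both "LATERAL" and "LL" count as lateral views; none of these are confirmed by this README.

```python
from collections import defaultdict

# Hypothetical rows: "study_id" and "ViewPosition" mimic metadata columns,
# while "path" is an illustrative field that is not part of the metadata file.
rows = [
    {"study_id": "s1", "path": "files/p10/s1/a.jpg", "ViewPosition": "PA"},
    {"study_id": "s1", "path": "files/p10/s1/b.jpg", "ViewPosition": "LATERAL"},
    {"study_id": "s2", "path": "files/p11/s2/c.jpg", "ViewPosition": "AP"},
    {"study_id": "s2", "path": "files/p11/s2/d.jpg", "ViewPosition": "SWIMMERS"},
]

# Assumed mapping from raw view positions to the three view columns.
VIEW_GROUPS = {"AP": "AP", "PA": "PA", "LATERAL": "Lateral", "LL": "Lateral"}


def group_by_study(rows):
    """Collect image paths per study, overall and per view group."""
    studies = defaultdict(
        lambda: {"image": [], "view": [], "AP": [], "PA": [], "Lateral": []}
    )
    for row in rows:
        group = VIEW_GROUPS.get(row["ViewPosition"])
        if group is None:
            continue  # skip views outside the three groups
        entry = studies[row["study_id"]]
        entry["image"].append(row["path"])
        entry["view"].append(group)
        entry[group].append(row["path"])
    return dict(studies)


studies = group_by_study(rows)
```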

CheXpert

Download the dataset from CheXpert.
Get chexpert_5x200.csv from the GLoRIA storage.

Exclude the patients in chexpert_5x200.csv from the training set as follows.

import os
import pandas as pd

WORK_DIR = "/path/to/load/chexpert"
df_train = pd.read_csv(os.path.join(WORK_DIR, "CheXpert-v1.0", "train.csv"))
df_test = pd.read_csv(os.path.join(WORK_DIR, "chexpert_5x200.csv"))

df_train["patient_id"] = df_train["Path"].apply(lambda x: x.split("/")[-3])
df_test["patient_id"] = df_test["Path"].apply(lambda x: x.split("/")[-3])
print(f"# image : {len(df_train)}, # patient : {df_train['patient_id'].nunique()}")  # # image : 223414, # patient : 64540
print(f"# image : {len(df_test)}, # patient : {df_test['patient_id'].nunique()}")  # # image : 1000, # patient : 966

df_train = df_train[~df_train["patient_id"].isin(df_test["patient_id"].tolist())]
print(f"# image : {len(df_train)}, # patient : {df_train['patient_id'].nunique()}")  # # image : 216478, # patient : 63574

Get text_label as a triplet [[positives], [negatives], [uncertain]].
To evaluate on chexpert_5x200, change Path to match your data path.
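
The triplet can be built from the CheXpert label columns, where 1 marks a positive finding, 0 a negative one, -1 an uncertain one, and a blank an unmentioned one. The sketch below is an illustration under that convention; the function name to_text_label and the shortened CLASSES list are hypothetical.

```python
# Illustrative subset of the CheXpert observation columns.
CLASSES = ["Atelectasis", "Cardiomegaly", "Edema", "Consolidation"]


def to_text_label(row, classes=CLASSES):
    """Split class names into [[positives], [negatives], [uncertain]]."""
    positives, negatives, uncertain = [], [], []
    for name in classes:
        value = row.get(name)
        if value is None or value == "":
            continue  # blank: the observation is not mentioned
        value = float(value)
        if value == 1.0:
            positives.append(name)
        elif value == 0.0:
            negatives.append(name)
        elif value == -1.0:
            uncertain.append(name)
        # NaN (for example from pandas) matches none of the above and is skipped
    return [positives, negatives, uncertain]


row = {"Atelectasis": "1.0", "Cardiomegaly": "0.0", "Edema": "-1.0", "Consolidation": ""}
text_label = to_text_label(row)
```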

The final csv file contains the following columns.

index	image	view	AP	PA	Lateral	text_label
0	List of image_path	List of views	List of AP image_path	List of PA image_path	List of Lateral image_path	[[positive labels], [negative labels], [uncertain labels]]

Chest-Xray14

Download the dataset from ChestXray-NIHCC.
Get the image labels from Data_Entry_2017_v2020.csv.

Split train/valid as follows.

import os
import pandas as pd

from sklearn.model_selection import train_test_split

WORK_DIR = "/path/to/load/chest14"
df = pd.read_csv(os.path.join(WORK_DIR, "Data_Entry_2017_v2020.csv"))
df.set_index("Image Index", inplace=True)

files_train = []
with open(os.path.join(WORK_DIR, "train_val_list.txt"), "r") as file:
    for line in file.readlines():
        filename = line.replace("\n", "")
        files_train.append(filename)
df_train = df.loc[files_train, :]
df_train.reset_index(drop=False, inplace=True)

unique_ids = df_train["Patient ID"].unique()
train_ids, valid_ids = train_test_split(unique_ids, test_size=0.2, random_state=0)

df_train = df_train.set_index("Patient ID")
df_train, df_valid = df_train.loc[train_ids, :], df_train.loc[valid_ids, :]
df_train.reset_index(drop=False, inplace=True)
df_valid.reset_index(drop=False, inplace=True)

df_train.to_csv(os.path.join(WORK_DIR, "chest14_train.csv"))
df_valid.to_csv(os.path.join(WORK_DIR, "chest14_valid.csv"))

The final csv file contains the following columns.

index	image	text_label
0	List of image_path	[positive labels]
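
The text_label column can be derived from the "Finding Labels" column of Data_Entry_2017_v2020.csv, which joins multiple findings with "|". The sketch below assumes that "No Finding" maps to an empty list of positive labels; that choice and the function name parse_finding_labels are assumptions made for illustration.

```python
def parse_finding_labels(finding_labels):
    """Turn a "|"-separated label string into a list of positive labels."""
    labels = [label.strip() for label in finding_labels.split("|")]
    # Assumption: "No Finding" means there are no positive labels.
    return [label for label in labels if label and label != "No Finding"]


multi = parse_finding_labels("Cardiomegaly|Effusion")
normal = parse_finding_labels("No Finding")
```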

Downstream datasets

VinDr-CXR

Download VinDr-CXR from PhysioNet.

dcm2png

import numpy as np

import cv2
import pydicom
from pydicom.pixel_data_handlers.util import apply_modality_lut, apply_voi_lut


def resize_and_save(load_path, save_path):  # load_path=/path/to/load/*.dicom, save_path=/path/to/save/*.jpg
    ds = pydicom.dcmread(load_path, force=True)
    img = ds.pixel_array
    img = apply_modality_lut(img, ds)  # rescaleSlope & intercept
    img = apply_voi_lut(img, ds)  # windowing
    if hasattr(ds, "PhotometricInterpretation"):
        if ds.PhotometricInterpretation.lower().strip() == "monochrome1":
            img = img.max() - img  # invert
    
    h, w = img.shape
    ratio = 512 / min(h, w)
    target_size = (int(w * ratio), int(h * ratio))
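    # interpolation must be passed by keyword; the third positional argument of cv2.resize is dst
    img = cv2.resize(img, target_size, interpolation=cv2.INTER_LANCZOS4)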
   
    # normalize
    img = (img - img.min()) / (img.max() - img.min()) * np.iinfo(np.uint8).max
    img = img.astype(np.uint8)
    cv2.imwrite(save_path, img)

Get the labels from image_labels_{train/test}.csv.

Create a validation set from the training set.

import os
import pandas as pd

from sklearn.model_selection import train_test_split

WORK_DIR = "/path/to/load/vindr-cxr"

df = pd.read_csv(os.path.join(WORK_DIR, "1.0.0", "annotations", "image_labels_train.csv"))

df = df.groupby("image_id").agg("sum")  # sum the annotations of all radiologists per image
df.loc[:, "Aortic enlargement":"No finding"] = (df.loc[:, "Aortic enlargement":"No finding"] > 0).astype(int)
df.reset_index(drop=False, inplace=True)

df_train, df_valid = train_test_split(df, test_size=0.2, random_state=0)

df_train.to_csv(os.path.join(WORK_DIR, "vindr_train.csv"))
df_valid.to_csv(os.path.join(WORK_DIR, "vindr_valid.csv"))

The final csv file contains the following columns.

index	image	label	class
0	image_path	List of label value (0 or 1)	List of positive classname
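
The label and class columns can be derived from one aggregated row of binary annotations. The sketch below is illustrative only: CLASS_NAMES is a shortened, hypothetical subset of the annotation columns, and to_label_and_class is not a function from this repository.

```python
# Illustrative subset of the annotation columns in image_labels_train.csv.
CLASS_NAMES = ["Aortic enlargement", "Cardiomegaly", "No finding"]


def to_label_and_class(row, class_names=CLASS_NAMES):
    """Return the 0/1 label list and the names of the positive classes."""
    label = [int(row[name]) for name in class_names]
    positive = [name for name, value in zip(class_names, label) if value == 1]
    return label, positive


row = {"image_id": "abc", "Aortic enlargement": 1, "Cardiomegaly": 0, "No finding": 0}
label, classes = to_label_and_class(row)
```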

RSNA Pneumonia

Download the RSNA dataset from Kaggle.
dcm2png: see the details under VinDr-CXR.
Get the label from stage_2_train_labels.csv.
Get the class from stage_2_detailed_class_info.csv: if "class" is "Lung Opacity", the class is "Pneumonia"; otherwise it is "Normal".
Split train/valid/test following the GLoRIA preprocessing code.

The final csv file contains the following columns.

index	image	label	class
0	image_path	label value (0 or 1)	{Pneumonia/ Normal}
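
The class rule described above can be sketched as a small function. It assumes that the label comes from the "Target" column of stage_2_train_labels.csv and the detailed class from the "class" column of stage_2_detailed_class_info.csv; the function name rsna_label_and_class is hypothetical.

```python
def rsna_label_and_class(target, detailed_class):
    """Map one annotation to (label, class name)."""
    label = int(target)
    # Only "Lung Opacity" counts as pneumonia; every other detailed class is "Normal".
    name = "Pneumonia" if detailed_class == "Lung Opacity" else "Normal"
    return label, name


positive = rsna_label_and_class(1, "Lung Opacity")
negative = rsna_label_and_class(0, "No Lung Opacity / Not Normal")
```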

SIIM Pneumothorax

Download the SIIM dataset from Kaggle.
dcm2png: see the details under VinDr-CXR.
Get the label from "EncodedPixels" in stage_2_train.csv.
Split train/valid/test following the GLoRIA preprocessing code.

The final csv file contains the following columns.

index	image	label	class
0	image_path	label value (0 or 1)	{Pneumothorax/ No Pneumothorax}
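
The label can be derived from the run-length encodings: an "EncodedPixels" value of "-1" marks an image without a pneumothorax mask. The sketch below assumes an image may have several annotation rows and is positive if any of them holds a mask; the function name siim_label_and_class is hypothetical.

```python
CLASS_BY_LABEL = {1: "Pneumothorax", 0: "No Pneumothorax"}


def siim_label_and_class(encoded_pixels_list):
    """Return (label, class name) for all EncodedPixels values of one image."""
    # strip() tolerates a leading space, as in " -1".
    has_mask = any(str(rle).strip() != "-1" for rle in encoded_pixels_list)
    label = int(has_mask)
    return label, CLASS_BY_LABEL[label]


with_mask = siim_label_and_class(["-1", "557374 2 1015 8"])
without_mask = siim_label_and_class([" -1"])
```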

Image-text evaluation dataset

OpenI

Download the dcm files and reports from OpenI.
dcm2png: see the details under VinDr-CXR. We only use frontal images (3,955 image-text pairs).

Extract the corresponding report from each .xml file.

import xmltodict

def extract_report_from_xml(load_path):  # load_path=/path/to/load/*.xml, return report
    with open(load_path) as fd:
        data = xmltodict.parse(fd.read())

    abstract = data["eCitation"]["MedlineCitation"]["Article"]["Abstract"]["AbstractText"]
    comparison, indication, finding, impression = abstract

    # Note: max() on two strings compares them lexicographically, not by length.
    report = max(finding["#text"], impression["#text"])
    return report

The final csv file contains the following columns.

index	image	text
0	image_path	max(finding, impression)
