Multimodal Large Language Models are Generalist Medical Image Interpreters, medRxiv (2023).
[Paper]
Tianyu Han*, Lisa C. Adams*, Sven Nebelung, Jakob Nikolas Kather, Keno K. Bressem*, and Daniel Truhn*
Han, T., Adams, L. C., Nebelung, S., Kather, J. N., Bressem, K. K., & Truhn, D. (2023). Multimodal Large Language Models are Generalist Medical Image Interpreters. medRxiv, 2023-12.
This repository contains code to probe and evaluate multimodal large language models (LLMs) for their ability to interpret medical images in pathology, dermatology, ophthalmology, and radiology, focusing on two use cases within each discipline.
To clone all files:
git clone https://github.com/peterhan91/Multimodal-Probes
To install Python dependencies:
pip install -r requirements.txt
Abstract
- Background Recent developments in Vision-Language Models (VLMs) offer a new opportunity for the application of AI systems in healthcare. We aim to demonstrate the effectiveness of these general-purpose, large VLMs in interpreting medical images across key medical subspecialties—pathology, dermatology, ophthalmology, and radiology—without the need for specialized fine-tuning.
- Methods We conducted a cross-sectional study to analyze image interpretation of large VLMs, focusing on Flamingo-80B, Flamingo-9B, and an OpenAI CLIP model. The study involved eight clinical tasks (T) across various medical specialties, utilizing 11 medical image datasets released between 2015 and 2022. These tasks include the classification of colorectal tissue, skin lesions, diabetic retinopathy, glaucoma, chest radiographs, and osteoarthritis. Additionally, 931 clinical cases from the NEJM Image Challenge (2005-2023) were evaluated to assess the VLMs' performance on clinical vignette questions. The primary outcomes were measured by F1 scores and the area under the receiver operating characteristic curve (AUC).
- Results In our colorectal cancer (CRC) study (T1), we analyzed 107,180 histological images from 136 patients. In the pan-cancer study (T2), we examined 7,558 images from 19 organs. The Flamingo-80B model proved superior in identifying tissue types, outperforming CLIP representations and other models in the CRC (F1 score: 0.892 vs 0.764) and pan-cancer cohorts (0.870 vs 0.797, P<.001). Importantly, Flamingo-80B also outperformed a domain-specific foundation model pre-trained on Twitter data, with an F1 score of 0.892 vs 0.877. In the studies of pigmented skin lesions (T3), involving 11,720 images, and melanoma (T4), with 33,126 images from 2,056 patients, Flamingo-80B also demonstrated higher accuracy, as shown by its AUC scores (average over skin lesions: 0.945 vs 0.892; melanoma: 0.885 vs 0.834, P<.001). In the ophthalmology tasks (T5 & T6), involving over 44,350 patients for diabetic retinopathy (DR) and 57,770 for glaucoma, it significantly surpassed the baseline models (DR: 0.803 vs 0.725; glaucoma: 0.868 vs 0.716, P<.001). For chest radiographic conditions (T7), with 67,247 participants, and osteoarthritis (OA, T8), involving 7,520 patients, Flamingo-80B consistently achieved the highest AUC among all models (radiographic conditions: 0.781 vs 0.560; OA: 0.810 vs 0.714).
- Conclusions Our results show that non-domain-specific, publicly available vision-language models effectively analyze diverse medical images without fine-tuning, challenging the need for task-specific models.
Flamingo-80B correctly answers and reasons about more than 40% of the complex clinical questions presented in the NEJM Image Challenge.
We provide our benchmarking script for the NEJM Image Challenge.
nejm_image_challenge/nejm_test.py
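The script evaluates the models on the multiple-choice vignette questions of the challenge. As a rough illustration of how a model's free-text answer can be scored against the answer options, the toy snippet below matches the generated text to the options by substring; the matching rule is only an assumption, and nejm_test.py is the authoritative implementation.

```python
# Toy illustration of multiple-choice scoring for image-challenge-style
# questions. The matching rule is an assumption; see nejm_test.py for the
# actual benchmarking logic.
def pick_option(generated_answer: str, options: list[str]) -> int:
    """Return the index of the first option mentioned in the model's free-text answer."""
    text = generated_answer.lower()
    for i, option in enumerate(options):
        if option.lower() in text:
            return i
    return -1  # no option mentioned -> counted as incorrect


options = ["Sarcoidosis", "Tuberculosis", "Lymphoma", "Histoplasmosis", "Silicosis"]
answer = "The bilateral hilar lymphadenopathy is most consistent with sarcoidosis."
print(pick_option(answer, options))  # 0
```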
- Histopathology datasets: Download images from [Kather Colon] https://zenodo.org/record/1214456 and [PanNuke] https://warwick.ac.uk/fac/cross_fac/tia/data/pannuke. Preprocess the PanNuke images and labels following the PLIP repository.
- Dermatology datasets: Download images from [ISIC 2018] https://challenge.isic-archive.com/data/#2018 and [ISIC 2020] https://challenge.isic-archive.com/data/#2020.
- Ophthalmology datasets: Download images from [EyePACS Diabetic Retinopathy Detection Challenge] https://www.kaggle.com/c/diabetic-retinopathy-detection/, [AIROGS] https://zenodo.org/records/5793241, [APTOS-2019] https://www.kaggle.com/c/aptos2019-blindness-detection, and [ODIR-2019] https://odir2019.grand-challenge.org/Download/.
- Radiology datasets: Download images from [OAI] https://nda.nih.gov/oai/query-download, [MOST] https://most.ucsf.edu/multicenter-osteoarthritis-study-most-public-data-sharing, and [PadChest] https://bimcv.cipf.es/bimcv-projects/padchest/. Note: to gain access to the data, you must be a credentialed user as defined on [OAI] https://nda.nih.gov/oai and [MOST] https://agingresearchbiobank.nia.nih.gov/. In addition, knee joints were extracted from both datasets using a pretrained Hourglass Network from the KNEEL project.
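The probing scripts read image paths and labels from CSV files (e.g., csvs/patho_kather.csv). The exact schema is defined by the scripts and by the CSVs shipped in ./csvs/; the snippet below only sketches the general idea of building such a file for the Kather Colon tiles, with hypothetical column names.

```python
import pandas as pd
from pathlib import Path

# Hypothetical layout: one row per tile, with the tissue class taken from
# the Kather Colon folder name. The column names "image_path" and "label"
# are assumptions; check the CSV files in ./csvs/ for the exact schema.
root = Path("data/NCT-CRC-HE-100K")
rows = [
    {"image_path": str(p), "label": p.parent.name}
    for p in root.glob("*/*.tif")
]
pd.DataFrame(rows).to_csv("csvs/patho_kather.csv", index=False)
```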
Run one of the following commands to extract layer activations:
python -u activation.py --model HuggingFaceM4/idefics-80b-instruct --csv_file ./csvs/patho_kather.csv
or
python -u activation.py --model HuggingFaceM4/idefics-9b-instruct --csv_file ./csvs/patho_kather.csv
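The Flamingo-80B/9B models in the paper are probed via the open IDEFICS checkpoints referenced in the commands above. For reference, the sketch below shows how per-layer activations can be pulled from an IDEFICS checkpoint with Hugging Face transformers; it is a simplified illustration of the idea behind activation.py, not a drop-in replacement (the prompt template, pooling, and which layers are saved may differ in the actual script, and the image file name is hypothetical).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)
model.eval()

image = Image.open("example_tile.png")  # hypothetical histology tile
prompts = [["User: What tissue type is shown in this image?", image, "<end_of_utterance>", "\nAssistant:"]]
inputs = processor(prompts, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer (plus the embedding
# output), each of shape (batch, sequence_length, hidden_size); mean-pool
# over tokens to get one feature vector per layer for the linear probes.
features = [h.mean(dim=1).float().cpu() for h in outputs.hidden_states]
print(len(features), features[-1].shape)
```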
Run the following command to train linear probes on layer activations.
python eval_torch_ex.py --dataset patho_kather --csv_path csvs/patho_kather.csv --save_path results_80B
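Conceptually, the probe is a single linear layer trained on frozen activations. A minimal sketch of that idea is shown below, assuming features and integer labels for one layer have already been saved as tensors; the actual eval_torch_ex.py adds layer selection, dataset handling, and held-out evaluation on top of this.

```python
import torch
import torch.nn as nn

# Hypothetical pre-extracted features and labels for one layer.
features = torch.load("features_layer40.pt")   # (num_samples, hidden_size)
labels = torch.load("labels.pt")               # (num_samples,) integer class ids

probe = nn.Linear(features.shape[1], int(labels.max()) + 1)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # single-label, mutually exclusive classes

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(probe(features), labels)
    loss.backward()
    optimizer.step()

# Accuracy on the training split, just to show the probe learned something;
# the real script evaluates on a held-out test set (F1, AUC).
accuracy = (probe(features).argmax(dim=1) == labels).float().mean()
print(f"train accuracy: {accuracy:.3f}")
```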
Because the classes in PadChest are not mutually exclusive, we use a multi-label classification loss to train the probe.
python eval_torch_padchest.py --save_path results_80B
Both scripts handle training and evaluation of the probe.
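For the multi-label PadChest case, the only essential change to the sketch above is the loss: a sigmoid-based multi-label loss over multi-hot label vectors instead of a softmax cross-entropy over class indices. A short sketch of that change, assuming labels are stored as multi-hot tensors (the file names are hypothetical):

```python
import torch
import torch.nn as nn

features = torch.load("features_layer40.pt")        # (num_samples, hidden_size)
labels = torch.load("padchest_labels.pt").float()   # (num_samples, num_findings) multi-hot

probe = nn.Linear(features.shape[1], labels.shape[1])
criterion = nn.BCEWithLogitsLoss()  # one independent binary decision per finding
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(probe(features), labels)
    loss.backward()
    optimizer.step()
```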
Run the following command to train linear probes on CLIP features.
python clip_ex.py --dataset patho_kather --csv_path csvs/patho_kather.csv
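For comparison, the CLIP baseline trains the same kind of linear probe on frozen CLIP image embeddings instead of layer activations. The sketch below shows how such embeddings can be obtained with the Hugging Face CLIP implementation; the checkpoint name and image file are illustrative, and clip_ex.py may use a different backbone or library.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative OpenAI CLIP checkpoint; the backbone used in clip_ex.py may differ.
checkpoint = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(checkpoint).eval()
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example_tile.png")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # (1, projection_dim)

# These frozen embeddings can then be fed to the same linear-probe
# training loop sketched above.
print(embedding.shape)
```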
Please open a new issue thread describing the problem with the codebase, or report issues directly to than@ukaachen.de.
@article{han2023multimodal,
title={Multimodal Large Language Models are Generalist Medical Image Interpreters},
author={Han, Tianyu and Adams, Lisa C and Nebelung, Sven and Kather, Jakob Nikolas and Bressem, Keno K and Truhn, Daniel},
journal={medRxiv},
pages={2023--12},
year={2023},
publisher={Cold Spring Harbor Laboratory Press}
}
The source code in this repository is licensed under the MIT license, which you can find in the LICENSE file.