This project, "Medical Image Captioning," aims to develop an advanced AI-based system capable of generating accurate and contextually relevant captions for complex medical images, such as chest X-rays. It integrates state-of-the-art computer vision and natural language processing techniques to enhance diagnostic workflows.
This project leverages transformer models like ViT+GPT2, BLIP-2, and GIT to generate clinically accurate captions for medical images. It bridges the gap between vision and language tasks, enhancing radiological workflows and improving diagnostic precision. The approach addresses challenges like limited labeled datasets and the need for interpretability in healthcare applications.
- Dataset: Indiana University Chest X-ray Collection (IU Chest X-ray).
- Image preprocessing:
  - Resizing images to a uniform resolution.
  - Normalizing pixel values according to each model's requirements.
  - Retaining only frontal-view X-rays for consistency.
- Text preprocessing:
  - Extracting the "Findings" and "Impression" sections from the reports.
  - Tokenizing the text with each model's tokenizer (a minimal sketch follows this list).
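A minimal preprocessing sketch is shown below. The 224x224 target size, the ImageNet normalization statistics, the GPT-2 tokenizer, and the report dictionary keys (`findings`, `impression`) are illustrative assumptions rather than the project's exact pipeline.

```python
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# Illustrative image pipeline: resize to a uniform resolution and
# normalize with ImageNet statistics (adjust to the chosen model's needs).
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def preprocess_image(path):
    """Load a (frontal-view) X-ray and return a normalized tensor."""
    image = Image.open(path).convert("RGB")
    return image_transform(image)

def extract_report_text(report):
    """Join the Findings and Impression sections of a report dict."""
    return " ".join(filter(None, [report.get("findings"),
                                  report.get("impression")]))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def preprocess_text(report, max_length=128):
    """Tokenize the extracted report text for the caption decoder."""
    text = extract_report_text(report)
    return tokenizer(text, truncation=True, max_length=max_length,
                     padding="max_length", return_tensors="pt")
```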
- ViT+GPT2: combines a Vision Transformer (ViT) encoder with a GPT-2 decoder, a standard encoder-decoder setup for multimodal tasks like image captioning.
- BLIP-2: integrates a Vision Transformer, a Q-Former, and the OPT-2.7B language model; fine-tuning focuses on the Q-Former for better visual-text alignment.
- GIT: employs a CLIP-style vision encoder and a GPT-style transformer for text generation, optimized for efficient image-to-text generation (see the loading sketch after this list).
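The three architectures can be instantiated from Hugging Face Transformers roughly as follows; the checkpoint names are common public ones and may differ from those used in this project.

```python
from transformers import (
    VisionEncoderDecoderModel,
    Blip2ForConditionalGeneration,
    AutoModelForCausalLM,
)

# ViT encoder + GPT-2 decoder, tied together as one encoder-decoder model.
vit_gpt2 = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# BLIP-2: ViT vision encoder, Q-Former, and the OPT-2.7B language model.
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b"
)

# GIT: CLIP-style vision encoder feeding a GPT-style text decoder.
git = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

# Freeze everything in BLIP-2 except the Q-Former (and its query tokens)
# to focus fine-tuning on visual-text alignment.
for name, param in blip2.named_parameters():
    param.requires_grad = ("qformer" in name) or ("query_tokens" in name)
```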
- Mixed precision training with the AdamW optimizer.
- Regular checkpointing to save and evaluate model performance (a training-loop sketch follows this list).
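Below is a hedged sketch of that setup using `torch.cuda.amp`. The learning rate, checkpoint cadence, and batch keys (`pixel_values`, `labels`) are illustrative assumptions; `model` and `train_loader` are supplied by the caller.

```python
import os
import torch
from torch.optim import AdamW

def train(model, train_loader, lr=5e-5, ckpt_every=1000):
    """Mixed precision training loop with AdamW and periodic checkpoints."""
    os.makedirs("checkpoints", exist_ok=True)
    model.cuda().train()
    optimizer = AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()

    for step, batch in enumerate(train_loader):
        optimizer.zero_grad()
        # Run the forward pass in reduced precision where it is safe.
        with torch.cuda.amp.autocast():
            outputs = model(pixel_values=batch["pixel_values"].cuda(),
                            labels=batch["labels"].cuda())
            loss = outputs.loss
        scaler.scale(loss).backward()  # scale to avoid fp16 gradient underflow
        scaler.step(optimizer)
        scaler.update()

        if step % ckpt_every == 0:  # periodic checkpoint for later evaluation
            torch.save({"model": model.state_dict(), "step": step},
                       f"checkpoints/step_{step}.pt")
```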
The project evaluated the models using BLEU and ROUGE metrics:

| Model | BLEU-1 | BLEU-4 | ROUGE-1 | ROUGE-L |
|---|---|---|---|---|
| ViT+GPT2 | 0.4955 | 0.2254 | 0.2634 | 0.1424 |
| BLIP-2 | 0.6579 | 0.3941 | 0.3534 | 0.2408 |
| GIT | 0.7455 | 0.4627 | 0.3809 | 0.2695 |
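These metrics can be computed with the `evaluate` library as sketched below; whether this matches the project's exact scoring scripts is an assumption, and the example strings are placeholders.

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["no acute cardiopulmonary abnormality"]          # model output
references = [["no acute cardiopulmonary process identified"]]  # ground truth

# BLEU-1 and BLEU-4 differ only in the maximum n-gram order.
bleu1 = bleu.compute(predictions=predictions, references=references, max_order=1)
bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)

# ROUGE returns rouge1/rouge2/rougeL F-measures.
rouge_scores = rouge.compute(predictions=predictions,
                             references=[r[0] for r in references])

print(bleu1["bleu"], bleu4["bleu"],
      rouge_scores["rouge1"], rouge_scores["rougeL"])
```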
GIT achieved the highest overall performance, demonstrating its ability to generate precise and clinically relevant captions.
This work illustrates the potential of AI-driven systems to revolutionize medical diagnostics by automating image interpretation. Future plans include:
- Integrating more advanced models for better feature extraction.
- Experimenting with new training strategies and loss functions.
- Enhancing model explainability to build trust and transparency in medical AI solutions.
Key references include:
- RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting.
- VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.
See the full bibliography in the report for additional sources.
- Clone this repository.
- Install dependencies from `requirements.txt`.
- Prepare the dataset and place it in the `data/` directory.
- Run the training script for the desired model:

```bash
python train.py --model <model_name>
```
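After training, a caption can be generated roughly as follows. The checkpoint directory `checkpoints/git` and the sample image path are hypothetical; this assumes the trained model and processor were saved with `save_pretrained`.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Hypothetical paths; substitute your own checkpoint and image.
processor = AutoProcessor.from_pretrained("checkpoints/git")
model = AutoModelForCausalLM.from_pretrained("checkpoints/git")

image = Image.open("data/sample_frontal_xray.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a caption with beam search.
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(caption)
```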