This repository contains the code, data and model for the paper titled "GalleryGPT: Analyzing Paintings with Large Multimodal Models".
- [2024-7-18] We released code, PaintingForm dataset and GalleryGPT LoRA checkpoint.
- [2024-7-16] The paper has been accepted by ACM MM 2024 (Oral, Top 3.97% 🎉🎉🎉).
```shell
cd GalleryGPT
conda create -n gallery_gpt python=3.10 -y
conda activate gallery_gpt
pip install -e .
pip install protobuf
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
Download the PaintingForm dataset. `train_samples_tuning.json` contains the painting formal-analysis annotations used for instruction fine-tuning.
The overall pipeline for constructing our PaintingForm dataset:
Place the data in the root directory (or another directory of your choice). Data structure:

```
├── art_images_data/
│   ├── images/0.png
│   ├── images/1.png
│   ├── ...
├── train_samples_tuning.json
```
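The annotation format is not spelled out above; below is a minimal loading sketch, assuming `train_samples_tuning.json` follows the common LLaVA-style instruction-tuning layout (a list of records with `image` and `conversations` fields). That layout is an assumption, so inspect a sample of the file before relying on it.

```python
import json

# Hypothetical record in the LLaVA-style layout (an assumption, not
# confirmed by this README): each sample pairs an image with a
# human/gpt conversation carrying the formal analysis.
sample = [
    {
        "id": "0",
        "image": "images/0.png",
        "conversations": [
            {"from": "human", "value": "<image>\nWrite a formal analysis of this painting."},
            {"from": "gpt", "value": "This composition balances warm and cool tones..."},
        ],
    }
]

def iter_pairs(records):
    """Yield (image_path, analysis_text) pairs from instruction records."""
    for rec in records:
        answers = [t["value"] for t in rec["conversations"] if t["from"] == "gpt"]
        if answers:
            yield rec["image"], answers[0]

# For the real file, load it first:
# records = json.load(open("train_samples_tuning.json"))
pairs = list(iter_pairs(sample))
```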
Download ShareGPT4V-7B as the base model and place it in `./share4v/llava-7b`, then replace its `config.json` with the one provided in the root directory. Run the fine-tuning script:

```shell
sh finetune_task_lora.sh
```
For inference, download the base model, replace its `config.json` with the one provided in the root directory, and download the LoRA checkpoint. Then run:

```shell
cd llava/eval
python run_llava.py --model-path llava-lora-model --model-base share4v/llava-7b --image-file your/image/path --query "your question"
```
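If you want to call inference from your own scripts, the command above can be wrapped in a small helper. A minimal sketch, assuming the CLI flags shown above; `build_infer_cmd` is a hypothetical helper name, and the defaults mirror the paths in this README:

```python
import subprocess

def build_infer_cmd(image_path, query,
                    model_path="llava-lora-model",
                    model_base="share4v/llava-7b"):
    """Assemble the run_llava.py command line from the flags above."""
    return [
        "python", "run_llava.py",
        "--model-path", model_path,
        "--model-base", model_base,
        "--image-file", image_path,
        "--query", query,
    ]

# Launch from llava/eval, e.g.:
# subprocess.run(build_infer_cmd("path/to/painting.png",
#                                "Give a formal analysis of this painting."),
#                cwd="llava/eval", check=True)
```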
Captioning metrics on our test set:
| Model | BLEU | GLEU | METEOR | ROUGE |
|---|---|---|---|---|
| LLaVA-1.5-7B | 9.87 | 14.59 | 26.19 | 26.37 |
| Qwen-VL-Chat-7B | 13.65 | 16.42 | 29.78 | 26.72 |
| ShareGPT4V-7B | 12.38 | 16.14 | 31.53 | 26.63 |
| GalleryGPT-7B | 21.23 | 21.68 | 37.62 | 31.34 |
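The exact scorer behind these numbers is not stated in this README. As an illustration of what the BLEU column measures, here is a stdlib-only sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty (a simplified BLEU-2, not the toolkit used for the table):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate, reference):
    """Simplified sentence BLEU: geometric mean of clipped 1- and 2-gram
    precision, multiplied by a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
```

For example, `bleu2("the cat", "the cat sat")` has perfect clipped precision but is penalized for brevity, yielding exp(-0.5) ≈ 0.607.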
Formal analysis generation:
Dialogue examples:
This project is built on top of the amazing LLaVA and ShareGPT4V repositories. Thanks for their contributions.
If you find our work helpful to your research, please consider citing us with this BibTeX:
```bibtex
@inproceedings{MM24GalleryGPT,
  author    = {Yi Bin and
               Wenhao Shi and
               Yujuan Ding and
               Zhiqiang Hu and
               Zheng Wang and
               Yang Yang and
               See-Kiong Ng and
               Heng Tao Shen},
  title     = {GalleryGPT: Analyzing Paintings with Large Multimodal Models},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia, 28 October – 1 November 2024, Melbourne, Australia},
  year      = {2024},
}
```