Project Page | Paper | Dataset
We introduce a new multimodal task named Interleaved Image-Text Comprehension (IITC), designed to evaluate a model's capability to handle interleaved image-text inputs that contain redundant and misleading information. To enhance and measure model performance on the IITC task, we developed the VEGA dataset. By fine-tuning Qwen-VL-Chat on the VEGA dataset, we created VEGA-Base, a strong baseline for the IITC task.
- The VEGA dataset consists of 2 tasks, with approximately 593,000 training examples and 2,326 test examples. You can download VEGA here.
- Unzip imgs.zip to obtain the following folder structure.
.
├── datas
│ ├── IITC_4k_test.json
│ ├── IITC_4k_train.json
│ ├── IITC_8k_test.json
│ ├── IITC_8k_train.json
│ ├── ITA_3picture_C_train.json
│ ├── ITA_3picture_E_train.json
│ ├── ITA_3picture_F_train.json
│ ├── ITA_3picture_test.json
│ ├── ITA_5picture_C_train.json
│ ├── ITA_5picture_E_train.json
│ ├── ITA_5picture_F_train.json
│ └── ITA_5picture_test.json
├── imgs
│ ├── test_imgs
│ │ ├── 1001.0025v1
│ │ │ └── pdferror.png
│ │ ├── 1001.0357v1
│ │ │ └── Different_Capacity_regions_2dB.png
...
│   ├── train_imgs
...
The data in the IITC*.json files follows this format:
{"id": "The paper's ID on arXiv.",
"title": "The paper's title.",
"caption": "The caption of correct image.",
"context": "Interleaved image-text input.",
"question": "Question about a specific image.",
"answer": "The answer."
"image_paths": "List of image paths.",
"truth_fig_idx": "Index of the correct image in image_paths."
}
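For reference, a minimal Python sketch of loading an IITC split, assuming each JSON file stores a list of such records (file names follow the tree above):

import json

# Load the IITC test split (assumed to be a JSON list of records).
with open("datas/IITC_4k_test.json", "r", encoding="utf-8") as f:
    examples = json.load(f)

sample = examples[0]
print(sample["question"])   # question about a specific image
print(sample["answer"])     # reference answer
# truth_fig_idx indexes the ground-truth image inside image_paths.
print(sample["image_paths"][sample["truth_fig_idx"]])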
The data in the ITA*.json files follows this format:
{"id": "List of paper's ID on arXiv.",
"image_paths": "List of image paths.",
"context": "Interleaved image-text input.",
"answer": "The answer."
}
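A small sanity check for an ITA split, assuming the entries in image_paths are relative to the unzipped imgs folder (as in the example below):

import json
import os

# Verify that every image referenced by an ITA split exists on disk.
with open("datas/ITA_3picture_test.json", "r", encoding="utf-8") as f:
    records = json.load(f)

missing = [
    path
    for rec in records
    for path in rec["image_paths"]
    if not os.path.exists(os.path.join("imgs", path))
]
print(f"{len(records)} records, {len(missing)} missing images")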
In every JSON "context" field, each image is represented as "Picture id: <img>img_path</img>\n", where "id" is the position of the image in the conversation, starting from 1. For example:
{
"context": "...The result illustrated in Figure~6[Picture 1] shows that the proposed network extracting patches features separately performs significantly better than previous methods extracting patches feature together.\nPicture 1: <img>test_imgs/1803.06598v1/Figs/stack_LAN.png</img>\nFigure. 6 Picture 2: <img>test_imgs/1803.06598v1/Figs/SIR_VS_CR_curve.png</img>\nFigure. 7...",
}
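If you need to recover the image positions and paths from a context string, a simple regex sketch matching the placeholder format described above works:

import re

# Matches "Picture N: <img>path</img>" placeholders in a context string.
IMG_PATTERN = re.compile(r"Picture (\d+): <img>(.*?)</img>")

def extract_images(context):
    """Return (picture_id, image_path) pairs in order of appearance."""
    return [(int(idx), path) for idx, path in IMG_PATTERN.findall(context)]

context = "Picture 1: <img>test_imgs/1803.06598v1/Figs/stack_LAN.png</img>\n"
print(extract_images(context))  # [(1, 'test_imgs/1803.06598v1/Figs/stack_LAN.png')]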
To run the evaluation, clone the repository and install the dependencies:
git clone https://github.com/zhourax/VEGA
cd VEGA
pip install nltk
pip install rouge
After configuring your model in eval/IITC.py and eval/ITA.py, run:
bash eval/IITC.sh
bash eval/ITA.sh
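The nltk and rouge dependencies suggest ROUGE-based scoring of generated answers. Purely as an illustration (the official scripts may compute additional metrics, e.g. image-selection accuracy for IITC), the rouge package can be used like this:

from rouge import Rouge

# Illustrative only: score one model answer against the reference answer.
rouge = Rouge()
scores = rouge.get_scores(
    "the proposed network performs better",                 # model output
    "the proposed network performs significantly better",   # reference answer
)
print(scores[0]["rouge-l"]["f"])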
@misc{zhou2024vegalearninginterleavedimagetext,
title={VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models},
author={Chenyu Zhou and Mengdan Zhang and Peixian Chen and Chaoyou Fu and Yunhang Shen and Xiawu Zheng and Xing Sun and Rongrong Ji},
year={2024},
eprint={2406.10228},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.10228},
}