Code for the paper "LLMs Can Evolve Continually on Modality for X-Modal Reasoning" (NeurIPS 2024) 🎉
[2024.11] 🔥 Released code and checkpoints.
TODO:
- Train & test code.
- Data Processing.
- Checkpoints.
Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-path switching and expansion abilities that enables MLLMs to continually evolve on modalities for X-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth, and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
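To make the Adapter-in-Adapter (AnA) idea concrete, here is a minimal PyTorch sketch that mixes a trainable uni-modal adapter for the current modality with frozen adapters from previously learned modalities through an MoE-style gate. The class names, bottleneck size, and gating details are our own illustrative assumptions and do not reproduce the actual implementation in this repo.

```python
# Illustrative-only sketch of an Adapter-in-Adapter (AnA) block with an
# MoE-style gate. Names and shapes are assumptions, not the repo's actual code.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Standard bottleneck adapter: down-project -> nonlinearity -> up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class AdapterInAdapter(nn.Module):
    """Combines the current modality's trainable adapter with frozen adapters
    from previously learned modalities, mixed by a lightweight MoE-style gate."""
    def __init__(self, dim: int, num_prev_modalities: int):
        super().__init__()
        self.uni_adapter = BottleneckAdapter(dim)            # trainable, current modality
        self.prev_adapters = nn.ModuleList(
            [BottleneckAdapter(dim) for _ in range(num_prev_modalities)]
        )
        for p in self.prev_adapters.parameters():            # keep old knowledge frozen
            p.requires_grad = False
        self.gate = nn.Linear(dim, num_prev_modalities + 1)  # MoE-style router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LLM layer
        experts = [self.uni_adapter(x)] + [a(x) for a in self.prev_adapters]
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, seq, num_experts)
        mixed = sum(w.unsqueeze(-1) * e for w, e in zip(weights.unbind(-1), experts))
        return x + mixed                                     # residual connection
```

The gate lets the current modality reuse knowledge from earlier adapters while only the new uni-modal adapter is trained, which is the intuition behind the plasticity/stability trade-off described above.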
Our depth data are generated following the instructions of OneLLM.
Step 1: Download CC3M data based on this json file (downloading the entire dataset requires too much disk space).
- LLaVA:
  training_data: llava_instruct_50k_train_data.json
- CC3M:
  training_data: CC3M_train_50k_clear_depth_complete.json
  val_data: CC3M_val_1_5k_coco.json
- Fusion (CC3M + LLaVA): we recommend using this data first, since the data mixing has already been done.
  training_data: fusion_train_55w_clear.json
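If you need to fetch the raw CC3M images yourself for Step 1, a sketch like the following may help. It assumes the json file is a list of records that each carry an image URL and a target file name (the "url" and "image" keys are our own guesses; adjust them to the actual json), so treat it as a starting point only.

```python
# Hypothetical helper for Step 1: download images listed in a CC3M-style json.
# The field names ("url", "image") are assumptions; adjust them to the actual file.
import json
import os
import urllib.request


def download_cc3m(json_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(json_path) as f:
        records = json.load(f)
    for rec in records:
        url = rec.get("url")        # assumed key holding the source image URL
        name = rec.get("image")     # assumed key holding the target file name
        if not url or not name:
            continue
        dst = os.path.join(out_dir, name)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if os.path.exists(dst):
            continue                # skip files that were already fetched
        try:
            urllib.request.urlretrieve(url, dst)
        except Exception as e:      # CC3M links rot frequently; log and move on
            print(f"failed: {url} ({e})")


# download_cc3m("CC3M_train_50k_clear_depth_complete.json", "cc3m_images")
```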
Step 2: Follow the installation guidance of DPT.
Step 3: Generate the depth data using the following scripts, or DIY.
training_data: run the script
bash run_scripts/data_processing/depth_generation_1.sh
val_data: run the script
bash run_scripts/data_processing/depth_generation_val_cc3m.sh
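If you prefer to DIY (Step 3), a rough equivalent using the DPT model exposed through torch.hub (the intel-isl MiDaS release) is sketched below. It illustrates the general procedure only; the output format (16-bit PNG) and glob pattern are our own assumptions, and the scripts above remain the reference pipeline.

```python
# Rough DIY equivalent of the depth-generation scripts, using the intel-isl
# MiDaS/DPT models from torch.hub. Output format is an assumption.
import glob
import os

import cv2
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform


def generate_depth(image_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    # Adjust the glob pattern to match your image extensions.
    for path in glob.glob(os.path.join(image_dir, "*.jpg")):
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            pred = midas(transform(img).to(device))
            pred = torch.nn.functional.interpolate(
                pred.unsqueeze(1), size=img.shape[:2],
                mode="bicubic", align_corners=False,
            ).squeeze()
        depth = pred.cpu().numpy()
        # Normalize to 16-bit for storage; adjust to whatever the dataset loader expects.
        depth = (65535 * (depth - depth.min()) / (np.ptp(depth) + 1e-8)).astype(np.uint16)
        out_name = os.path.basename(path).replace(".jpg", ".png")
        cv2.imwrite(os.path.join(out_dir, out_name), depth)


# generate_depth("cc3m_images", "cc3m_depth")
```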
All checkpoints can be found in Google Drive.
Tips
Before testing, please change the checkpoint paths in the configs under the following directory:
lavis/projects/xinstruct_blip/train/vicuna7b
Also change the paths in:
lavis/projects/xinstruct_blip/eval/vicuna7b
We marked all the paths with "path_to_your_data"; an optional helper sketch for filling them in follows the example below.
Example:
Run the script bash run_scripts/ours/video/test_video_modality.sh
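Editing the placeholders by hand works fine; if you prefer to script it, the hypothetical helper below walks the config directories and substitutes the "path_to_your_data" marker. The helper is our own convenience, not part of the repo; extend CONFIG_DIRS to cover any other directories you need (e.g. the dataset configs mentioned in the training tips).

```python
# Hypothetical convenience script: replace the "path_to_your_data" placeholder
# in every yaml config under the train/eval project directories.
import glob
import os

PLACEHOLDER = "path_to_your_data"
CONFIG_DIRS = [
    "lavis/projects/xinstruct_blip/train/vicuna7b",
    "lavis/projects/xinstruct_blip/eval/vicuna7b",
]


def fill_placeholders(real_path: str) -> None:
    for cfg_dir in CONFIG_DIRS:
        for cfg in glob.glob(os.path.join(cfg_dir, "**", "*.yaml"), recursive=True):
            with open(cfg) as f:
                text = f.read()
            if PLACEHOLDER in text:
                with open(cfg, "w") as f:
                    f.write(text.replace(PLACEHOLDER, real_path))
                print(f"updated {cfg}")


# fill_placeholders("/data/your_checkpoints_and_datasets")
```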
Tips
Before training, please check the data paths in the configs under the following directory: lavis/configs/datasets/depth
You also need to change the file paths in:
lavis/datasets/datasets/depth_vqa_dataset.py
lavis/tasks/captioning.py
We marked all the paths with "path_to_your_data".
Example:
Run the script bash run_scripts/ours/video/train_video_modality.sh
- Note: you need to load the parameters trained on the previous modality before training on the current one; a loading sketch follows.
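Conceptually, loading the previous modality's parameters means initializing the current stage from the checkpoint saved at the end of the last stage. The minimal sketch below shows the general idea; the "model" key and strict=False loading are assumptions based on common LAVIS-style checkpoints, so check the actual training config for the supported field.

```python
# Minimal sketch: initialize the current modality's model from the checkpoint
# saved by the previous stage. The "model" key and strict=False are assumptions.
import torch


def load_previous_stage(model: torch.nn.Module, prev_ckpt: str) -> None:
    ckpt = torch.load(prev_ckpt, map_location="cpu")
    state = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")


# load_previous_stage(model, "checkpoints/previous_modality/checkpoint_best.pth")
```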
@article{yu2024llms,
title={LLMs Can Evolve Continually on Modality for X-Modal Reasoning},
author={Yu, Jiazuo and Xiong, Haomiao and Zhang, Lu and Diao, Haiwen and Zhuge, Yunzhi and Hong, Lanqing and Wang, Dong and Lu, Huchuan and He, You and Chen, Long},
journal={arXiv preprint arXiv:2410.20178},
year={2024}
}
Our repo is built on X-InstructBLIP and OneLLM. We thank the authors for sharing their code.