For pre-training, we use the LLaVA-558K to pretrain the mlp connector.
For pre-finetuning, we use the ALLaVA caption data to warm-up the whole CuMo model.
For the visual intruction tuning stage, we use the a mixture of datasets for training:
Please download these datasets following the instructions and the json files. The datasets are structured as:
CuMo
├── cumo
├── scripts
├── checkpoints
│ ├── CuMo-mistral-7b
│ ├── CuMo-mixtral-8x7b
├── data
│ ├── llava
│ │ ├── llava_pretrain
| │ │ ├── images
| │ │ ├── blip_laion_cc_sbu_558k.json
│ ├── jsons
│ │ ├── cumo_pft_allava.json
│ │ ├── cumo_vit_1649K.json
│ ├── coco
│ ├── gqa
│ ├── ocr_vqa
│ ├── textvqa
│ ├── share_textvqa
│ ├── vg
│ ├── gpt4v-dataset
│ ├── sam
│ ├── sharegpt4v
│ ├── wikiart
│ ├── web-celebrity
│ ├── web-landmark
│ ├── ALLaVA
│ ├── docvqa
│ ├── chartqa
│ ├── dvqa
│ ├── ai2d
│ ├── infovqa
│ ├── lima
│ ├── syndog-en
│ ├── eval
│ ├── ...
You can set $CuMo_DIR to specify the path to the root directory of the project.
After downloading the datasets and the JSON files, you can proceed to train the model using the following commands. Taking CuMo Mistral-7B as an example, the first step is to pre-train the MLP connector.
bash scripts/cumo/mistral_7b/pretrain_mistral_7b.sh
The next step is to pre-finetune the whole model,
bash scripts/cumo/mistral_7b/pft_mistral_7b.sh
The final step is the visual instruction tuning stage,
bash scripts/cumo/mistral_7b/sft_mistral_7b.sh
Note that these scripts are for training the model on a single node of 8xA100s. If you want to train the model on multiple nodes, you can use the deepspeed multi-node trainings with added hostfile in the scripts.
We evaluate CuMo models on multiple benchmarks and many scripts are based on LLaVA evaluation settings. We've adapted some of them into multi-GPU evaluation scripts and added evaluations on MMMU and Mathvista. You can download the checkpoints for CuMo mistral-7b / mixtral-8x7b models and follow the evaluation instructions in LLaVA to download the datasets accordingly. The datasets are structured as:
CuMo
├── cumo
├── scripts
├── checkpoints
│ ├── CuMo-mistral-7b
│ ├── CuMo-mixtral-8x7b
├── data
│ ├── eval
│ ├── ├── scienceqa
│ ├── ├── textvqa
│ ├── ├── pope
│ ├── ├── mme
│ ├── ├── gqa
│ ├── ├── seed
│ ├── ├── vqav2
│ ├── ├── mmvet
│ ├── ├── ...
Then run the following commands to evaluate the models. Here are examples based on CuMo Mistral-7b:
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/sqa_m.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/textvqa_m.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/pope_m.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/mme_m.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/gqa.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/seed.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Multi-gpu inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/cumo/eval/vqav2.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/mmvet.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Then submit the cumo_mistral_7b.json to MM-Vet Evluator.
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/llavabench.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Note that we use gpt-4-0613 for evaluation and you may specify your own API key for evaluation.
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/mmbench.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Then submit the result to the evaluation server.
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/mmbench_cn.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Then submit the result to the evaluation server.
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/mmmu.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Single-gpu inference
CUDA_VISIBLE_DEVICES=0 sh scripts/cumo/eval/mathvista.sh $CuMo_DIR/checkpoints/CuMo-mistral-7b mistralai/Mistral-7B-Instruct-v0.2
Note that we use gpt-3.5-turbo for evaluation and you may specify your own API key for evaluation.