This document provides a brief intro to the usage of VCoder LLaVA-1.5. Our code is based on the original LLaVA codebase; please check out their repo for more information.
We add our VCoder to a pretrained LLaVA-1.5 model and train on the COST dataset.
LLaVA-1.5-7b
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
LLaVA-1.5-13b
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
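As an optional sanity check, you can confirm the projector weights landed where expected before training (shown for the 13b variant; the 7b folder is analogous, and the projector file is typically named mm_projector.bin in the LLaVA-1.5 projector releases):
# Run from inside the outputs folder right after cloning
ls llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5   # should list the projector checkpoint, e.g. mm_projector.bin
cd ..  # return to the repository root before launching training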
We provide training code for two variants of VCoder. We train all our models on 8 A100s.
- Run bash scripts/vcoder_train.sh to train either of the following variants on the COST dataset:
  - VCoder LLaVA-1.5-7b: We train the model for 2 epochs. The training time is ~8 hours.
  - VCoder LLaVA-1.5-13b: We train the model for 2 epochs. The training time is ~14 hours.
- Remember to set the model variant in scripts/vcoder_train.sh before training.
- Note: These are the models we use in our demo.
- Run bash scripts/vcoder_ds_train.sh to train either of the following variants on the combination of the COST dataset and General Question Answering datasets (for regularization):
  - VCoder-DS LLaVA-1.5-7b: We train the model for 1 epoch. The training time is ~17 hours.
  - VCoder-DS LLaVA-1.5-13b: We train the model for 1 epoch. The training time is ~30 hours.
- Remember to set the model variant in scripts/vcoder_ds_train.sh before training (a hypothetical sketch of what this involves follows below).
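We do not reproduce the training scripts here; as a purely illustrative sketch, "setting the model variant" boils down to pointing the script at the chosen base LLM and the matching projector folder downloaded above. The variable names below are hypothetical and may not match the ones used in the scripts:
# Hypothetical illustration only -- open scripts/vcoder_train.sh or
# scripts/vcoder_ds_train.sh and edit the corresponding entries there.
MODEL_VERSION="vicuna-13b-v1.5"   # or "vicuna-7b-v1.5"
PROJECTOR_DIR="./outputs/llava-v1.5-mlp2x-336px-pretrain-${MODEL_VERSION}"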
We evaluate our models on the COST val dataset using our own evaluators.
We evaluate on the semantic, instance and panoptic object perception tasks.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost.sh
Remember to set the model variant in scripts/v1_5/eval/cost.sh before evaluating.
We evaluate on the depth object perception task.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost_depth.sh
Remember to set the model variant in scripts/v1_5/eval/cost_depth.sh before evaluating.
General Benchmarks
- We follow the same evaluation setting as LLaVA-1.5.
- Download and unzip the eval files from Google Drive to ./playground/data/eval. This also provides a general structure for all datasets.
# pip3 install gdown
cd playground/data/eval
gdown https://drive.google.com/uc?id=1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy
unzip eval.zip
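If the download and unzip succeeded, you should see sub-folders for the benchmarks referenced below:
# Still inside playground/data/eval after the unzip above; expect folders
# such as vqav2, gqa, vizwiz, pope, MME, and mmbench.
ls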
VQAv2
- Download test2015 and put it under ./playground/data/eval/vqav2 (a download sketch follows this list).
- Multi-GPU inference:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
- Submit the results to the evaluation server.
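The test2015 split is the standard COCO test2015 image set; one way to fetch it (URL taken from the official COCO download page; this is a large download):
# From the repository root
cd playground/data/eval/vqav2
wget http://images.cocodataset.org/zips/test2015.zip
unzip test2015.zip   # extracts a test2015/ folder next to the VQAv2 eval files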
GQA
- Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data (see the sketch after this list).
- Multi-GPU inference:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
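The official GQA download page (https://cs.stanford.edu/people/dorarad/gqa/download.html) links to both the images and the evaluation scripts; at the time of writing the images archive can be fetched as sketched below (large download; prefer the official page if the URL has moved):
# From the repository root
mkdir -p playground/data/eval/gqa/data
cd playground/data/eval/gqa/data
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip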
VizWiz
- Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz (see the layout sketch after this list).
- Single-GPU inference:
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
- Submit the results to the evaluation server.
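After both downloads, the folder used by the eval script should look roughly like this:
# Run from the repository root
ls playground/data/eval/vizwiz
# test.json  test/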
POPE
- Download coco from POPE and put it under ./playground/data/eval/pope (a fetch sketch follows this list).
- Single-GPU inference and evaluation:
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
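One way to fetch the coco annotation files is to clone the POPE repository and copy them over (the path below reflects the repository layout at the time of writing, so adjust if it has changed):
# From the repository root
git clone https://github.com/AoiDragon/POPE
mkdir -p playground/data/eval/pope/coco
cp POPE/output/coco/coco_pope_*.json playground/data/eval/pope/coco/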
MME
- Download the data following the official instructions here.
- Place the downloaded images in MME_Benchmark_release_version.
- Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME (see the check after this list).
- Single-GPU inference and evaluation:
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
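Once both pieces are in place, a quick check from the repository root:
ls playground/data/eval/MME
# expect: eval_tool  MME_Benchmark_release_version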
MMBench
- Download mmbench_dev_20230712.tsv and put it under ./playground/data/eval/mmbench (a download sketch follows this list).
- Single-GPU inference:
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
- Submit the results to the evaluation server.
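The dev TSV is distributed by the MMBench team; at the time of writing it can be fetched from OpenMMLab's mirror as sketched below (prefer the official MMBench instructions if this URL changes):
# From the repository root
cd playground/data/eval/mmbench
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv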