Prism is a framework built on VLMEvalKit for decoupling and assessing the capabilities of large vision-language models (LVLMs). It comprises two distinct stages: 1) a Perception Stage that first instructs the VLM to extract and express the visual information of an image in text; 2) a Reasoning Stage that utilizes an external LLM (GPT-3.5, GPT-4, etc.) to conduct reasoning and answer the question based on that textual information. Prism can both enable breakdown analysis of VLM capabilities and serve as a solution for vision-language tasks by integrating any given VLM and LLM.
PrismCaptioners are VLMs we trained with the LLaVA architecture on the ALLaVA dataset, which can be used for perception in the Prism framework. We have released PrismCaptioner-7B and PrismCaptioner-2B.
Demo
from demo import Perception, Reasoning

text = 'What is this framework about?'
img_path = 'Prism.jpg'

# Perception stage: a VLM extracts and describes the visual information in the image
perception_module = Perception(prompt_version='generic', model='GPT4V')
# Reasoning stage: an external LLM answers the question based on the textual description
reasoning_module = Reasoning(model='chatgpt-0125')

des = perception_module.generate(text, img_path)  # textual description of the image
res = reasoning_module.generate(des, text)        # final answer to the question
- Perception Module: supported_VLM (in VLMEvalKit), PrismCaptioners
- Reasoning Module: GPT models, vLLM models, DeepSeek models. Check config for API calling.
Before running Prism, you need to prepare the relevant prerequisites, including VLMEvalKit and query-specific instructions. After that, you can check Usage for decoupling VLMs.
Check the VLMEvalKit Quickstart for preparation. Completing Step 0 and Step 1 is sufficient for Prism. Make sure you run Prism in the same environment where VLMEvalKit is set up.
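For reference, a typical preparation looks like the following (based on the VLMEvalKit Quickstart at the time of writing; follow the linked Quickstart if the steps have changed):
# Step 0: install VLMEvalKit into the environment you will use for Prism
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
# Step 1: configure API keys (e.g., OPENAI_API_KEY) as environment variables or in a .env file
Then clone Prism: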
git clone https://github.com/SparksJoe/Prism
cd Prism
If you want to use query-specific instructions for perception, use the following command to generate the query-specific parts for the required benchmark with the reasoning module.
# Generate query-specific parts
python gen_prompt.py --data MMStar --model chatgpt-0125
Make sure that you use the same reasoning module when generating the query-specific parts and when conducting reasoning.
After preparation, you can run Prism with GPT models or DeepSeek models as the reasoning module. For Hugging Face models like llama-70b-chat, you can deploy them with vLLM as described in the deployment section below.
Run Prism with run.py, using either python or torchrun. The default --config setting for Prism is shown in the default config. The arguments are annotated below.
Arguments
- --data (str, default to 'MMStar'): Set the benchmark you want to perform Prism on.
- --model (str, default to 'GPT4V'): Set the VLM used by the perception module.
- --infer_model (str, default to 'chatgpt-0125'): Set the LLM used by the reasoning module.
- --prompt_version (str, default to 'generic'): Set the instruction for the perception stage. Check prompts for details.
- --mode (str, default to 'all', choices are ['all', 'perception', 'reasoning']): When mode is set to 'all', Prism performs perception, reasoning, and evaluation; when set to 'perception', it performs only perception; when set to 'reasoning', it performs perception and reasoning.
- --nproc (int, default to 4): The number of threads for API calling.
- --postproc (default to False): Whether to use random choice for postprocessing.
There are two ways to run Prism.
Use a Custom Config. Write the settings you want into self_config.json.
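A minimal sketch of what self_config.json might contain, mirroring the arguments listed above (the exact schema is defined by the repository's default config, so treat these keys as an assumption):
{
    "data": "MMStar",
    "model": "llava_next_yi_34b",
    "infer_model": "gpt-4-0125",
    "prompt_version": "generic",
    "mode": "all",
    "nproc": 4
}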
# use python
python run.py --config config/self_config.json
# use torchrun for multi-gpu inference
torchrun --nproc_per_node={gpu_nums} run.py --config config/self_config.json
Use Parameters. Pass the parameters you modified in the command line, and they will replace the original ones in the default config.
# use python
python run.py --model llava_next_yi_34b --infer_model gpt-4-0125
# use torchrun for multi-gpu inference
torchrun --nproc_per_node={gpu_nums} run.py --model llava_next_yi_34b --infer_model gpt-4-0125
The command above replaces model and infer_model in the default setting.
Use Query-Specific Instruction. You should keep the reasoning module consistent with the prompt version.
# use python
python run.py --model llava_next_yi_34b --prompt_version query-specific_chatgpt-0125 --infer_model chatgpt-0125
# use torchrun for multi-gpu inference
torchrun --nproc_per_node={gpu_nums} run.py --model llava_next_yi_34b --prompt_version query-specific_chatgpt-0125 --infer_model chatgpt-0125
Use PrismCaptioner. Prism now supports PrismCaptioner-[2B/7B]. Just use --model prismcaptioner-2b.
You can deploy open-source Hugging Face models for reasoning with vllm.
First install:
pip install vllm
Then deploy the model from the command line. For Meta-Llama-3-70B-Instruct, use:
python -m vllm.entrypoints.openai.api_server \
-tp {gpu_nums} \
--model ${MODEL_PATH} \
--served-model-name llama3-70b-chat \
--port 8080
The default port used in Prism is 8080. Keep --served-model-name consistent with the model name in the config, and remember to set stop tokens for vllm models. Then you can call the model by name for reasoning with the command line:
python run.py --model GPT4V --infer_model llama3-70b-chat
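Before launching Prism, you can optionally confirm that the deployed server is reachable and that the served model name matches what the reasoning module will request. A minimal sketch using the OpenAI-compatible endpoint that vllm exposes (the port and model name follow the deployment command above):
import requests

# Query the vllm OpenAI-compatible chat endpoint directly as a sanity check.
resp = requests.post(
    'http://localhost:8080/v1/chat/completions',  # 8080 is the default port Prism expects
    json={
        'model': 'llama3-70b-chat',               # must match --served-model-name
        'messages': [{'role': 'user', 'content': 'Hello'}],
        'max_tokens': 16,
    },
)
print(resp.json()['choices'][0]['message']['content'])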
Merge VLMs. If you have generated perception results from two different VLMs on the same benchmark with an identical prompt, for instance GPT4V and GeminiProVision, you can pass GPT4V~GeminiProVision to --model to conduct reasoning on the merged information from both.
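For example, assuming the describe results of both VLMs already exist under the same prompt version (chatgpt-0125 here is simply the default reasoning module):
python run.py --model GPT4V~GeminiProVision --infer_model chatgpt-0125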
Max Output Length. For better reasoning performance, you can append the suffix -2048 to the infer model name, e.g., llama3-70b-chat-2048, to set a larger maximum output length for the reasoning module.
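For example, to give the llama3-70b-chat deployment above a larger output budget:
python run.py --model GPT4V --infer_model llama3-70b-chat-2048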
The results are organized in the following structure.
└── results
    ├── prompt_version (e.g., generic)
    │   ├── dataset_name (e.g., MMStar)
    │   │   ├── frontend (e.g., GPT4V)
    │   │   │   ├── describe result file
    │   │   │   ├── backend (e.g., chatgpt-0125)
    │   │   │   │   ├── post_infer result files
    │   │   │   │   ├── evaluation result files
    │   │   │   ...
    │   │   └── frontend_backend (e.g., gpt4-0125)
    │   │   ...
    │   └── dataset_name (e.g., MMStar)
    │   ...
    └── prompt_version (e.g., query-specific_chatgpt-0125)
        ...
If you find our work helpful for your research, please consider giving a star and a citation:
@article{qiao2024prism,
title={Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs},
author={Qiao, Yuxuan and Duan, Haodong and Fang, Xinyu and Yang, Junming and Chen, Lin and Zhang, Songyang and Wang, Jiaqi and Lin, Dahua and Chen, Kai},
journal={arXiv preprint arXiv:2406.14544},
year={2024}
}
- VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)
- XTuner: An efficient, flexible and full-featured toolkit for fine-tuning LLM
- ALLaVA: Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
- Utmost gratitude to Kenny