
iLLaVA


iLLaVA is an efficient method for large vision-language models based on merging visual tokens. It achieves higher throughput and a 1.7×-2× memory reduction with comparable performance by merging redundant visual tokens in certain layers.

Fig.1: The framework of iLLaVA

Fig.2: The efficiency of iLLaVA

Fig.3: The generalizability of iLLaVA

Fig.4: The visualization of iLLaVA

Scheduled Updates🔥

    • Setup
    • Inference and Evaluation
    • Visualizations
    • Supporting both image and video benchmarks

🧨Setup

conda create -n illava python=3.10
conda activate illava
bash setup.sh

Note that you should install numpy 1.x instead of numpy 2.x.
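If numpy 2.x is already present in the environment, one way to downgrade it (a minimal sketch relying only on pip's standard version specifiers) is:

pip install "numpy<2"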

🎈Inference

This repo provides the inference code for iLLaVA based on LLaVA-OneVision.

  1. You should manually download the pretrained weights for LLaVA-OneVision (e.g., LLaVA-OneVision 7B) from here, or run the following command to download them:
pip install -U huggingface_hub
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir /path_to_your_dir --local-dir-use-symlinks False

For users who cannot access Hugging Face directly (e.g., in China), run the following command instead:

pip install -U huggingface_hub
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir /path_to_your_dir --local-dir-use-symlinks False
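If you prefer to keep the mirror setting for the whole shell session rather than prefixing every command, you can export it once and then run the same download command:

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir /path_to_your_dir --local-dir-use-symlinks False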

We use different iLLaVA settings for image benchmarks and video benchmarks. Specifically, we reduce 364 tokens in the image encoder, merged in layers [5,6,7,8] for image benchmarks and in layers [3,4,5,6] for video benchmarks. We reduce about 30% of the tokens in the language model, merged in layer 8 for image benchmarks and in layer 2 for video benchmarks. We observe more feature redundancy in videos, which allows more aggressive token merging.
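As a quick reference, the iLLaVA-specific portion of the model_args used in the commands below can be summarized as follows (the shell variable names are only for illustration and are not read by the code):

# Image benchmarks: ViT layers 5-8, LLM layer 8 (illava_llm_r=0.70 corresponds to the ~30% LLM token reduction mentioned above)
ILLAVA_IMAGE_ARGS="enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70"
# Video benchmarks: ViT layers 3-6, LLM layer 2
ILLAVA_VIDEO_ARGS="enable_illava_vit=True,illava_vit_k=3-4-5-6,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=2,illava_llm_r=0.70"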

Single-image benchmarks and Multi-image benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs

Video benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,max_frames_num=32,enable_illava_vit=True,illava_vit_k=3-4-5-6,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=2,illava_llm_r=0.70 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./log

Replace /path_to_your_checkpoint with the path to your downloaded LLaVA-OneVision pretrained weights, and set your_benchmark to your target benchmark, which can be any of the supported tasks of lmms-eval.
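For instance, a concrete single-image run on the MME benchmark (assuming the mme task is available in your lmms-eval installation) would be:

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70 --task mme --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs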

If you have difficulty accessing https://huggingface.co/ (e.g., in China), prepend HF_ENDPOINT=https://hf-mirror.com to your command.

The log files are saved to the directory given by --output_path (e.g., ./logs).

✨Visualization: the token merging process

Single-image benchmarks and Multi-image benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,illava_track_vit_source=True,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70,illava_track_llm_source=True --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs

Video benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,max_frames_num=32,enable_illava_vit=True,illava_vit_k=2-3-4-5-6-7-8,illava_vit_r=80,illava_track_vit_source=True,enable_illava_llm=True,illava_llm_k=2,illava_llm_r=0.50,illava_track_llm_source=True,mm_spatial_pool_stride=1 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./log

Here we use a more aggressive merging schedule for video benchmarks to produce clearer visualizations. You may adjust the hyper-parameters as needed.

Token merging visualizations for the different layers are saved in the current directory as attention_map_vit_layer_{remained_token_num}.jpg for the ViT stage and attention_map_llm_layer_{remained_token_num}.jpg for the LLM stage.

Note that the visualizations for images may not be fully spatially aligned, due to the image_newline parameter of LLaVA-OneVision.

🍕Inference with one input

We provide run_inference_once.py so users can run iLLaVA on a single specified input. Accepted inputs include a single image, multiple images, or a video.

The parameters you need to specify in the command include:

  • model_path, which indicates the path to the pretrained model.
  • input_path, which can be the path to an image file, a directory containing multiple images, or the path to a video file.
  • question, which is the question posed by the user. Words should be joined with - or _ rather than spaces so that the value is parsed as a single command-line argument; for example, the default is describe_the_input.

For other parameters, please refer to run_inference_once.py.
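For example, a question such as "What is shown in the input?" would be passed with underscores, following the same convention as the default question (the iLLaVA merging flags are omitted here for brevity and can be added as in the examples below):

python run_inference_once.py --model_path /path_to_your_checkpoint --question what_is_shown_in_the_input --input_path /path_to_your_image/xxx.jpg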

Example: inputting a single image

python run_inference_once.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_image/xxx.jpg

Example: inputting multiple images

python run_inference_once.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_images

Example: inputting a video

python run_inference_once.py --enable_illava_vit True --illava_vit_k 3-4-5-6 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 2 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_video/xxx.mp4

You can set --max_frames_num to change the number of input frames sampled from the video (e.g., --max_frames_num 32).
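For example, to sample 64 frames instead (a simple variation of the video command above; more frames require more GPU memory):

python run_inference_once.py --enable_illava_vit True --illava_vit_k 3-4-5-6 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 2 --illava_llm_r 0.70 --max_frames_num 64 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_video/xxx.mp4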

Visualization

You can add --illava_track_vit_source True --illava_track_llm_source True to the command to enable visualization.

For an image/images, we recommend using the following command:

python run_inference_once.py --enable_illava_vit True --illava_vit_k 2-3-4-5-6-7-8 --illava_vit_r 80 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.50 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_image.jpg

For videos, we recommend using mm_spatial_pool_stride=1 and larger merging steps to enable better visualization.

python run_inference_once.py --enable_illava_vit True --illava_vit_k 2-3-4-5-6-7-8 --illava_vit_r 80 --enable_illava_llm True --illava_llm_k 2 --illava_llm_r 0.50 --model_path /path_to_your_checkpoint --question describe_the_input --mm_spatial_pool_stride 1 --input_path /path_to_your_video/xxx.mp4

🎄Demo

We provide an offline demo to help users deploy iLLaVA on their local machines. It supports inputting a single image, multiple images, or a video, and returns the outputs from iLLaVA.

The command is shown as follows:

python demo.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint

Below is the visualization for our demo.

Upload an image, multiple images or a video and enter a prompt to get the outputs from iLLaVA

Fig.5: The visualization of our demo

🎫Model hyper-parameters

Besides the original parameters of LLaVA-OneVision, we introduce several new parameters:

  • enable_illava_vit[bool], whether to enable iLLaVA in the ViT stage. Default: False.
  • illava_vit_k[str], the layers in which tokens are merged in the ViT stage. For example, 2-3-4-5 indicates layers [2,3,4,5]. Default: None.
  • illava_vit_r[int], the number of tokens merged in each specified layer of the ViT stage. The overall number of merged tokens is the number of layers in illava_vit_k multiplied by illava_vit_r. Default: 0.
  • enable_illava_llm[bool], whether to enable iLLaVA in the LLM stage. Default: False.
  • illava_llm_k[str], the layers in which tokens are merged in the LLM stage. For example, 2 indicates layer [2]. Default: None.
  • illava_llm_r[float], the ratio of tokens kept after merging in each specified layer of the LLM stage; for example, 0.70 keeps about 70% of the tokens (i.e., roughly the 30% reduction described above). Default: 0.
  • illava_llm_image_token_start_index[int], the starting index of image tokens in the language model, which is predefined according to the system prompts. Default: 14.
  • illava_track_vit_source[bool], whether to visualize the token merging process in the ViT stage. Default: False.
  • illava_track_llm_source[bool], whether to visualize the token merging process in the LLM stage. Default: False.

You can set the corresponding parameters in the model_args of the command, as shown in the inference section.
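For instance, to enable merging only in the ViT stage for an image benchmark (a sketch assembled from the parameters above; the remaining model_args follow the inference commands):

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs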

🛒Model implementation

We mainly modify the following files to implement the different functions:

🎁Acknowledgements

Thanks to FastV and FreeVideoLLM for their open-source code.