iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

iLLaVA is an efficient method for large vision-language models that merges visual tokens. It delivers higher throughput and a 1.7×-2× memory reduction with comparable performance by merging redundant visual tokens in selected layers.

Fig.1: The framework of iLLaVA

Fig.2: The efficiency of iLLaVA

Fig.3: The generalizability of iLLaVA

Fig.4: The visualization of iLLaVA

Scheduled Updates🔥

    • Setup
    • Inference and Evaluation
    • Visualizations
    • Supporting both image and video benchmarks

🧨Setup

conda create -n illava python=3.10
conda activate illava
bash setup.sh

Note that you should install numpy 1.x rather than numpy 2.x.
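As a quick sanity check (a minimal sketch, not part of this repo), you can verify that the installed numpy is 1.x:

import numpy as np

# iLLaVA's dependencies assume numpy 1.x; numpy 2.x may cause import errors.
major = int(np.__version__.split(".")[0])
assert major == 1, f"Found numpy {np.__version__}; please reinstall with numpy<2"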

🎈Inference

This repo provides the inference code for iLLaVA based on LLaVA-OneVision.

  1. Manually download the pretrained weights for LLaVA-OneVision (e.g., LLaVA-OneVision 7B) from here, or run the following command to download them:
pip install -U huggingface_hub
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir /path_to_your_dir --local-dir-use-symlinks False

For users who cannot access Hugging Face (e.g., in China), run the following command instead:

pip install -U huggingface_hub
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir /path_to_your_dir --local-dir-use-symlinks False
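Alternatively, a minimal sketch of the same download using the huggingface_hub Python API instead of the CLI (set the HF_ENDPOINT environment variable to https://hf-mirror.com first if huggingface.co is unreachable):

from huggingface_hub import snapshot_download

# Download the LLaVA-OneVision 7B weights to a local directory.
# Replace /path_to_your_dir with your target directory.
snapshot_download(
    repo_id="lmms-lab/llava-onevision-qwen2-7b-ov",
    local_dir="/path_to_your_dir",
)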

We use different settings for the image and video benchmarks of iLLaVA. Specifically, we reduce 364 tokens in the image encoder, merged in layers [5,6,7,8] for image benchmarks and in layers [3,4,5,6] for video benchmarks. We reduce about 30% of the tokens in the language model, merged in layer 8 for image benchmarks and in layer 2 for video benchmarks. We observe that videos contain more feature redundancy, which allows more aggressive token merging.
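For reference, the two configurations described above can be summarized as follows (a hypothetical summary dictionary, not a file in this repo; the values mirror the commands below):

# Settings described above, mirroring the lmms-eval commands below.
ILLAVA_PRESETS = {
    "image": {"illava_vit_k": "5-6-7-8", "illava_vit_r": 92,
              "illava_llm_k": "8", "illava_llm_r": 0.70},
    "video": {"illava_vit_k": "3-4-5-6", "illava_vit_r": 92,
              "illava_llm_k": "2", "illava_llm_r": 0.70},
}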

Single-image benchmarks and Multi-image benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs

Video benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,max_frames_num=32,enable_illava_vit=True,illava_vit_k=3-4-5-6,illava_vit_r=92,enable_illava_llm=True,illava_llm_k=2,illava_llm_r=0.70 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./log

Replace /path_to_your_checkpoint with the path to your downloaded LLaVA-OneVision pretrained weights. Set your_benchmark to your target benchmark, which can be selected from the supported tasks of lmms-eval.

If you have difficulty accessing https://huggingface.co/ (e.g., in China), prepend HF_ENDPOINT=https://hf-mirror.com to your command.

The log files are saved in ./logs.
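To take a quick look at what was written (a minimal sketch; the exact directory layout depends on your lmms-eval version):

from pathlib import Path

# List the JSON result/sample files produced by lmms-eval under ./logs.
for path in sorted(Path("./logs").rglob("*.json")):
    print(path)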

✨Visualization: the token merging process

Single-image benchmarks and Multi-image benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,enable_illava_vit=True,illava_vit_k=5-6-7-8,illava_vit_r=92,illava_track_vit_source=True,enable_illava_llm=True,illava_llm_k=8,illava_llm_r=0.70,illava_track_llm_source=True --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./logs

Video benchmarks

lmms-eval --model llava_onevision_training_free --model_args pretrained=/path_to_your_checkpoint,conv_template=qwen_1_5,model_name=llava_qwen_training_free,device_map=auto,max_frames_num=32,enable_illava_vit=True,illava_vit_k=2-3-4-5-6-7-8,illava_vit_r=80,illava_track_vit_source=True,enable_illava_llm=True,illava_llm_k=2,illava_llm_r=0.50,illava_track_llm_source=True,mm_spatial_pool_stride=1 --task your_benchmark --batch_size 1 --log_samples --log_samples_suffix llava_onevision_7b --output_path ./log

Here we use a more aggressive merging schedule for video benchmarks to produce clearer visualizations. You may modify the hyper-parameters yourself.

Token-merging visualizations for the different layers are saved in the current directory as attention_map_vit_layer_{remained_token_num}.jpg for the ViT stage and attention_map_llm_layer_{remained_token_num}.jpg for the LLM stage.

Note that the visualizations for images may not be fully spatially aligned, due to the image_newline parameter of LLaVA-OneVision.
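For example, to collect the generated maps for inspection (a minimal sketch based on the naming pattern above):

import glob

# Gather the per-layer visualization images written to the current directory.
vit_maps = sorted(glob.glob("attention_map_vit_layer_*.jpg"))
llm_maps = sorted(glob.glob("attention_map_llm_layer_*.jpg"))
print("ViT-stage maps:", vit_maps)
print("LLM-stage maps:", llm_maps)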

🍕Inference with one input

We provide run_inference_once.py to help users run iLLaVA on a single input. Acceptable inputs include a single image, multiple images, or a video.

The parameters you need to specify in the command include:

  • model_path, the path to the pretrained model.
  • input_path, which can be the path to an image file, a directory containing multiple images, or the path to a video file.
  • question, the question posed by the user. Separate words with _ so the command is parsed correctly; for example, the default input is describe_the_input (see the sketch below).

For other parameters, refer to run_inference_once.py.
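A minimal sketch of the question format (a hypothetical helper, not part of run_inference_once.py): words are joined with underscores so the whole question is passed as a single CLI argument.

# Hypothetical helper illustrating the expected --question format.
def to_cli_question(question: str) -> str:
    """Join words with underscores, e.g. 'describe the input' -> 'describe_the_input'."""
    return "_".join(question.split())

print(to_cli_question("describe the input"))  # describe_the_input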

Example: inputting a single image

python run_inference_once.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_image/xxx.jpg

Example: inputting multiple images

python run_inference_once.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_images

Example: inputting a video

python run_inference_once.py --enable_illava_vit True --illava_vit_k 3-4-5-6 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 2 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_video/xxx.mp4

You can set --max_frames_num (e.g., 32) to use a different number of input frames.

Visualization

You can add --illava_track_vit_source True --illava_track_llm_source True to the command to enable visualization.

For an image or multiple images, we recommend the following command:

python run_inference_once.py --enable_illava_vit True --illava_vit_k 2-3-4-5-6-7-8 --illava_vit_r 80 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.50 --model_path /path_to_your_checkpoint --question describe_the_input --input_path /path_to_your_image.jpg

For videos, we recommend using mm_spatial_pool_stride=1 and larger merging steps to enable better visualization.

python run_inference_once.py --enable_illava_vit True --illava_vit_k 2-3-4-5-6-7-8 --illava_vit_r 80 --enable_illava_llm True --illava_llm_k 2 --illava_llm_r 0.50 --model_path /path_to_your_checkpoint --question describe_the_input --mm_spatial_pool_stride 1 --input_path /path_to_your_video/xxx.mp4

🎄Demo

We provide an offline demo to help users deploy iLLaVA on their local machines. It accepts a single image, multiple images, or a video as input and returns the outputs from iLLaVA.

The command is shown as follows:

python demo.py --enable_illava_vit True --illava_vit_k 5-6-7-8 --illava_vit_r 92 --enable_illava_llm True --illava_llm_k 8 --illava_llm_r 0.70 --model_path /path_to_your_checkpoint

Below is the visualization for our demo.

Upload an image, multiple images, or a video and enter a prompt to get the outputs from iLLaVA.

Fig.5: The visualization of our demo

🎫Model hyper-parameters

Besides the original parameters of LLaVA-OneVision, we introduce several new parameters:

  • enable_illava_vit[bool], whether to enable iLLaVA in the ViT stage. Default: False.
  • illava_vit_k[str], the layers in which tokens are merged in the ViT stage. For example, 2-3-4-5 denotes layers [2,3,4,5]. Default: None.
  • illava_vit_r[int], the number of tokens merged in each selected layer of the ViT stage. The overall number of merged tokens is the number of layers in illava_vit_k multiplied by illava_vit_r (see the sketch after this list). Default: 0.
  • enable_illava_llm[bool], whether to enable iLLaVA in the LLM stage. Default: False.
  • illava_llm_k[str], the layers in which tokens are merged in the LLM stage. For example, 2 denotes layer [2]. Default: None.
  • illava_llm_r[float], the token merging ratio applied in each selected layer of the LLM stage (e.g., 0.70 in the commands above). Default: 0.
  • illava_llm_image_token_start_index[int], the starting index of image tokens in the language model, predefined according to the system prompt. Default: 14.
  • illava_track_vit_source[bool], whether to record the token merging process in the ViT stage for visualization. Default: False.
  • illava_track_llm_source[bool], whether to record the token merging process in the LLM stage for visualization. Default: False.

You can set these parameters in model_args of the command, as shown in the inference section.
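As a rough illustration of the token budget implied by these settings (a minimal sketch assuming the product rule stated above; illustrative only, not part of the repo):

# Estimate of merged ViT tokens: (number of merging layers) x illava_vit_r.
def merged_vit_tokens(illava_vit_k: str, illava_vit_r: int) -> int:
    num_layers = len(illava_vit_k.split("-"))
    return num_layers * illava_vit_r

# Image-benchmark setting from the inference section: layers 5-8 with r=92.
print(merged_vit_tokens("5-6-7-8", 92))  # -> 368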

🛒Model implementation

We implement the different functions mainly by modifying a few files of the original code base.

🎁Acknowledgements

Thanks to FastV and FreeVideoLLM for their open-source code.
