[Feature]: Quantization support for LLaVA OneVision #9324

salvaba94 · 2024-10-13T08:46:05Z

🚀 The feature, motivation and pitch

I'm working on applications that must run locally in resource-limited HW. Threrefore, quantization becomes essential. Such applications need from multimodal video-text processing. The candidate model in question is LLaVA OneVision. However, it does not support BitsAndBytes quantization yet.

Model
LLaVA-OneVision
https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf

Challenges
AFAIK Siglip, the multimodal projector and Qwen2 need from quantization support. Perhaps it is also useful to enable quantization per module to quantize only the language part.

Alternatives

No response

Additional context

Trying to load a pre-quantized LLaVA OneVision model into vLLM throws:

AttributeError: Model LlavaOnevisionForConditionalGeneration does not support BitsAndBytes quantization yet.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

DarkLight1337 · 2024-10-13T13:43:36Z

Which layers does it quantize specifically? Can you suggest what we need to set in bitsandbytes_stacked_params_mapping to get this to work? cc @mgoin

salvaba94 · 2024-10-15T18:25:07Z

From what I've been looking into, from the vision tower, the only thing it does not quantize are the embeddings. It quantizes all the SigLip self-attention and the MLP blocks. The multimodal projector linear layers are totally quantized. And for the Qwen2 language model, everything except for the head is quantized. This again includes self-attention and MLP. See below the quantized model:

LlavaOnevisionForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(729, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-25): 26 x SiglipEncoderLayer(
            (self_attn): SiglipFlashAttention2(
              (k_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear4bit(in_features=1152, out_features=4304, bias=True)
              (fc2): Linear4bit(in_features=4304, out_features=1152, bias=True)
            )
            (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
    )
  )
  (multi_modal_projector): LlavaOnevisionMultiModalProjector(
    (linear_1): Linear4bit(in_features=1152, out_features=896, bias=True)
    (act): GELUActivation()
    (linear_2): Linear4bit(in_features=896, out_features=896, bias=True)
  )
  (language_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152000, 896)
      (layers): ModuleList(
        (0-23): 24 x Qwen2DecoderLayer(
          (self_attn): Qwen2FlashAttention2(
            (q_proj): Linear4bit(in_features=896, out_features=896, bias=True)
            (k_proj): Linear4bit(in_features=896, out_features=128, bias=True)
            (v_proj): Linear4bit(in_features=896, out_features=128, bias=True)
            (o_proj): Linear4bit(in_features=896, out_features=896, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
            (up_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
            (down_proj): Linear4bit(in_features=4864, out_features=896, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((896,), eps=1e-06)
      (rotary_emb): Qwen2RotaryEmbedding()
    )
    (lm_head): Linear(in_features=896, out_features=152000, bias=False)
  )
)

Tweaking some arguments of the BitsAndBytesConfig structure, quantization of the vision encoder can also be skipped (the vision_tower in the above model description).

I'm not very familiar with what the bitsandbytes_stacked_params_mapping means to be. With some initial guidance I think I can try to get it working. A first approach would be to get the Qwen2 and SigLips models working separately.

github-actions · 2025-01-14T01:56:35Z

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions · 2025-02-13T01:59:30Z

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

salvaba94 added the feature request label Oct 13, 2024

DarkLight1337 changed the title ~~[Feature]: LLaVA OneVision~~ [Feature]: Quantization support for LLaVA OneVision Oct 13, 2024

DarkLight1337 mentioned this issue Oct 18, 2024

[RFC]: Multi-modality Support on vLLM #4194

Open

91 tasks

DarkLight1337 mentioned this issue Oct 29, 2024

[Bugfix] Fix prefix strings for quantized VLMs #9772

Merged

github-actions bot added the stale label Jan 14, 2025

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: Quantization support for LLaVA OneVision #9324

[Feature]: Quantization support for LLaVA OneVision #9324

salvaba94 commented Oct 13, 2024

DarkLight1337 commented Oct 13, 2024

salvaba94 commented Oct 15, 2024

github-actions bot commented Jan 14, 2025

github-actions bot commented Feb 13, 2025

[Feature]: Quantization support for LLaVA OneVision #9324

[Feature]: Quantization support for LLaVA OneVision #9324

Comments

salvaba94 commented Oct 13, 2024

🚀 The feature, motivation and pitch

Alternatives

Additional context

Before submitting a new issue...

DarkLight1337 commented Oct 13, 2024

salvaba94 commented Oct 15, 2024

github-actions bot commented Jan 14, 2025

github-actions bot commented Feb 13, 2025