[Feature]: Quantization support for LLaVA OneVision #9324

Closed
1 task done
salvaba94 opened this issue Oct 13, 2024 · 4 comments

@salvaba94

🚀 The feature, motivation and pitch

I'm working on applications that must run locally on resource-limited hardware, so quantization becomes essential. These applications require multimodal video-text processing, and the candidate model is LLaVA OneVision. However, it does not support BitsAndBytes quantization in vLLM yet.

Model
LLaVA-OneVision
https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf

Challenges
AFAIK SigLIP, the multimodal projector and Qwen2 all need quantization support. Perhaps it would also be useful to enable per-module quantization, so that only the language part is quantized.
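
For reference, this is roughly how a pre-quantized checkpoint can be produced on the transformers side with bitsandbytes (a sketch; the NF4 quant type and compute dtype are assumptions, not requirements):

```python
import torch
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

# Sketch: 4-bit quantization of the whole model on load.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
print(model)  # prints the quantized module tree
```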

Alternatives

No response

Additional context

Trying to load a pre-quantized LLaVA OneVision model into vLLM throws:

AttributeError: Model LlavaOnevisionForConditionalGeneration does not support BitsAndBytes quantization yet.
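
For reference, the load attempt was along these lines (a sketch; the checkpoint path is a hypothetical placeholder and the exact arguments may differ):

```python
from vllm import LLM

# Sketch only: load a locally pre-quantized (bitsandbytes 4-bit) checkpoint.
llm = LLM(
    model="./llava-onevision-qwen2-7b-ov-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```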

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337 changed the title from "[Feature]: LLaVA OneVision" to "[Feature]: Quantization support for LLaVA OneVision" on Oct 13, 2024
@DarkLight1337
Member

Which layers does it quantize specifically? Can you suggest what we need to set in bitsandbytes_stacked_params_mapping to get this to work? cc @mgoin

@salvaba94
Author

From what I've been looking into, the only part of the vision tower that is not quantized is the embeddings; all the SigLIP self-attention and MLP blocks are quantized. The multimodal projector linear layers are fully quantized. For the Qwen2 language model, everything except the LM head is quantized, again including self-attention and MLP. See the quantized model below:

LlavaOnevisionForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(729, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-25): 26 x SiglipEncoderLayer(
            (self_attn): SiglipFlashAttention2(
              (k_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear4bit(in_features=1152, out_features=4304, bias=True)
              (fc2): Linear4bit(in_features=4304, out_features=1152, bias=True)
            )
            (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
    )
  )
  (multi_modal_projector): LlavaOnevisionMultiModalProjector(
    (linear_1): Linear4bit(in_features=1152, out_features=896, bias=True)
    (act): GELUActivation()
    (linear_2): Linear4bit(in_features=896, out_features=896, bias=True)
  )
  (language_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152000, 896)
      (layers): ModuleList(
        (0-23): 24 x Qwen2DecoderLayer(
          (self_attn): Qwen2FlashAttention2(
            (q_proj): Linear4bit(in_features=896, out_features=896, bias=True)
            (k_proj): Linear4bit(in_features=896, out_features=128, bias=True)
            (v_proj): Linear4bit(in_features=896, out_features=128, bias=True)
            (o_proj): Linear4bit(in_features=896, out_features=896, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
            (up_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
            (down_proj): Linear4bit(in_features=4864, out_features=896, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((896,), eps=1e-06)
      (rotary_emb): Qwen2RotaryEmbedding()
    )
    (lm_head): Linear(in_features=896, out_features=152000, bias=False)
  )
)

By tweaking some arguments of BitsAndBytesConfig, quantization of the vision encoder (the vision_tower in the model description above) can also be skipped.
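
For example, something along these lines (a sketch; note that llm_int8_skip_modules also applies to 4-bit loading despite its name, and the module name is taken from the dump above):

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch: same 4-bit load as before, but keep the vision encoder unquantized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["vision_tower"],
)
```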

I'm not very familiar with what bitsandbytes_stacked_params_mapping is meant to contain. With some initial guidance, I think I can try to get it working. A first approach would be to get the Qwen2 and SigLIP models working separately.
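
For what it's worth, other vLLM model classes define bitsandbytes_stacked_params_mapping to tell the BitsAndBytes loader how per-shard Hugging Face weights map onto vLLM's fused linear layers. A rough guess for the Qwen2 part of this model, modelled on those existing definitions (names and indices are assumptions to be checked against the vLLM source):

```python
# Hypothetical sketch, assuming the language model uses the same fused
# qkv_proj / gate_up_proj layers as vLLM's other Qwen2-based models.
bitsandbytes_stacked_params_mapping = {
    # HF shard name: (vLLM fused parameter name, shard index)
    "q_proj": ("qkv_proj", 0),
    "k_proj": ("qkv_proj", 1),
    "v_proj": ("qkv_proj", 2),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}
```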


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Jan 14, 2025

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned on Feb 13, 2025