
Legacy processing is triggered even when only pure text is input in llava without img_token #35526

Closed
wants to merge 3 commits

Conversation

jp1924 (Contributor) commented on Jan 6, 2025

What does this PR do?

When training LLaVA, vision instruction data is sometimes mixed with text-only instruction data. However, an error occurs when a batch contains only text instruction data (no image tokens):

Traceback (most recent call last):
  File "/root/llava.py", line 13, in <module>
    model(**outputs)
  File "/root/transformers/src/transformers/models/vipllava/modeling_vipllava.py", line 483, in forward
    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
  File "/root/transformers/src/transformers/models/vipllava/modeling_vipllava.py", line 299, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
AttributeError: 'NoneType' object has no attribute 'shape'

This happens because legacy_processing evaluates to True even for text-only inputs, so the model attempts to merge image features that are None. If we add a check on image_features at that point, the model handles text-only batches smoothly, as sketched below.
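To illustrate the idea, here is a minimal, self-contained sketch of that kind of guard. It is not the actual transformers diff: the function names echo `_merge_input_ids_with_image_features` from the traceback above, but the signatures and bodies are simplified stand-ins I made up for this example.

```python
import numpy as np

# Simplified stand-in for the merge step from the traceback above:
# it assumes image_features is an array, so it crashes when it is None.
def merge_input_ids_with_image_features(image_features, inputs_embeds):
    num_images, num_image_patches, embed_dim = image_features.shape
    return inputs_embeds  # real merging logic omitted

# The proposed idea: only take the legacy merge path when image features exist.
def forward(inputs_embeds, image_features=None, legacy_processing=True):
    if legacy_processing and image_features is not None:
        inputs_embeds = merge_input_ids_with_image_features(image_features, inputs_embeds)
    return inputs_embeds

text_embeds = np.zeros((2, 16, 8))            # a text-only batch
forward(text_embeds)                          # no crash: merge is skipped
forward(text_embeds, np.zeros((1, 576, 8)))   # with images: merge runs as before
```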

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

jp1924 (Contributor, Author) commented on Jan 6, 2025

Reproduction code:

from transformers import AutoModelForImageTextToText, AutoProcessor


name = "llava-hf/vip-llava-7b-hf"
model, processor = AutoModelForImageTextToText.from_pretrained(name), AutoProcessor.from_pretrained(name)

# Text-only prompts: neither contains an image token, so no pixel_values are produced.
prompts = [
    "USER: What are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    "USER: What is this? ASSISTANT:",
]

outputs = processor(prompts, return_tensors="pt", padding=True, truncation=True)
model(**outputs)  # raises AttributeError: 'NoneType' object has no attribute 'shape'

Environment

  • huggingface_hub version: 0.26.2
  • Platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Running in Google Colab Enterprise ?: No
  • Token path ?: /root/.cache/huggingface/token
  • Has saved token ?: True
  • Who am I ?: jp1924
  • Configured git credential helpers:
  • FastAI: N/A
  • Tensorflow: 2.15.1
  • Torch: 2.5.1+cu121
  • Jinja2: 3.1.4
  • Graphviz: N/A
  • keras: 2.15.0
  • Pydot: N/A
  • Pillow: 11.0.0
  • hf_transfer: N/A
  • gradio: N/A
  • tensorboard: 2.6.2.2
  • numpy: 1.26.4
  • pydantic: 2.9.2
  • aiohttp: 3.11.6
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /root/.cache/huggingface/hub
  • HF_ASSETS_CACHE: /root/.cache/huggingface/assets
  • HF_TOKEN_PATH: /root/.cache/huggingface/token
  • HF_STORED_TOKENS_PATH: /root/.cache/huggingface/stored_tokens
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10

zucchini-nlp (Member) commented:

This will be fixed soon by #34502. The issue was caused by incorrect indentation when processing image features; that block should be nested one indent level deeper.
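To make the indentation point concrete, here is a toy, self-contained illustration of the control flow. This is not the real modeling_vipllava.py code; `forward_buggy` and `forward_fixed` are hypothetical names, and the list arithmetic is just a stand-in for the vision tower and merge step.

```python
def forward_buggy(pixel_values, text_embeds):
    image_features = None
    if pixel_values is not None:
        image_features = [v + 1 for v in pixel_values]  # stand-in for the vision tower
    # merge sits outside the branch, so it also runs for text-only batches
    return text_embeds + image_features                  # fails when image_features is None

def forward_fixed(pixel_values, text_embeds):
    image_features = None
    if pixel_values is not None:
        image_features = [v + 1 for v in pixel_values]
        # one more indent: merging only happens when pixel_values were passed
        return text_embeds + image_features
    return text_embeds

forward_fixed(None, [0.1, 0.2])     # text-only input: embeddings returned unchanged
forward_fixed([1, 2], [0.1, 0.2])   # image input: features still merged
```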

jp1924 (Contributor, Author) commented on Jan 7, 2025

Ah, you were already working on it. I'll modify the code and keep working on it until it's resolved. Thanks!
