System Info
- transformers version: 4.55.0
- Platform: Linux-6.14.0-27-generic-x86_64-with-glibc2.39
- Python version: 3.11.9
- Huggingface_hub version: 0.34.3
- Safetensors version: 0.4.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: 0.17.1
- PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA GeForce RTX 4090
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-1B"

processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

# Download the short test clip used by the official PLM example.
video_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_videos",
    filename="GUWR5TyiY-M_000012_000022.mp4",
    repo_type="dataset",
)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": video_file,
            },
            {"type": "text", "text": "Can you describe the video in detail?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    num_frames=32,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_load_backend="decord",
)
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only newly generated text is decoded.
input_length = inputs["input_ids"].shape[1]
generate_ids_without_inputs = generate_ids[:, input_length:]

for output in processor.batch_decode(
    generate_ids_without_inputs, skip_special_tokens=True
):
    print(output)

# Fails on 4.55.0: `pixel_values_videos` is missing from `inputs`.
print(inputs.pixel_values_videos)
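
For reference, inspecting the returned keys (run before the failing print above) makes the regression explicit. The expected key set below is an assumption based on what the model needs to see the video; the observed set is what 4.55.0 actually returns:

# Expected keys: {'input_ids', 'attention_mask', 'pixel_values_videos'}
# Observed on transformers 4.55.0: {'input_ids', 'attention_mask'}
print(set(inputs.keys()))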
Expected behavior
The error occurs in the official example for using PLM with video inputs. The generated output has no correlation to the input, because the pixel_values_videos key is missing from inputs. I believe this regression was introduced in this commit, when videos_inputs was dropped from the returned BatchFeature.
When a video is passed in the conversation, inputs contains only these keys:
{'input_ids', 'attention_mask'}
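
Until this is fixed, a possible workaround is to bypass the tokenize=True chat-template path: decode the frames manually and call the processor directly. This is only a sketch, not a verified fix; it assumes decord is installed and that the PLM processor accepts pre-decoded frames via the videos keyword, as other video processors in transformers do. The uniform 32-frame sampling mirrors num_frames=32 in the reproduction above.

import numpy as np
from decord import VideoReader

# Decode 32 uniformly spaced frames from the clip.
vr = VideoReader(video_file)
indices = np.linspace(0, len(vr) - 1, num=32).astype(int)
frames = vr.get_batch(indices).asnumpy()  # shape: (32, H, W, 3)

# Render the chat template to a prompt string only (no tokenization),
# then let the processor handle text and video together in one call.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, videos=[frames], return_tensors="pt").to(model.device)

With inputs built this way, pixel_values_videos should be present and the rest of the generation code above can run unchanged.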