video_inputs are not passed to perception_lm #40004

Description

System Info

  • transformers version: 4.55.0
  • Platform: Linux-6.14.0-27-generic-x86_64-with-glibc2.39
  • Python version: 3.11.9
  • Huggingface_hub version: 0.34.3
  • Safetensors version: 0.4.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: 0.17.1
  • PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 4090

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-1B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

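# Download a short test clip from the Hub (the one used in the official PLM video example).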
video_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_videos",
    filename="GUWR5TyiY-M_000012_000022.mp4",
    repo_type="dataset",
)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": video_file,
            },
            {"type": "text", "text": "Can you describe the video in detail?"},
        ],
    }
]
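# Build model inputs from the chat template. With a video in the conversation,
# this should also return pixel_values_videos alongside the token ids.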
inputs = processor.apply_chat_template(
    [conversation],
    num_frames=32,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_load_backend="decord",
)
inputs = inputs.to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=256)
input_length = inputs["input_ids"].shape[1]
generate_ids_without_inputs = generate_ids[:, input_length:]

for output in processor.batch_decode(
    generate_ids_without_inputs, skip_special_tokens=True
):
    print(output)

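# Sanity check: on 4.55.0 this line fails, because pixel_values_videos
# is missing from inputs.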
print(inputs.pixel_values_videos)

Expected behavior

The error occurs in the official example for using PLM with video inputs. The generated output has no correlation to the input, because the pixel_values_videos key is missing from inputs. I believe this regression was introduced in this commit, when videos_inputs was removed from the BatchFeature.
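
To illustrate the suspected mechanism (a sketch based only on the description above, not on the actual diff; the helper names below are hypothetical and do not exist in transformers):

from transformers import BatchFeature

def merge_before(text_inputs: dict, videos_inputs: dict) -> BatchFeature:
    # Suspected pre-commit behavior: video tensors are merged into the
    # batch, so pixel_values_videos is present in the processor output.
    return BatchFeature(data={**text_inputs, **videos_inputs})

def merge_after(text_inputs: dict, videos_inputs: dict) -> BatchFeature:
    # Suspected post-commit behavior: videos_inputs is dropped, so
    # pixel_values_videos never reaches model.generate().
    return BatchFeature(data={**text_inputs})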

When a video is passed in the conversation, inputs contains only these keys:
{'input_ids', 'attention_mask'}
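
A possible workaround until this is fixed, sketched under the assumption that the processor's __call__ accepts a videos kwarg the way other video-text processors in transformers do (not verified for this release): render the template to text, sample frames yourself, and call the processor directly.

from decord import VideoReader
import numpy as np

# Render the chat template to a prompt string instead of tokenizing it.
prompt = processor.apply_chat_template(
    [conversation], add_generation_prompt=True, tokenize=False
)

# Sample 32 frames uniformly with decord.
vr = VideoReader(video_file)
indices = np.linspace(0, len(vr) - 1, num=32).astype(int)
frames = vr.get_batch(indices).asnumpy()  # (32, H, W, 3) uint8

# Call the processor directly; `videos` is the assumed kwarg here.
inputs = processor(text=prompt, videos=[frames], return_tensors="pt").to(model.device)
assert "pixel_values_videos" in inputs  # should now be present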
