Conversation

@shuminghu shuminghu commented Aug 6, 2025

Critical: fixes the missing video input for PerceptionLM (accidentally removed in a previous PR).

Minor: adds support for vanilla images that have only C, H, W dims, without the tiles dim. This is a non-default image shape for PLM, but it is useful in demos and on low-resource devices, e.g., in the just-added "PLM Simple Fine-tuning Example" under
https://huggingface.co/facebook/Perception-LM-1B#plm-usage

@shuminghu shuminghu marked this pull request as draft August 6, 2025 22:57
@shuminghu shuminghu marked this pull request as ready for review August 6, 2025 22:58
@github-actions github-actions bot requested review from molbap and yonigozlan August 6, 2025 22:59

@molbap molbap left a comment


LGTM for the fix, cc @zucchini-nlp who made the initial change!
For the non-standard image inputs, OK, but it would be better with an accompanying test.


@zucchini-nlp zucchini-nlp left a comment


Okay, thanks! I think we need to standardize the output shapes from the image processor so they are consistent, though.

Maybe we can always return 5D pixels, or already-flattened 4D pixels? Whichever way looks good; we currently have models doing both.

@zucchini-nlp zucchini-nlp added the for patch Tag issues / labels that should be included in the next patch label Aug 7, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

shuminghu commented Aug 7, 2025

@zucchini-nlp The reason shape unification is done in the model rather than in image_processing is that I noticed the model sees a different input shape in training than in eval/inference.

  • In training with per_device_train_batch_size=8, when the image processor outputs 4D (processed_images.shape: torch.Size([1, 3, 448, 448])), the model input is 5D (pixel_values.shape: torch.Size([8, 1, 3, 448, 448])). When the image processor output is 5D, the model input becomes 6D and errors out.
  • In inference, both are 5D: processed_images.shape: torch.Size([1, 37, 3, 448, 448]) and pixel_values.shape: torch.Size([1, 37, 3, 448, 448]).
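The model-side unification described above could be sketched roughly as follows (the helper name is hypothetical, not the actual PerceptionLM code): whatever shape the collate produced, normalize pixel_values to 5D (batch, tiles, C, H, W) before the vision tower.

```python
import torch

def unify_pixel_values(pixel_values: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: normalize pixel_values to 5D (batch, tiles, C, H, W)."""
    if pixel_values.dim() == 4:
        # Single un-batched image (tiles, C, H, W): add a batch dim.
        return pixel_values.unsqueeze(0)
    if pixel_values.dim() == 6:
        # Training collate stacked per-example 5D outputs: merge the
        # two leading dims back into one batch dim.
        b, n, tiles, c, h, w = pixel_values.shape
        return pixel_values.reshape(b * n, tiles, c, h, w)
    return pixel_values  # already 5D

# Training case from above: 8 stacked 5D processor outputs of (1, 37, 3, 448, 448)
stacked = torch.zeros(8, 1, 37, 3, 448, 448)
print(unify_pixel_values(stacked).shape)  # torch.Size([8, 37, 3, 448, 448])
```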

@shuminghu

@zucchini-nlp Let me split the PR and merge the more urgent fix first?


github-actions bot commented Aug 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: perception_lm


@zucchini-nlp zucchini-nlp left a comment


Yeah, let's merge this one first and add it to the next patch release.

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) August 7, 2025 15:42
@zucchini-nlp zucchini-nlp merged commit 27997ee into huggingface:main Aug 7, 2025
15 checks passed

shuminghu commented Aug 7, 2025

@zucchini-nlp My bad. I just realized this comes from the collate_fn in my training script (I added an extra dimension):

    pixel_values = torch.stack(
        [inst["pixel_values"] for inst in instances], dim=0
    )  # stack introduces a new leading dim on top of each example's shape

Let me open another PR with this simple fix for the image preprocessor and update the corresponding training script in the model card.
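One way to sketch the collate fix (a hedged example, not the exact training-script change): since each processor output already carries a leading dim, concatenating along dim 0 instead of stacking avoids introducing the sixth dimension.

```python
import torch

def collate_fn(instances):
    # Each inst["pixel_values"] already has a leading dim,
    # e.g. (1, 37, 3, 448, 448), so concatenate instead of stacking
    # to avoid adding an extra dimension.
    pixel_values = torch.cat(
        [inst["pixel_values"] for inst in instances], dim=0
    )
    return {"pixel_values": pixel_values}

batch = collate_fn(
    [{"pixel_values": torch.zeros(1, 37, 3, 448, 448)} for _ in range(8)]
)
print(batch["pixel_values"].shape)  # torch.Size([8, 37, 3, 448, 448])
```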


ArthurZucker pushed a commit that referenced this pull request Aug 13, 2025
* Fix missing video inputs for PerceptionLM.

* Minor fix for vanilla input image (only C,H,W, no tiles dim).

* Revert "Minor fix for vanilla input image (only C,H,W, no tiles dim)."

This reverts commit 181d87b.