
[Fuyu] Replace it to BatchFeature #27109

Closed
wants to merge 1 commit

Conversation

younesbelkada
Contributor

What does this PR do?

Right now users need to manually loop over FuyuProcessor's output and call to on each element. One should instead use BatchFeature from image_processing_utils and call to directly on the processed elements.

Before this PR, to run inference with a 4-bit model, users needed to do:

model_inputs = processor(text=text_prompt, images=raw_image)
for k, v in model_inputs.items():
    if v.dtype != torch.long:
        v = v.to(torch.float16)
    model_inputs[k] = v.to("cuda")

Now they just have to do:

model_inputs = processor(text=text_prompt, images=raw_image, return_tensors="pt").to("cuda", torch.float16)
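
For context, the one-liner matches the old loop because the dtype argument is only applied to floating-point tensors, so input_ids keeps its torch.long dtype. A simplified sketch of that logic (roughly what the BatchFeature to call does, not the exact implementation):

import torch

def to_device_and_dtype(data, device, dtype):
    # Simplified sketch: cast only floating-point tensors to the requested
    # dtype, then move every tensor to the target device. Integer tensors
    # such as input_ids keep their original dtype (torch.long).
    new_data = {}
    for key, value in data.items():
        if torch.is_floating_point(value):
            new_data[key] = value.to(device, dtype)
        else:
            new_data[key] = value.to(device)
    return new_data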

cc @ArthurZucker

Script to run the model in 4-bit:

import torch
from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image
import requests

# load model and processor
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0", load_in_4bit=True)

# prepare inputs for the model
text_prompt = "Generate a coco-style caption.\n"
img_url = 'https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(text=text_prompt, images=raw_image, return_tensors="pt").to("cuda", torch.float16)

# autoregressively generate text
generation_output = model.generate(**inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks! Related to #27007 but not addressed in it. cc @amyeroberts, let's merge this for now just to be sure we have it.

@younesbelkada
Contributor Author

Thanks!
I just copied over the logic that is in place in BLIP - https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/processing_blip.py#L129 (the type hint there is wrong BTW, it returns a BatchFeature). Per my understanding, processors that take both text and image inputs use BatchFeature. Let me know if another approach is preferred, @amyeroberts.
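
For reference, a minimal sketch of that BLIP-style pattern (names simplified, not the actual processing_blip code): tokenize the text, run the image processor, and wrap everything in a single BatchFeature so callers can move the whole output at once.

from transformers import BatchFeature

def process(tokenizer, image_processor, text, images, return_tensors=None):
    # Sketch of the BLIP-style pattern: merge text and image features into
    # one BatchFeature so that .to(device, dtype) works on the whole output.
    text_inputs = tokenizer(text, return_tensors=return_tensors)
    image_inputs = image_processor(images, return_tensors=return_tensors)
    return BatchFeature(data={**text_inputs, **image_inputs})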

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Oct 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts
Collaborator

@younesbelkada Thanks for addressing this!

If it's OK with you - can we hold off on this for a day or two? I'm currently working on refactoring the image processing and processing code for Fuyu and this will be addressed there too :)

If you look at #27007 - you'll see that there's a custom BatchEncoding class added (it should actually be a BatchFeature class because there are float tensors). This is to address the atypical data structure that the processor class is returning - lists of lists instead of tensors. This is because each sample in a minibatch can have a variable number of images. There's an internal discussion on Slack asking how we should represent the input/output data to reflect this. At the moment, we can wrap with BatchFeature as done in this PR, but I'm not certain it extends to batch sizes of more than 1.
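
To make the batching concern concrete, here is a hypothetical sketch (the key name and shapes are illustrative, not the actual Fuyu output):

import torch

# Hypothetical batched output for batch size 2: each sample can carry a
# different number of images, so "image_patches" is a nested Python list of
# tensors rather than one stacked tensor.
model_inputs = {
    "input_ids": torch.tensor([[1, 2, 3], [4, 5, 6]]),
    "image_patches": [
        [torch.rand(3, 30, 30)],                         # sample 0: one image
        [torch.rand(3, 30, 30), torch.rand(3, 30, 30)],  # sample 1: two images
    ],
}

# A per-key tensor .to() sees a plain tensor for input_ids but a Python list
# for image_patches, so the nested image tensors would need extra handling
# (recursing into the lists) to be cast and moved.
for key, value in model_inputs.items():
    print(key, type(value))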

@amyeroberts
Collaborator

If it's blocking, then we can merge this and I can rebase the changes into my working branch.

@younesbelkada
Contributor Author

Thanks @amyeroberts, your explanation makes sense to me! I was not aware of #27007, and it is great that this issue is being addressed there.
Definitely OK for me to wait a bit before this gets merged! I just wanted to make sure users have a consistent API for multimodal models for the next release (i.e. avoid looping over the processor outputs). Perhaps if #27007 is not ready for the release we can merge this PR first, what do you think?

@younesbelkada
Contributor Author

Closing this PR as #27007 is going to be merged
