Improve support for image generation with Chameleon & Anole #32013

Open · wants to merge 9 commits into base: main

Conversation

@leloykun leloykun commented Jul 17, 2024

What does this PR do?

  • Adds modelling for the VQVAE decoder & also includes it in the conversion script.
  • Adds support for decoding the BPE tokens -> discrete image tokens -> pixel values
  • Moves masking of image tokens in text-only generation mode to a LogitsProcessor (see the sketch after this list).
  • Adds masking of non-image tokens for image-only generation mode.
  • Reimplements Chameleon's finite-state machine (the FSM it uses to dynamically switch between text- and image-generation modes) as logits processors, making it more compatible with Transformers and Outlines (for structured generation). We can now support interleaved text-image generation natively.
    • This PR does not add the FSM itself; it just makes it easier for external libraries like Outlines & MMSG to integrate with Transformers and add the interleaved generation mode back in.
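
A minimal sketch of the masking idea behind these mode-specific logits processors (the class name and the way the image-token ids are obtained are placeholders; the processors actually added in this PR may differ in naming and details):

import torch
from transformers import LogitsProcessor


class SuppressImageTokensLogitsProcessor(LogitsProcessor):
    """Masks out image-token logits so that only text tokens can be sampled."""

    def __init__(self, image_token_ids):
        self.image_token_ids = list(image_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Assigning the smallest representable value gives the image tokens
        # effectively zero probability after the softmax.
        scores[:, self.image_token_ids] = torch.finfo(scores.dtype).min
        return scores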

Required TODOs:

  • Improve docs
    • Write docs for image-only generation with Chameleon/Anole
      • Provide minimal example on how to use
      • Remove the need for passing max_length or max_new_tokens on image-only generation mode
      • Force the model to generate at least one image in image-only generation mode. Chameleon doesn't officially support image generation yet, so it just immediately closes begin-image tokens with either an end-image token or an EOS token. Finetunes like Anole haven't fully removed this issue yet, so they occasionally still do that.
    • Write docs for interleaved text-image generation
      • Provide minimal example on how to use
      • Show how to split token sequences by modality in the example for interleaved text-image generation
    • Improve docs (as comments) for newly-added logits processors
    • Add sample usage for newly-added logits processors
  • Add tests
    • Add test for image postprocessing
    • Add tests for each generation mode
      • Add tests for text-only generation mode
      • Add tests for image-only generation mode
        • Note: don't rely on hashing pytorch tensors to compare arrays
        • Add test where max_new_tokens is unset on image-only generation mode
      • Add tests for interleaved-text-image generation mode
      • Add tests for unrestricted generation mode
      • Add tests for invalid generation mode
    • Add tests for VQVAE decoder
    • Add tests for each newly-added logits processors
    • Add tests for multi-GPU model sharding
  • Improve modelling of VQVAE encoder & decoder
  • VQVAE: dynamically compute quant_state_flattened_dims which scales with the resolution instead of hardcoding it in the configs
  • Improve postprocessing: only accept and return pytorch tensors
  • Logits processors
    • Add new logits processors to import structure

Optional TODOs or for future PRs:

  • Run a hyperparameter search for image generation
  • Fix bugs caused by sharding the model into multiple GPUs
  • Implement features that were in the Chameleon paper but are not crucial here
  • Refactor VQVAE (sub-)modules
    • Convert mid, down, and up blocks into explicit subclasses of nn.Module() (as suggested by @amyeroberts)
    • Make it clearer which attention types are allowed in the VQVAE (sub-)modules. Though Chameleon currently only supports "vanilla" attention.
  • Implement support for other finetunes of Chameleon

Links:

(partially) Implements # (issue)

@ArthurZucker @zucchini-nlp @JoyBoy-Su

@leloykun
Contributor Author

@zucchini-nlp @ArthurZucker this should now be ready for review

The test errors seem to be related to huggingface_hub & bert, and I'm not sure how they relate to this PR.

Member

@zucchini-nlp zucchini-nlp left a comment

Great job! Looks good to me in general; the only things left are to make generation happy by moving code to the correct location, and to check whether we can guide users through interleaved generation with an external FSM library.

Also, we need tests for the different generation modes, to make sure everything works correctly. This can be added as a slow IntegrationTest in tests/models/chameleon/test_modeling_chameleon.py

docs/source/en/model_doc/chameleon.md
src/transformers/image_transforms.py
src/transformers/models/chameleon/modeling_chameleon.py
Comment on lines 333 to 373
> Parameters specific to vision-language generation models such as [Chameleon](https://arxiv.org/abs/2405.09818v1)

multimodal_generation_mode (`Literal["text-only", "image-only", "interleaved-text-image", "free"]`, *optional*, defaults to `None`):
Chameleon can generate text, images, or both in an interleaved manner. However, only text generation is
supported by the official model checkpoint. This flag enables the other modes for use with finetuned versions
of the model such as [Anole](https://arxiv.org/abs/2407.06135).
- If set to `"text-only"`, logits for image tokens will be masked out during generation.
- If set to `"image-only"`, logits for non-image tokens will be masked out during generation.
- If set to `"free"`, the logits are left as-is.
- For `"interleaved-text-image"`, Chameleon implements a finite state machine to dynamically switch between text and image modalities.
This library does not support this mode yet.

Member

Sorry if I wasn't clear, I meant Chameleon's generation config that is on the hub. Adding args to the general generation config is not a good idea if it's going to be used by only one model.

Let's see how we can make generate() happy. We can:

  • When saving the model, add a field model.generation_config="text-only" by setting the default to text mode and saving it on the hub
  • Move Chameleon logits processor to the other processors, more comments below
  • Add a generate() method in the ConditionalGeneration module that does model-specific preparation; in our case it takes the generation mode and makes a LogitsProcessor out of it, then calls super().generate() with all kwargs (see the sketch below)
  • Optionally, take the generation output and run decode_tokens if in image mode. If you can make an example with an external FSM library for the interleaved mode, do the separation of image from text here. And return a custom GenerationDecoderOnlyOutput, which will have an extra field for pixel values

Usually I would find a way w/o custom generate, but Chameleon can be an exception given that it's the only model that generates images. Also, custom generate in this way is less prone to bugs from refactoring general generate(), as we only prepare and pass a model-specific processor.
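
A rough sketch of the custom generate() suggested above (illustrative only: the subclass name is hypothetical, vocabulary_mapping follows the code added in this PR, and SuppressTokensLogitsProcessor is the existing processor in transformers):

import torch
from transformers import (
    ChameleonForConditionalGeneration,
    LogitsProcessorList,
    SuppressTokensLogitsProcessor,
)


class ChameleonWithMultimodalGenerate(ChameleonForConditionalGeneration):
    @torch.no_grad()
    def generate(self, *args, multimodal_generation_mode="text-only", logits_processor=None, **kwargs):
        # Model-specific preparation: turn the generation mode into a logits processor.
        logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
        image_tokens = self.model.vocabulary_mapping.image_tokens
        if multimodal_generation_mode == "text-only":
            # Mask out image tokens so that only text can be sampled.
            logits_processor.append(SuppressTokensLogitsProcessor(image_tokens))
        elif multimodal_generation_mode == "image-only":
            # Mask out everything that is not an image token (marker/EOS handling omitted here).
            image_token_set = set(image_tokens)
            non_image = [t for t in range(self.config.vocab_size) if t not in image_token_set]
            logits_processor.append(SuppressTokensLogitsProcessor(non_image))
        # Defer to the generic generation loop with all remaining kwargs.
        return super().generate(*args, logits_processor=logits_processor, **kwargs)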

Contributor Author

Hi @zucchini-nlp!

I've:

  1. Implemented a new generation config for Chameleon
  2. Modified the utils to support custom generation config classes
  3. Added a custom generate func to ChameleonForConditionalGeneration
class ChameleonGenerationConfig(GenerationConfig):
    """Generation Config for [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon)

    Args:
        multimodal_generation_mode (`Literal["text-only", "image-only", "interleaved-text-image", "unrestricted"]`, *optional*, defaults to `None`):
            Chameleon can generate text, images, or both in an interleaved manner. However, only text generation is
            supported by the official model checkpoint. This flag enables the other modes for use with finetuned versions
            of the model such as [Anole](https://arxiv.org/abs/2407.06135).
            - If set to `"unrestricted"`, the logits are left as-is.
            - If set to `"text-only"`, logits for image tokens will be masked out during generation.
            - If set to `"image-only"`, logits for non-image tokens will be masked out during generation.
            - For `"interleaved-text-image"`, Chameleon implements a finite state machine to dynamically switch between text and image modalities.
                Here, we simply use logits processors that exclusively allow image tokens to be generated within a relative window after the
                begin image token and disallow them elsewhere.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.multimodal_generation_mode = kwargs.pop("multimodal_generation_mode", "text-only")

Contributor Author

multimodal_generation_mode is now also in the converted model here: https://huggingface.co/leloy/Anole-7b-v0.1-hf/blob/main/generation_config.json

[image]

Member

awesome, I guess that will be the official checkpoint after merging the PR right? We can then use it in IntegrationTests

@leloykun
Contributor Author

@zucchini-nlp this should now be finished I think

the failing test seems to have been caused by the issue here: #32094
which is unrelated to this PR

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks for adding this model ❤️

Looks good to me, but we still don't have tests for image-only generation and interleaved-generation modes. We have to make sure the added model works correctly. I'm approving the PR, and will request review from core maintainers

@thaoshibe

Hi @leloykun, thank you for the awesome work!

I ran your example, but all the outputs are black... Is there anything missing here?
[image]

Thank you!!

@leloykun
Contributor Author

Thanks for the feedback, @thaoshibe!

Apparently, we just need to enable sampling during generation (by passing do_sample=True to .generate). If I'm not mistaken, this is because most of the image tokens during training were for "empty" patches. So, greedy decoding of image tokens wouldn't work well.

I've also just updated the docs. Please let me know if you encounter any more issues.

One more thing: loading the model in bfloat16 (the dtype used for finetuning Anole) also seems to improve generation. See:

import torch
from transformers import ChameleonForConditionalGeneration

model_id = "leloy/Anole-7b-v0.1-hf"  # the Anole checkpoint discussed in this thread
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

@zucchini-nlp
Member

zucchini-nlp commented Jul 23, 2024

We can actually add those to the generation config after uploading the model to the hub. We did the same for official Chameleon; it performs best with do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2

Same for bf16; it can go in the usage tips section of the docs. Then we can change all example snippets to bf16
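
Illustrative only: one way those sampling defaults could be recorded in the checkpoint's generation_config.json so users don't have to pass them by hand (the save path is a placeholder):

from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
# Save next to the model weights, or push it to the hub checkpoint directly.
generation_config.save_pretrained("path/to/anole-checkpoint")
# generation_config.push_to_hub("leloy/Anole-7b-v0.1-hf")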

@thaoshibe

[image]

Got it -- Thank you @leloykun -- I ran your code and I got the correct output :D

@leloykun
Contributor Author

We can actually add those to the generation config after uploading the model to the hub. We did the same for official Chameleon; it performs best with do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2

Same for bf16; it can go in the usage tips section of the docs. Then we can change all example snippets to bf16

@zucchini-nlp , increasing the repetition penalty might not be good for image generation cuz a lot of the image tokens are repeating (e.g. snow tokens when generating a snowman).

I'll only include the others for now then run a hp search

@leloykun
Contributor Author

The test errors are all huggingface hub related hnng

@amyeroberts
Collaborator

@leloykun Just triggered a re-run. Hopefully just transient issues!

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for all the work on adding this capability and adding examples!

Main comments are to do with properly testing this new feature, taking and returning torch tensors for the post processing method, and the max_new_tokens behaviour for generating images

src/transformers/image_transforms.py
if (
    config.attn_resolutions is not None
    and curr_res in config.attn_resolutions
    and config.attn_type == "vanilla"
Collaborator

What are the possible values of config.attn_type? I'm a bit worried this can be confused with attn_implementation, a standard config param

if i_level != 0:
    up.upsample = ChameleonVQVAEDecoderConvUpsample(block_in)
    curr_res = curr_res * 2
self.up.insert(0, up)  # prepend to get consistent order
Collaborator

Wouldn't appending also give a consistent order? Inserting at the 0th index will be more expensive, and then we don't need to reverse things in the forward

Comment on lines +1059 to +1066
up = nn.Module()
up.block = block
up.attn = attn
Collaborator

This is an indication to me that we should have another class, ChameleonVQVAEBlock, which sets block and attn within its init (see the sketch below)
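
A possible shape for the suggested class (illustrative only; the concrete block/attention module types come from the surrounding VQVAE code):

from typing import Optional

import torch.nn as nn


class ChameleonVQVAEBlock(nn.Module):
    def __init__(self, block: nn.ModuleList, attn: nn.ModuleList, upsample: Optional[nn.Module] = None):
        super().__init__()
        self.block = block        # resnet blocks for this resolution level
        self.attn = attn          # optional attention blocks
        self.upsample = upsample  # optional upsampling applied at the end of the level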

self,
generation_config: Optional[GenerationConfig] = None,
multimodal_generation_mode: Optional[
Literal["text-only", "image-only", "interleaved-text-image", "unrestricted"]
Collaborator

This is good for documentation, but it doesn't provide validation of the input. It would be good to add a check to make sure it's one of the accepted values, if specified

Contributor Author

@amyeroberts

we do raise an error a few lines further down if the value isn't recognized:

    else:
        raise ValueError(
            f"Unknown multimodal generation mode: {generation_config.multimodal_generation_mode}. Please choose one of 'unrestricted', 'text-only', 'image-only', or 'interleaved-text-image'."
        )

that'd suffice, right? Or should I move this to the start of the func?

return generation_config, model_kwargs

@torch.no_grad()
def generate(
Collaborator

All of the generation modes ("text-only", "image-only", "interleaved-text-image", "unrestricted") should be tested in the model tests

@minostauros minostauros left a comment

Some comments about running the example in the model doc.

docs/source/en/model_doc/chameleon.md
processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained(
    "leloy/Anole-7b-v0.1-hf",
    device_map="auto",


This example failed in my environment with 4 GPUs, complaining about a device mismatch.

Contributor Author

@minostauros can you provide the script you used for this? The complete error message would also help.

Thank you!

@minostauros minostauros Aug 6, 2024

I needed to remove device_map="auto" and manually send the model to a specific CUDA device to run the code properly.

>>> import accelerate
>>> accelerate.__version__
'0.30.1'
>>> import torch
>>> from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
>>> from PIL import Image
>>> 
>>> processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
Some kwargs in processor config are unused and will not have any effect: image_token, image_seq_length. 
>>> model = ChameleonForConditionalGeneration.from_pretrained(
...     "leloy/Anole-7b-v0.1-hf",
...     device_map="auto",
...     torch_dtype=torch.bfloat16,
... )
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.11it/s]
>>> model.device
device(type='cuda', index=0)
>>> url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
>>> image_snowman = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a variation of this image.<image>"
>>> inputs = processor(
...     prompt,
...     images=[image_snowman],
...     padding=True,
...     return_tensors="pt",
... ).to(model.device, dtype=model.dtype)
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
...     max_new_tokens=1026,
...     # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
    return super().generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
    image_tokens = self.get_image_tokens(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1427, in get_image_tokens
    return self.img2bpe_mapping_tensor[image_toks]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
>>> inputs.input_ids.device
device(type='cuda', index=0)
>>> inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'pixel_values'])
>>> inputs.pixel_values.device
device(type='cuda', index=0)
>>> model = model.cuda()
You shouldn't move a model that is dispatched using accelerate hooks.
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
...     max_new_tokens=1026,
...     # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
    return super().generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
    image_tokens = self.get_image_tokens(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1426, in get_image_tokens
    _, _, image_toks = self.vqmodel.encode(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1159, in encode
    hidden_states = self.encoder(pixel_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 979, in forward
    hidden_states = [self.conv_in(pixel_values)]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
>>> model = ChameleonForConditionalGeneration.from_pretrained(
...     "leloy/Anole-7b-v0.1-hf",
...     torch_dtype=torch.bfloat16,
... ).to(device=0)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.02it/s]
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
...     max_new_tokens=1026,
...     # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
>>> generate_ids.shape
torch.Size([1, 2062])


Updating accelerate to 0.33.0 did not help.

Contributor Author

@minostauros does this happen with the base Chameleon model? I.e. without this PR?

The issue with F.conv2d may be unrelated to this PR but the issue with return self.img2bpe_mapping_tensor[image_toks] definitely is
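
One possible direction for that particular failure (a sketch only, not necessarily the fix this PR ends up with): move the index tensor onto the mapping tensor's device before indexing, e.g. inside get_image_tokens:

# with device_map="auto", the mapping tensor and the encoder output can end up
# on different devices; advanced indexing requires them to match
image_toks = image_toks.to(self.img2bpe_mapping_tensor.device)
return self.img2bpe_mapping_tensor[image_toks]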

Contributor Author

hm, I've never seen this happen before but I suspect it's because of the .float() (iirc, to_pil_image rescales the numpy array if it's of float type). What happens if you remove it or cast the array to uint8?

btw, I wouldn't be able to run tests myself for the next few hours as I'm still traveling


What happens if you remove it or cast the array to uint8?

Great point!

# Decode the generated image tokens
pixel_values = model.decode_image_tokens(response_ids[:, 1:-1])
images = processor.postprocess_pixel_values(pixel_values)

# Save the image
from torchvision.transforms.functional import to_pil_image
images = [to_pil_image(img.detach().cpu()) for img in images]
images[0].save("snowman.png")

[image: snowman2]

Perhaps just removing the 255 scaling and type casting in ChameleonImageProcessor.postprocess() would also support torchvision.utils.save_image().

Contributor Author

The output after postprocessing should have the same shape, range, and dtype as the original image so it's better to keep it this way IMO

Contributor Author

I've also just added a test for model sharding btw

pls check it out!


The code now works like a charm! Thanks a lot for your contribution.
That said, the output does not seem as good as what the Anole paper shows.

prompt: 'A piece of paper with word like "Anole" written on it, and a drawing of an Anole.'

  • from the paper: [image]
  • from "leloy/Anole-7b-v0.1-hf": [image]

How may I improve the results?

@leloykun
Contributor Author

leloykun commented Aug 5, 2024

Some comments about running the example in the model doc.

Thanks @minostauros! I'll make sure to fix these in the next commit & tag you when it's ready

@leloykun leloykun marked this pull request as draft August 8, 2024 18:06
@leloykun leloykun requested a review from minostauros August 9, 2024 12:48
@minostauros

Supporting Lumina-mGPT may be the next PR!

@leloykun
Contributor Author

leloykun commented Aug 9, 2024

Supporting Lumina-mGPT may be the next PR!

dang, this looks cool

does it support interleaved text-image generation too?

@minostauros

minostauros commented Aug 9, 2024

does it support interleaved text-image generation too?

Sadly, none of their examples show interleaved text-image generation, even though it's mentioned that the model was trained on interleaved data.

[image]

One interesting approach is that the prompt is aware of image resolutions.

[image]

And here's Lumina-mGPT's unique handling of generation parameters:
[image]

@leloykun
Contributor Author

hmmm

one crucial difference is that Chameleon uses classifier free guidance while this doesn't

I'll look into implementing it, but I think I'm gonna need help with that

@leloykun
Contributor Author

leloykun commented Sep 2, 2024

Hi, I find a bug, if u use ChameleonMoeForConditionalGeneration and calculate ce loss, u should use new special image tokens encoded by vqmodel to update the labels for ce loss calculation~ @leloykun

That doesn't sound right...

We should calculate the CE loss using the BPE-compatible tokens (i.e. the tokens compatible with Chameleon's tokenizer). That's because those are the outputs of the decoder model.

Pls check the img2bpe & bpe2img converter utils

@YeLuoSuiYou

YeLuoSuiYou commented Sep 2, 2024

Hi, I find a bug, if u use ChameleonForConditionalGeneration and calculate ce loss, u should use new special image tokens encoded by vqmodel to update the labels for ce loss calculation~ @leloykun

That doesn't sound right...

We should calculate the CE loss using the BPE-compatible tokens (i.e. the tokens compatible with Chameleon's tokenizer). That's because those are the outputs of the decoder model.

Pls check the img2bpe & bpe2img converter utils

Thanks for the reply.
Yes, we should calculate the CE loss using BPE tokens instead of the "image" token (id 8711), but we can only get the BPE tokens inside ChameleonModel, and the expanded input ids are not returned in the output, so we can't update the labels in ChameleonForConditionalGeneration. So my suggestion is that, when using ChameleonForConditionalGeneration, we get the BPE tokens encoded by the vqmodel in the ChameleonForConditionalGeneration forward function to update input_ids and labels, instead of in ChameleonModel.

@leloykun
Contributor Author

leloykun commented Sep 2, 2024

Hi, I find a bug, if u use ChameleonForConditionalGeneration and calculate ce loss, u should use new special image tokens encoded by vqmodel to update the labels for ce loss calculation~ @leloykun

That doesn't sound right...

We should calculate the CE loss using the BPE-compatible tokens (i.e. the tokens compatible with Chameleon's tokenizer). That's because those are the outputs of the decoder model.

Pls check the img2bpe & bpe2img converter utils

Thanks for reply.
Yes, we should calculate CE loss using BPE tokens instead of "image" (id is 8711), but we only can get BPE token in the ChameleonModel and not return the input ids in the output to update the labels in ChameleonForConditionalGeneration. So my suggestion is that when using ChameleonForConditionalGeneration, we can get BPE tokens encoded by vqmodel in ChameleonForConditionalGeneration forward function to update input_ids and labels instead of in ChameleonModel

Apologies, I'm a bit confused

Can you share your code so we can debug it together?

# Disallow image tokens, which do not include the special begin-image and end-image tokens
image_tokens = self.model.vocabulary_mapping.image_tokens
logits[:, :, image_tokens] = torch.finfo(logits.dtype).min


Here the labels are not updated; maybe we should update the labels here to calculate the CE loss. In addition, to deal with different numbers of images per sample in a batch, my suggestion is to reshape input_ids and fill in the image_tokens via the view method.

Below is a draft of the code I used:

if pixel_values is not None:
    batch_size, sequence_length = input_ids.shape
    input_ids = input_ids.view(batch_size * sequence_length)
    image_tokens = self.model.get_image_tokens(pixel_values)
    special_image_mask = input_ids == self.vocabulary_mapping.image_token_id
    image_tokens = image_tokens.to(input_ids.device, input_ids.dtype)
    input_ids = input_ids.masked_scatter(special_image_mask, image_tokens)
    input_ids = input_ids.view(batch_size, sequence_length)

if labels is not None:
    # update labels with new input_ids
    mask = labels != -100
    labels = torch.where(mask, input_ids, labels)

YeLuoSuiYou pushed a commit to YeLuoSuiYou/transformers that referenced this pull request Sep 3, 2024
@YeLuoSuiYou

Hi, I find a bug, if u use ChameleonForConditionalGeneration and calculate ce loss, u should use new special image tokens encoded by vqmodel to update the labels for ce loss calculation~ @leloykun

That doesn't sound right...
We should calculate the CE loss using the BPE-compatible tokens (i.e. the tokens compatible with Chameleon's tokenizer). That's because those are the outputs of the decoder model.
Pls check the img2bpe & bpe2img converter utils

Thanks for reply.
Yes, we should calculate CE loss using BPE tokens instead of "image" (id is 8711), but we only can get BPE token in the ChameleonModel and not return the input ids in the output to update the labels in ChameleonForConditionalGeneration. So my suggestion is that when using ChameleonForConditionalGeneration, we can get BPE tokens encoded by vqmodel in ChameleonForConditionalGeneration forward function to update input_ids and labels instead of in ChameleonModel

Apologies, I'm a bit confused

Can you share your code so we can debug it together?

Hi, I have submitted a PR to your repository; feel free to refer to it and modify the code.

@leloykun
Contributor Author

leloykun commented Sep 3, 2024

Hi @YeLuoSuiYou! Thanks for the PR!

My current understanding of the matter is:

  1. Internally, if pixel_values is not None, then we use the vqmodel to tokenize them and add the tokens to the input_ids
  2. But we never touch the labels, regardless of whether pixel_values is None or not. We keep it as-is during training.

(2) isn't actually a bug, and your fix doesn't fit the library imo. We don't want the inputs and the outputs to interact before calculating the loss, in order to minimize bugs. cc @zucchini-nlp

What you should do, instead, is to pass labels with the image tokens already added to it. Here's what I do in my finetuning scripts (sketched after the list):

  1. Pass text and images to the processor. Get input_ids and pixel_values in return.
  2. Clone the input_ids as labels (i.e. labels = input_ids.clone())
  3. Use the vqmodel to tokenize the pixel_values
  4. Add the tokens to labels
  5. Pass input_ids, pixel_values, & labels to the model
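
A minimal sketch of those five steps (it assumes the method and attribute names from this PR, e.g. model.model.get_image_tokens and vocabulary_mapping.image_token_id, and leaves out padding/batching concerns):

import requests
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained("leloy/Anole-7b-v0.1-hf", torch_dtype=torch.bfloat16)

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Describe this image.<image>"

# 1. Pass text and images to the processor; get input_ids and pixel_values back.
inputs = processor(prompt, images=[image], return_tensors="pt").to(model.device, dtype=model.dtype)
# 2. Clone the input_ids as labels.
labels = inputs["input_ids"].clone()
# 3. Use the vqmodel to tokenize the pixel_values, 4. add the tokens to labels.
with torch.no_grad():
    image_tokens = model.model.get_image_tokens(inputs["pixel_values"])
special_image_mask = labels == model.model.vocabulary_mapping.image_token_id
labels = labels.masked_scatter(special_image_mask, image_tokens.to(labels.device, labels.dtype))
# 5. Pass input_ids, pixel_values & labels to the model; the forward pass computes the CE loss.
outputs = model(**inputs, labels=labels)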

@zucchini-nlp
Member

What you should do, instead, is to pass labels with the image tokens already added to it.

I couldn't locate the PR, but this is exactly what is expected. We (transformers) return input_ids already expanded to account for the image tokens, and the user only has to clone input_ids and mask pad token ids, similar to a language modeling task. And Chameleon, as one of the latest VLMs added to the library, follows this. Older VLMs like LLaVA are still in progress and might need more intervention from the user.

@zucchini-nlp
Member

@leloykun btw, seems like the most important thing left to add on this PR is the tests. Let me know if you need any help with that, would be super nice to have this PR merged soon :)

@YeLuoSuiYou

Hi @YeLuoSuiYou! Thanks for the PR!

My current understanding of the matter is:

  1. Internally, if pixel_values is not None, then we use the vqmodel to tokenize them and add the tokens to the input_ids
  2. But we never touch the labels, regardless of whether pixel_values is None or not. We keep it as-is during training.

(2) isn't actually a bug and your fix doesn't fit the library imo. We don't want the inputs and the outputs to interact before the calculating the loss in order to minimize bugs. cc @zucchini-nlp

What you should do, instead, is to pass labels with the image tokens already added to it. Here's what I do in my finetuning scripts:

  1. Pass text and images to the processor. Get input_ids and pixel_values in return.
  2. Clone the input_ids as labels (i.e. labels = input_ids.clone())
  3. Use the vqmodel to tokenize the pixel_values
  4. Add the tokens to labels
  5. Pass input_ids, pixel_values, & labels to the model

Thanks for the reply, I got it

@minostauros minostauros left a comment

I'm already using some of the features added by this PR and hope it goes upstream soon.
Thanks for your work!

@ArthurZucker
Collaborator

@leloykun feel free to ping @zucchini-nlp again for a review, I'll do the final one afterwards!

@ArthurZucker ArthurZucker removed their request for review November 19, 2024 15:20
@zucchini-nlp
Member

@leloykun hey, I can take over and write tests if you are busy, so we can merge faster. I think everything else was approved earlier

@leloykun
Contributor Author

Hi @zucchini-nlp ! I'd really appreciate it as I don't see myself being able to continue working on it for the next few weeks.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review December 3, 2024 12:06
@zucchini-nlp
Member

@ArthurZucker ready for review! Added some more tests and removed unused logits processors to reduce the maintenance burden. Actually, I believe the generation might be doable with a simple prefix constraint and one new logits processor, but I didn't have time to check it out. It should be very similar to Emu3 inference. (A rough sketch of the prefix-constraint idea follows.)
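
For reference, a rough sketch of the prefix-constraint idea (illustrative only; the token ids below are placeholders, and prefix_allowed_tokens_fn is the existing hook in generate()):

# Placeholder ids; the real values come from Chameleon's vocabulary mapping.
BEGIN_IMAGE_ID = 1001
END_IMAGE_ID = 1002
IMAGE_TOKEN_IDS = list(range(2000, 2000 + 1024))
ALL_TOKEN_IDS = list(range(32000))


def prefix_allowed_tokens_fn(batch_id, input_ids):
    ids = input_ids.tolist()
    inside_image = ids.count(BEGIN_IMAGE_ID) > ids.count(END_IMAGE_ID)
    if inside_image:
        # Between the begin-image and end-image markers, only image tokens
        # (plus the closing marker) may be sampled.
        return IMAGE_TOKEN_IDS + [END_IMAGE_ID]
    return ALL_TOKEN_IDS


# generate_ids = model.generate(**inputs, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, do_sample=True)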
