Improve support for image generation with Chameleon & Anole #32013
base: main
Conversation
@zucchini-nlp @ArthurZucker this should now be ready for review. The test errors seem to be related to huggingface_hub & bert, and I'm not sure how they relate to this PR.
Great job! Looks good to me in general, the only thing is to make generation happy by moving code to the correct location and checking if we can guide users through interleaved generation with an external FSM library
Also, we need tests for different generation modes, to make sure it's working correctly. This can be added as a slow IntegrationTest in `tests/models/chameleon/test_modeling_chameleon.py`
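For reference, a minimal sketch of what such a slow integration test could look like (the checkpoint id and the 1026-token budget are taken from later in this thread; the test name and assertion are illustrative, not the PR's actual test):

```python
# Sketch only, under the assumptions stated above.
import torch

from transformers import ChameleonForConditionalGeneration, ChameleonProcessor
from transformers.testing_utils import require_torch_gpu, slow


@slow
@require_torch_gpu
def test_image_only_generation():
    model_id = "leloy/Anole-7b-v0.1-hf"  # checkpoint referenced later in this thread
    processor = ChameleonProcessor.from_pretrained(model_id)
    model = ChameleonForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor("Draw a snowman.", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        multimodal_generation_mode="image-only",
        max_new_tokens=1026,  # begin-image token + 1024 image tokens + end-image token
        do_sample=True,
    )
    # only image tokens (plus the two marker tokens) should have been generated
    assert out.shape[1] - inputs.input_ids.shape[1] == 1026
```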
src/transformers/models/chameleon/convert_chameleon_weights_to_hf.py
> Parameters specific to vision-language generation models such as [Chameleon](https://arxiv.org/abs/2405.09818v1)

multimodal_generation_mode (`Literal["text-only", "image-only", "interleaved-text-image", "free"]`, *optional*, defaults to `None`):
    Chameleon can generate text, images, or both in an interleaved manner. However, only text generation is
    supported by the official model checkpoint. This flag enables the other modes for use with finetuned versions
    of the model such as [Anole](https://arxiv.org/abs/2407.06135).
    - If set to `"text-only"`, logits for image tokens will be masked out during generation.
    - If set to `"image-only"`, logits for non-image tokens will be masked out during generation.
    - If set to `"free"`, the logits are left as-is.
    - For `"interleaved-text-image"`, Chameleon implements a finite state machine to dynamically switch between text and image modalities.
      This library does not support this mode yet.
Sorry if I wasn't clear, I meant Chameleon's generation config that is on the hub. Adding args to the general generation config is not a good idea, if it's going to be used only by one model.
Let's see how we can make `generate()` happy. We can:
- When saving the model, add a field `model.generation_config="text-only"` by setting the default to text mode and saving it on the hub
- Move the Chameleon logits processor to the other processors, more comments below
- Add a `generate()` method in the ConditionalGeneration module that does model-specific preparation, in our case takes the generation mode and makes a LogitsProcessor out of it. Then calls `super().generate()` with all kwargs
- Optionally, take the generation output and run `decode_tokens` if in image mode. In case you can make an example with an external library for FSM and interleaved mode, do the separation of image from text here. And return a custom GenerationDecoderOnlyOutput, which will have an extra field for `pixel_values`

Usually I would find a way w/o custom generate, but Chameleon can be an exception given that it's the only model that generates images. Also, custom generate in this way is less prone to bugs from refactoring the general `generate()`, as we only prepare and pass a model-specific processor.
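A minimal sketch of that shape, for illustration (the helper processor and the wrapper function below are hypothetical, not the PR's final implementation):

```python
# Sketch: build a mode-specific logits processor, then defer to the regular generate().
import torch

from transformers import LogitsProcessor, LogitsProcessorList


class AllowedTokensLogitsProcessor(LogitsProcessor):
    """Keep only the scores of `allowed_token_ids`; mask everything else to -inf."""

    def __init__(self, allowed_token_ids):
        self.allowed_token_ids = list(allowed_token_ids)

    def __call__(self, input_ids, scores):
        masked = torch.full_like(scores, torch.finfo(scores.dtype).min)
        masked[:, self.allowed_token_ids] = scores[:, self.allowed_token_ids]
        return masked


def generate_with_mode(model, inputs, multimodal_generation_mode, **kwargs):
    # `model.model.vocabulary_mapping.image_tokens` follows the attribute path used
    # elsewhere in this thread; treat it as an assumption.
    processors = LogitsProcessorList()
    image_tokens = model.model.vocabulary_mapping.image_tokens
    if multimodal_generation_mode == "image-only":
        # in practice the end-of-image / EOS marker tokens also need to stay allowed
        processors.append(AllowedTokensLogitsProcessor(image_tokens))
    elif multimodal_generation_mode == "text-only":
        # the inverse: mask out `image_tokens` (see the logits snippet further down)
        pass
    return model.generate(**inputs, logits_processor=processors, **kwargs)
```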
Hi @zucchini-nlp!
I've:
- Implemented a new generation config for Chameleon
- Modified the utils to support custom generation config classes
- Added a custom `generate` func to `ChameleonForConditionalGeneration`
class ChameleonGenerationConfig(GenerationConfig):
    """Generation Config for [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon)

    Args:
        multimodal_generation_mode (`Literal["text-only", "image-only", "interleaved-text-image", "unrestricted"]`, *optional*, defaults to `None`):
            Chameleon can generate text, images, or both in an interleaved manner. However, only text generation is
            supported by the official model checkpoint. This flag enables the other modes for use with finetuned versions
            of the model such as [Anole](https://arxiv.org/abs/2407.06135).
            - If set to `"unrestricted"`, the logits are left as-is.
            - If set to `"text-only"`, logits for image tokens will be masked out during generation.
            - If set to `"image-only"`, logits for non-image tokens will be masked out during generation.
            - For `"interleaved-text-image"`, Chameleon implements a finite state machine to dynamically switch between text and image modalities.
              Here, we simply use logits processors that exclusively allow image tokens to be generated within a relative window after the
              begin image token and disallow them elsewhere.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.multimodal_generation_mode = kwargs.pop("multimodal_generation_mode", "text-only")
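For context, a hedged example of how such a config could be instantiated and saved so the flag ends up in `generation_config.json` (the output directory name is illustrative):

```python
config = ChameleonGenerationConfig(multimodal_generation_mode="text-only")
config.save_pretrained("anole-7b-hf")  # writes generation_config.json into that folder
```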
`multimodal_generation_mode` is now also in the converted model here: https://huggingface.co/leloy/Anole-7b-v0.1-hf/blob/main/generation_config.json
awesome, I guess that will be the official checkpoint after merging the PR, right? We can then use it in IntegrationTests
src/transformers/models/chameleon/generation_configuration_chameleon.py
src/transformers/models/chameleon/image_processing_chameleon.py
@zucchini-nlp this should now be finished, I think. The failing test seems to have been caused by the issue here: #32094
Thanks for adding this model ❤️
Looks good to me, but we still don't have tests for image-only generation and interleaved-generation modes. We have to make sure the added model works correctly. I'm approving the PR, and will request review from core maintainers
Hi @leloykun, thank you for the awesome work! I ran your example, but all the outputs are black... Is there anything missing here? Thank you!!
Thanks for the feedback, @thaoshibe! Apparently, we just need to enable sampling during generation (by passing `do_sample=True` to `generate`). I've also just updated the docs. Please let me know if you encounter any more issues.

One more thing: loading the model in bfloat16 and with Flash Attention 2 speeds things up considerably:

model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
We can actually add those to the generation config after uploading the model to the hub. We did the same for the official Chameleon; it performs best with a repetition penalty. Same for bf16, it can go in the docs in the usage tips section. Then we can change all example snippets to bf16
Got it -- Thank you @leloykun -- I ran your code and I got the correct output :D
@zucchini-nlp, increasing the repetition penalty might not be good for image generation because a lot of the image tokens repeat (e.g. snow tokens when generating a snowman). I'll only include the others for now, then run a hyperparameter search
The test errors are all huggingface hub related, hnng
@leloykun Just triggered a re-run. Hopefully just transient issues!
Thanks for all the work on adding this capability and adding examples!
Main comments are to do with properly testing this new feature, taking and returning torch tensors for the post-processing method, and the `max_new_tokens` behaviour for generating images
src/transformers/models/chameleon/image_processing_chameleon.py
if (
    config.attn_resolutions is not None
    and curr_res in config.attn_resolutions
    and config.attn_type == "vanilla"
What are the possible values of `config.attn_type`? I'm a bit worried this can be confused with `attn_implementation`, a standard config param
if i_level != 0:
    up.upsample = ChameleonVQVAEDecoderConvUpsample(block_in)
    curr_res = curr_res * 2
self.up.insert(0, up)  # prepend to get consistent order
Wouldn't appending also give a consistent order? Inserting in the 0th index will be more expensive and then we don't need to reverse things in the forward
up = nn.Module()
up.block = block
up.attn = attn
This is an indication to me that we should have another class `ChameleonVQVAEBlock` which within its init sets `block` and `attn`
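Something along these lines, perhaps (a sketch of the suggestion only; the class name is the one proposed above, the rest is illustrative):

```python
import torch.nn as nn


class ChameleonVQVAEBlock(nn.Module):
    """Groups the per-resolution resnet blocks and their optional attention blocks."""

    def __init__(self, block: nn.ModuleList, attn: nn.ModuleList):
        super().__init__()
        self.block = block
        self.attn = attn
```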
self,
generation_config: Optional[GenerationConfig] = None,
multimodal_generation_mode: Optional[
    Literal["text-only", "image-only", "interleaved-text-image", "unrestricted"]
This is good for documentation, but doesn't provide validation of these inputs. It would be good to add a check to make sure it's one of the accepted values if specified
we do raise an error a few lines further down if the value isn't recognized:
else:
    raise ValueError(
        f"Unknown multimodal generation mode: {generation_config.multimodal_generation_mode}. Please choose one of 'unrestricted', 'text-only', 'image-only', or 'interleaved-text-image'."
    )
that'd suffice, right? Or should I move this to the start of the func?
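If it's preferable to fail fast, a sketch of what an early check at the top of the function could look like (the constant name is made up):

```python
ALLOWED_MODES = ("unrestricted", "text-only", "image-only", "interleaved-text-image")
if multimodal_generation_mode is not None and multimodal_generation_mode not in ALLOWED_MODES:
    raise ValueError(
        f"Unknown multimodal generation mode: {multimodal_generation_mode}. "
        f"Please choose one of {', '.join(ALLOWED_MODES)}."
    )
```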
return generation_config, model_kwargs

@torch.no_grad()
def generate(
All of the generation modes "text-only", "image-only", "interleaved-text-image", "unrestricted" should be tested in the model tests
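One lightweight way to cover all four modes, sketched with made-up fixture names (`model`, `inputs`) rather than the repo's actual model tester plumbing:

```python
import pytest


@pytest.mark.parametrize(
    "mode", ["text-only", "image-only", "interleaved-text-image", "unrestricted"]
)
def test_generate_accepts_all_modes(mode, model, inputs):
    out = model.generate(
        **inputs, multimodal_generation_mode=mode, max_new_tokens=4, do_sample=False
    )
    # every mode should run end-to-end and keep the batch dimension intact
    assert out.shape[0] == inputs["input_ids"].shape[0]
```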
Some comments about running the example in the model doc.
processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained(
    "leloy/Anole-7b-v0.1-hf",
    device_map="auto",
This example failed in my environment with 4 GPUs, complaining about a device mismatch.
@minostauros can you provide the script you used for this? The complete error message would also help.
Thank you!
I needed to remove the `device_map="auto"` and manually send the model to a specific cuda device to properly run the code.
>>> import accelerate
>>> accelerate.__version__
'0.30.1'
>>> import torch
>>> from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
>>> from PIL import Image
>>> import requests
>>>
>>> processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
Some kwargs in processor config are unused and will not have any effect: image_token, image_seq_length.
>>> model = ChameleonForConditionalGeneration.from_pretrained(
... "leloy/Anole-7b-v0.1-hf",
... device_map="auto",
... torch_dtype=torch.bfloat16,
... )
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.11it/s]
>>> model.device
device(type='cuda', index=0)
>>> url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
>>> image_snowman = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a variation of this image.<image>"
>>> inputs = processor(
... prompt,
... images=[image_snowman],
... padding=True,
... return_tensors="pt",
... ).to(model.device, dtype=model.dtype)
>>> generate_ids = model.generate(
... **inputs,
... multimodal_generation_mode="image-only",
... # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
... max_new_tokens=1026,
... # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
... do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
return super().generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
result = self._sample(
File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
image_tokens = self.get_image_tokens(pixel_values)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1427, in get_image_tokens
return self.img2bpe_mapping_tensor[image_toks]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
>>> inputs.input_ids.device
device(type='cuda', index=0)
>>> inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'pixel_values'])
>>> inputs.pixel_values.device
device(type='cuda', index=0)
>>> model = model.cuda()
You shouldn't move a model that is dispatched using accelerate hooks.
>>> generate_ids = model.generate(
... **inputs,
... multimodal_generation_mode="image-only",
... # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
... max_new_tokens=1026,
... # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
... do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
return super().generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
result = self._sample(
File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
image_tokens = self.get_image_tokens(pixel_values)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1426, in get_image_tokens
_, _, image_toks = self.vqmodel.encode(pixel_values)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1159, in encode
hidden_states = self.encoder(pixel_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 979, in forward
hidden_states = [self.conv_in(pixel_values)]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
>>> model = ChameleonForConditionalGeneration.from_pretrained(
... "leloy/Anole-7b-v0.1-hf",
... torch_dtype=torch.bfloat16,
... ).to(device=0)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 2.02it/s]
>>> generate_ids = model.generate(
... **inputs,
... multimodal_generation_mode="image-only",
... # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
... max_new_tokens=1026,
... # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
... do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
>>> generate_ids.shape
torch.Size([1, 2062])
Updating accelerate to 0.33.0 did not help.
@minostauros does this happen with the base Chameleon model? I.e. without this PR?
The issue with `F.conv2d` may be unrelated to this PR, but the issue with `return self.img2bpe_mapping_tensor[image_toks]` definitely is
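For what it's worth, one possible fix for that indexing error, assuming the mapping tensor stays on CPU while the inputs are on GPU (a sketch, not necessarily how the PR resolves it):

```python
def get_image_tokens(self, pixel_values):
    _, _, image_toks = self.vqmodel.encode(pixel_values)
    # move the indices to the mapping tensor's device before indexing; alternatively,
    # register img2bpe_mapping_tensor as a buffer so device_map moves it with the module
    image_toks = image_toks.to(self.img2bpe_mapping_tensor.device)
    return self.img2bpe_mapping_tensor[image_toks].to(pixel_values.device)
```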
hm, I've never seen this happen before but I suspect it's because of the `.float()` (iirc, `to_pil_image` rescales the numpy array if it's of float type). What happens if you remove it or cast the array to uint8?
btw, I wouldn't be able to run tests myself for the next few hours as I'm still traveling
> What happens if you remove it or cast the array to uint8?
Great point!
# Decode the generated image tokens
pixel_values = model.decode_image_tokens(response_ids[:, 1:-1])
images = processor.postprocess_pixel_values(pixel_values)
# Save the image
from torchvision.transforms.functional import to_pil_image
images = [to_pil_image(img.detach().cpu()) for img in images]
images[0].save("snowman.png")
Perhaps just removing the 255 scaling and type casting in `ChameleonImageProcessor.postprocess()` may also support `torchvision.utils.save_image()`.
The output after postprocessing should have the same shape, range, and dtype as the original image so it's better to keep it this way IMO
I've also just added a test for model sharding btw
pls check it out!
The code now works like a charm! Thanks a lot for your contribution.
Besides, the output does not seem as good as the Anole paper states.
prompt: 'A piece of paper with word like "Anole" written on it, and a drawing of an Anole.'
- from paper
- from "leloy/Anole-7b-v0.1-hf"
How may I improve the results?
Thanks @minostauros! I'll make sure to fix these in the next commit & tag you when it's ready
Supporting Lumina-mGPT may be the next PR!
dang, this looks cool! Does it support interleaved text-image generation too?
hmmm, one crucial difference is that Chameleon uses classifier-free guidance while this doesn't. I'll look into implementing it, but I think I'm gonna need help with that
That doesn't sound right... We should calculate the CE loss using the BPE-compatible tokens (i.e. the tokens compatible with Chameleon's tokenizer). That's because those are the outputs of the decoder model. Pls check the img2bpe & bpe2img converter utils
Thanks for the reply.
Apologies, I'm a bit confused. Can you share your code so we can debug it together?
# Disallow image tokens, which do not include the special begin-image and end-image tokens
image_tokens = self.model.vocabulary_mapping.image_tokens
logits[:, :, image_tokens] = torch.finfo(logits.dtype).min
The labels are not updated here; maybe we should update them so the CE loss is calculated correctly. In addition, in order to deal with different numbers of images in each sample of a batch, my suggestion is to reshape `input_ids` to fill in the `image_tokens` through the view method.
Below is a draft of the code I used
if pixel_values is not None:
    batch_size, sequence_length = input_ids.shape
    input_ids = input_ids.view(batch_size * sequence_length)
    image_tokens = self.model.get_image_tokens(pixel_values)
    special_image_mask = input_ids == self.vocabulary_mapping.image_token_id
    image_tokens = image_tokens.to(input_ids.device, input_ids.dtype)
    input_ids = input_ids.masked_scatter(special_image_mask, image_tokens)
    input_ids = input_ids.view(batch_size, sequence_length)

if labels is not None:
    # update labels with new input_ids
    mask = labels != -100
    labels = torch.where(mask, input_ids, labels)
Hi, I have submitted a PR to your repository; feel free to refer to this code and modify it.
Hi @YeLuoSuiYou! Thanks for the PR! My current understanding of the matter is:
(2) isn't actually a bug and your fix doesn't fit the library imo. We don't want the inputs and the outputs to interact before calculating the loss, in order to minimize bugs. cc @zucchini-nlp What you should do, instead, is to pass
I couldn't locate the PR but this is exactly what is expected. We (transformers) return
@leloykun btw, seems like the most important thing left to add on this PR is the tests. Let me know if you need any help with that, would be super nice to have this PR merged soon :)
Thanks for the reply, I got it
I'm already using some of the features added by this PR and hope it goes upstream soon.
Thanks for your work!
@leloykun feel free to ping @zucchini-nlp again for a review, I'll do the final one afterwards!
@leloykun hey, I can take over and write tests if you are busy, so we can merge faster. I think everything else was approved earlier
Hi @zucchini-nlp! I'd really appreciate it, as I don't see myself being able to continue working on it for the next few weeks.
@ArthurZucker ready for review! Added some more tests and removed the unused logits processor to reduce the maintenance burden. Actually, I believe the generation might be done with a simple prefix constraint and one new logits processor, but I didn't have time to check it out. It should be very similar to Emu3 inference
What does this PR do?
- Adds Chameleon's `VQVAE` decoder & also includes it in the conversion script.
- ~~Reimplements Chameleon's FSM to be more compatible with Transformers and Outlines (for structured generation).~~ This PR will not add the FSM, but instead just makes it easier for external libraries like Outlines & MMSG to integrate with Transformers to add the interleaved generation mode back in.

Required TODOs:
- Handle `max_length` or `max_new_tokens` on image-only generation mode. The base model doesn't always close `begin-image-token`s with either an `end-image-token` or an EOS token, and finetunes like Anole haven't fully removed this issue yet so they occasionally still do that.
- `text-only` generation mode
- `image-only` generation mode
- Warn when `max_new_tokens` is unset on `image-only` generation mode
- `interleaved-text-image` generation mode
- `unrestricted` generation mode
- Calculate `quant_state_flattened_dims`, which scales with the resolution, instead of hardcoding it in the configs

Optional TODOs or for future PRs:
- Refactor the `mid`, `down`, and `up` blocks into explicit subclasses of `nn.Module()` (as suggested by @amyeroberts)

Links:
(partially) Implements # (issue)

@ArthurZucker @zucchini-nlp @JoyBoy-Su