Bug fix in LTXImageToVideoPipeline.prepare_latents() when latents is already set #10918

Open · wants to merge 3 commits into base: main

Conversation

kakukakujirori

What does this PR do?

There is a small bug in LTXImageToVideoPipeline.prepare_latents() when latents is already set.
The code assumes latents is a five-dimensional tensor (batch, channel, num_frames, height, width), as we can see from the line

num_frames = (
    (num_frames - 1) // self.vae_temporal_compression_ratio + 1 if latents is None else latents.size(2)
)

However, when latents is passed as an argument, the code skips applying self._pack_latents().

Also, the shape of conditioning_mask is wrong in that branch.

This PR addresses these two issues.

"""Code snippet to see the error
"""

import torch
from diffusers import LTXImageToVideoPipeline

device = "cuda:0"

# instantiate a pipeline
pipe = LTXImageToVideoPipeline.from_pretrained(
    "a-r-r-o-w/LTX-Video-0.9.1-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload(device=device)

# create a dummy latents tensor
num_frames = 49
height = 352
width = 640

latent_num_frames = (num_frames - 1) // pipe.vae_temporal_compression_ratio + 1
latent_height = height // pipe.vae_spatial_compression_ratio
latent_width = width // pipe.vae_spatial_compression_ratio

latents = torch.randn((1, 128, latent_num_frames, latent_height, latent_width), device=device)

# run
pipe(
    height=height,
    width=width,
    num_frames=num_frames,
    prompt="test_test",
    latents=latents,
)
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 28
     25 latents = torch.randn((1, 128, latent_num_frames, latent_height, latent_width), device=device)
     27 # run
---> 28 pipe(
     29     height=height,
     30     width=width,
     31     num_frames=num_frames,
     32     prompt="test_test",
     33     latents=latents,
     34 )

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/diffusers/pipelines/ltx/pipeline_ltx_image2video.py:779, in LTXImageToVideoPipeline.__call__(self, image, prompt, negative_prompt, height, width, num_frames, frame_rate, num_inference_steps, timesteps, guidance_scale, num_videos_per_prompt, generator, latents, prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask, decode_timestep, decode_noise_scale, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, max_sequence_length)
    777 # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
    778 timestep = t.expand(latent_model_input.shape[0])
--> 779 timestep = timestep.unsqueeze(-1) * (1 - conditioning_mask)
    781 noise_pred = self.transformer(
    782     hidden_states=latent_model_input,
    783     encoder_hidden_states=prompt_embeds,
   (...)
    791     return_dict=False,
    792 )[0]
    793 noise_pred = noise_pred.float()

RuntimeError: The size of tensor a (2) must match the size of tensor b (1540) at non-singleton dimension 1


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@a-r-r-o-w
Member

I don't think there's a mistake with handling latents here. When a user calls prepare_latents with their own latents, it is assumed to already be "prepared" (in this case, packed into an ndim=3 tensor), and the only operation we wish to perform on the latents is device and dtype casting.

Regarding the mask shape, I believe that might be an actual mistake. Could you try running inference with only the mask_shape-related change and passing an ndim=3 latent?
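
For reference, a minimal sketch (not the library code) of what a "prepared", i.e. packed, latent looks like for the shapes in the repro above, assuming spatial/temporal patch sizes of 1 for this checkpoint (consistent with the 1540-token count in the traceback):

import torch

# Repro shapes: 49 frames -> 7 latent frames, 352x640 -> 11x20 latent grid, 128 channels
latents_5d = torch.randn(1, 128, 7, 11, 20)  # (B, C, F, H, W)

# With patch sizes of 1, packing amounts to flattening the video grid into tokens:
# (B, C, F, H, W) -> (B, F*H*W, C)
packed = latents_5d.flatten(2).transpose(1, 2)
print(packed.shape)  # torch.Size([1, 1540, 128]) -- ndim=3, one row per token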

@kakukakujirori
Author

When user calls prepare_latents with their own latents, it is assumed to already be "prepared" (in this case, packed into ndim=3 tensor)

This case also fails. Since the packed latents tensor has shape (batch, num_patches, num_channels), the line

num_frames = (
    (num_frames - 1) // self.vae_temporal_compression_ratio + 1 if latents is None else latents.size(2)
)

sets num_frames to the channel count rather than the latent frame count, which is not what is intended.
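
(As a concrete illustration with the shapes from the snippet above, not the pipeline code:)

import torch

packed_latents = torch.randn(1, 1540, 128)  # (batch, num_patches, num_channels)
num_frames = packed_latents.size(2)         # -> 128 (the channel count), not the 7 latent frames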

The following is the result when:

  • The input has been packed before being fed to the pipeline
  • The latents packing step is removed from prepare_latents()
"""Code snippet to see the error
"""

import torch
from diffusers import LTXImageToVideoPipeline

device = "cuda:0"

# instantiate a pipeline
pipe = LTXImageToVideoPipeline.from_pretrained(
    "a-r-r-o-w/LTX-Video-0.9.1-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload(device=device)

# create a dummy latents tensor
num_frames = 49
height = 352
width = 640

latent_num_frames = (num_frames - 1) // pipe.vae_temporal_compression_ratio + 1
latent_height = height // pipe.vae_spatial_compression_ratio
latent_width = width // pipe.vae_spatial_compression_ratio

latents = torch.randn((1, 128, latent_num_frames, latent_height, latent_width), device=device)

latents = pipe._pack_latents(latents, pipe.transformer_spatial_patch_size, pipe.transformer_temporal_patch_size)

# run
pipe(
    height=height,
    width=width,
    num_frames=num_frames,
    prompt="test_test",
    latents=latents,
)
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 30
     27 latents = pipe._pack_latents(latents, pipe.transformer_spatial_patch_size, pipe.transformer_temporal_patch_size)
     29 # run
---> 30 pipe(
     31     height=height,
     32     width=width,
     33     num_frames=num_frames,
     34     prompt="test_test",
     35     latents=latents,
     36 )

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/diffusers/pipelines/ltx/pipeline_ltx_image2video.py:784, in LTXImageToVideoPipeline.__call__(self, image, prompt, negative_prompt, height, width, num_frames, frame_rate, num_inference_steps, timesteps, guidance_scale, num_videos_per_prompt, generator, latents, prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask, decode_timestep, decode_noise_scale, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, max_sequence_length)
    781 timestep = t.expand(latent_model_input.shape[0])
    782 timestep = timestep.unsqueeze(-1) * (1 - conditioning_mask)
--> 784 noise_pred = self.transformer(
    785     hidden_states=latent_model_input,
    786     encoder_hidden_states=prompt_embeds,
    787     timestep=timestep,
    788     encoder_attention_mask=prompt_attention_mask,
    789     num_frames=latent_num_frames,
    790     height=latent_height,
    791     width=latent_width,
    792     rope_interpolation_scale=rope_interpolation_scale,
    793     attention_kwargs=attention_kwargs,
    794     return_dict=False,
    795 )[0]
    796 noise_pred = noise_pred.float()
    798 if self.do_classifier_free_guidance:

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/diffusers/models/transformers/transformer_ltx.py:440, in LTXVideoTransformer3DModel.forward(self, hidden_states, encoder_hidden_states, timestep, encoder_attention_mask, num_frames, height, width, rope_interpolation_scale, attention_kwargs, return_dict)
    430         hidden_states = torch.utils.checkpoint.checkpoint(
    431             create_custom_forward(block),
    432             hidden_states,
   (...)
    437             **ckpt_kwargs,
    438         )
    439     else:
--> 440         hidden_states = block(
    441             hidden_states=hidden_states,
    442             encoder_hidden_states=encoder_hidden_states,
    443             temb=temb,
    444             image_rotary_emb=image_rotary_emb,
    445             encoder_attention_mask=encoder_attention_mask,
    446         )
    448 scale_shift_values = self.scale_shift_table[None, None] + embedded_timestep[:, :, None]
    449 shift, scale = scale_shift_values[:, :, 0], scale_shift_values[:, :, 1]

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/miniconda3/envs/py312/lib/python3.12/site-packages/diffusers/models/transformers/transformer_ltx.py:245, in LTXVideoTransformerBlock.forward(self, hidden_states, encoder_hidden_states, temb, image_rotary_emb, encoder_attention_mask)
    243 ada_values = self.scale_shift_table[None, None] + temb.reshape(batch_size, temb.size(1), num_ada_params, -1)
    244 shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = ada_values.unbind(dim=2)
--> 245 norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
    247 attn_hidden_states = self.attn1(
    248     hidden_states=norm_hidden_states,
    249     encoder_hidden_states=None,
    250     image_rotary_emb=image_rotary_emb,
    251 )
    252 hidden_states = hidden_states + attn_hidden_states * gate_msa

RuntimeError: The size of tensor a (1540) must match the size of tensor b (28160) at non-singleton dimension 1

@a-r-r-o-w
Member

Ohh okay, I see! nice catch 🔥

cc @yiyixuxu What do we want to do here? Accept fully prepared latents from the user (ndim=3) and do a fix for that, or accept an ndim=5 tensor and prepare it?

@yiyixuxu
Collaborator

yiyixuxu commented Mar 4, 2025

cc @yiyixuxu What do we want to do here? Accept fully prepared latents from the user (ndim=3) and do a fix for that, or accept ndim=5 tensor and prepare it

I think it should be fully prepared latents (the output of prepare_latents), and we should do a fix for that.

@kakukakujirori
Author

Fixed. We can check the validity using the same code snippet above.


  shape = (batch_size, num_channels_latents, num_frames, height, width)
  mask_shape = (batch_size, 1, num_frames, height, width)

  if latents is not None:
-     conditioning_mask = latents.new_zeros(shape)
+     conditioning_mask = latents.new_zeros(mask_shape)
Collaborator


why do we need this change?

Author


The normal route of prepare_latents() outputs conditioning_mask with that shape, so it is natural to align with it (here).
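
(A rough sketch of that shape relationship with the repro shapes, assuming the normal route marks the first latent frame as the conditioning frame and then packs the mask to one value per token; this is an illustration based on reading the pipeline, not the actual code:)

import torch

# Mask built with mask_shape = (batch_size, 1, num_frames, height, width)
conditioning_mask = torch.zeros(1, 1, 7, 11, 20)
conditioning_mask[:, :, 0] = 1.0  # first latent frame carries the image condition

# Packed down to one value per token, it lines up with the packed latents:
packed_mask = conditioning_mask.flatten(2).transpose(1, 2).squeeze(-1)
print(packed_mask.shape)  # torch.Size([1, 1540]) -- broadcastable with timestep.unsqueeze(-1)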
