Possible reason:
- Too many high-resolution frames for parallel decoding. The default setting will request about 66 GB of peak VRAM.
Try this:
- Reduce the number of jointly decoded frames `en_and_decode_n_samples_a_time` in `inference/vista.yaml` (see the sketch below).
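
A minimal sketch of the change, assuming `en_and_decode_n_samples_a_time` sits under `model.params` as in related SVD-style configs; the nesting and the value `1` are illustrative only, so check the actual file for its real layout and default:

```yaml
# inference/vista.yaml (excerpt; surrounding nesting is an assumption for illustration)
model:
  params:
    # Number of frames pushed through the autoencoder in one pass.
    # A smaller value lowers the peak VRAM of decoding at the cost of speed.
    en_and_decode_n_samples_a_time: 1
```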
---
Possible reason:
- A network failure.
Try this:
- Download `openai/clip-vit-large-patch14` and `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` in advance.
- Set the `version` of `FrozenCLIPEmbedder` and `FrozenOpenCLIPImageEmbedder` in `vwm/modules/encoders/modules.py` to the new paths of `pytorch_model.bin`/`open_clip_pytorch_model.bin` (a quick way to verify the local copies is sketched below).
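
Before editing `vwm/modules/encoders/modules.py`, it can help to confirm that the local copies load without network access. The sketch below assumes the two embedders are backed by `transformers` and `open_clip`, as in similar SVD-style codebases; the paths are placeholders for wherever you stored the downloads:

```python
# Sanity check for locally downloaded CLIP weights (paths are placeholders).
import open_clip
from transformers import CLIPTextModel, CLIPTokenizer

CLIP_DIR = "/path/to/clip-vit-large-patch14"  # local download of openai/clip-vit-large-patch14
OPEN_CLIP_CKPT = "/path/to/CLIP-ViT-H-14-laion2B-s32B-b79K/open_clip_pytorch_model.bin"

# Loads the text encoder and tokenizer from the local directory (no network needed).
tokenizer = CLIPTokenizer.from_pretrained(CLIP_DIR)
text_model = CLIPTextModel.from_pretrained(CLIP_DIR)

# open_clip accepts a direct checkpoint path for `pretrained`.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-H-14", pretrained=OPEN_CLIP_CKPT)

print("Local CLIP weights load correctly.")
```

If both load, point the `version` of `FrozenCLIPEmbedder` at the directory (`CLIP_DIR` above) and the `version` of `FrozenOpenCLIPImageEmbedder` at the `open_clip_pytorch_model.bin` path.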
---
Possible reason:
- The dimension of cross-attention is not expanded when the action conditions are injected, resulting in a mismatch.
Try this:
- Enable `action_control: True` in the YAML config file, as sketched below.
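
A minimal sketch of the switch; the nesting is an assumption for illustration, so enable the flag wherever `action_control` is already defined in your config:

```yaml
# Config YAML (excerpt; nesting is an assumption for illustration)
model:
  params:
    network_config:
      params:
        # Expand the cross-attention dimension so that the injected action
        # conditions match the action-conditioned checkpoint.
        action_control: True
```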
---
<= Previous: [Sampling]