Training initialization issue #40

Open
buggyyang opened this issue Nov 4, 2024 · 3 comments
Comments

@buggyyang

{'mid_block_add_attention', 'use_quant_conv', 'scaling_factor', 'force_upcast', 'shift_factor', 'latents_std', 'use_post_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'time_embedding_type', 'use_linear_projection', 'class_embeddings_concat', 'transformer_layers_per_block', 'time_embedding_dim', 'upcast_attention', 'time_cond_proj_dim', '_center_input_sample', 'projection_class_embeddings_input_dim', 'encoder_hid_dim', 'encoder_hid_dim_type', 'addition_embed_type_num_heads', '_out_channels', 'addition_embed_type', 'attention_type', 'only_cross_attention', 'dropout', 'class_embed_type', 'mid_block_only_cross_attention', 'time_embedding_act_fn', 'timestep_post_act', 'mid_block_type', 'num_class_embeds', 'dual_cross_attention', 'conv_in_kernel', 'resnet_time_scale_shift', 'num_attention_heads', 'reverse_transformer_layers_per_block', '_landmark_net', 'addition_time_embed_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight']
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'use_linear_projection', 'motion_module_decoder_only', 'motion_module_kwargs', 'only_cross_attention', 'class_embed_type', 'dual_cross_attention', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'motion_module_type', 'audio_attention_dim', 'motion_module_mid_block', 'num_class_embeds', 'resnet_time_scale_shift', 'upcast_attention', 'motion_module_resolutions', 'stack_enable_blocks_depth', 'use_audio_module', 'stack_enable_blocks_name'} was not found in config. Values will be initialized to default values.
11/01/2024 10:24:55 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module

Hi, I have a question regarding training initialization. Specifically, I cannot properly load the Stable Diffusion checkpoints to initialize stage 1 training; there seems to be some mismatch in the configuration. I wonder if there is anything wrong with the pretrained weights.

Thank you!
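For context, the "... was not found in config" and "... are not expected and will be ignored" messages are diffusers' standard behavior when a checkpoint's config.json and the model class disagree on attributes; they are warnings, not load failures. A minimal sketch to confirm the base weights load cleanly on their own (the path is a hypothetical stand-in for the checkpoint directory referenced in stage1.yaml):

```python
# Minimal sketch, assuming a local diffusers-format Stable Diffusion checkpoint.
from diffusers import AutoencoderKL, UNet2DConditionModel

base = "./pretrained_models/stable-diffusion-v1-5"  # hypothetical local path

vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")

# If this succeeds, the config warnings above are benign: diffusers fills
# attributes missing from config.json with defaults and ignores unexpected ones.
print(f"UNet params: {sum(p.numel() for p in unet.parameters()) / 1e6:.1f}M")
```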

@cuijh26
Contributor

cuijh26 commented Nov 4, 2024

Stage 1 doesn't need to load the motion module, so it's 0.

@buggyyang
Author

I see. Thanks! One more question: is there anything else I need to pay attention to? The primary reason I raised this issue is that the model always yields a NaN loss, even at the beginning of training. Apart from the necessary path-related configs, the only thing I changed in stage1.yaml is the batch size ("train_bs: 4"), since we do not have enough GPU memory for training. Do you have any insight into this? Thank you!
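For what it's worth, fp16 master weights are a common source of NaN losses. A quick way to fail fast and locate the first non-finite value is a check like the sketch below (check_finite is a hypothetical helper, not part of the repo):

```python
import torch

def check_finite(loss, model, step):
    """Stop at the first NaN/Inf instead of training through it.
    Call after loss.backward() so gradients are populated."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite grad in {name} at step {step}")

# Optional: torch.autograd.set_detect_anomaly(True) also reports the backward
# op that produced the first NaN, at a large speed cost.
```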

@buggyyang
Author

@cuijh26 In the config file, weight_dtype is set to "fp16", which is effectively not trainable due to the low precision. However, even with weight_dtype: "fp32", mixed_precision: "fp16", and train_bs: 1, the model still exceeds the memory capacity of our V100 GPU (32 GB). Does this mean an A100 is required to train this model, or is there something else I need to change? Thank you!
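For reference, a few standard memory savers that are often combined with fp32 weights plus fp16 autocast; a minimal sketch, where `unet` stands in for the script's own UNet2DConditionModel and `params` for its trainable parameters. Whether this fits in 32 GB still depends on resolution and batch size:

```python
# Minimal sketch of common memory-reduction options (assumed objects:
# `unet` is the diffusers UNet, `params` its trainable parameters).
import bitsandbytes as bnb
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # fp32 master weights, fp16 compute

unet.enable_gradient_checkpointing()  # trade recompute for a large activation-memory cut
try:
    unet.enable_xformers_memory_efficient_attention()  # requires xformers installed
except Exception:
    pass

optimizer = bnb.optim.AdamW8bit(params, lr=1e-5)  # 8-bit optimizer states
```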
