Training initialization issue #40

Open
buggyyang opened this issue Nov 4, 2024 · 3 comments
Comments

@buggyyang

{'mid_block_add_attention', 'use_quant_conv', 'scaling_factor', 'force_upcast', 'shift_factor', 'latents_std', 'use_post_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'time_embedding_type', 'use_linear_projection', 'class_embeddings_concat', 'transformer_layers_per_block', 'time_embedding_dim', 'upcast_attention', 'time_cond_proj_dim', '_center_input_sample', 'projection_class_embeddings_input_dim', 'encoder_hid_dim', 'encoder_hid_dim_type', 'addition_embed_type_num_heads', '_out_channels', 'addition_embed_type', 'attention_type', 'only_cross_attention', 'dropout', 'class_embed_type', 'mid_block_only_cross_attention', 'time_embedding_act_fn', 'timestep_post_act', 'mid_block_type', 'num_class_embeds', 'dual_cross_attention', 'conv_in_kernel', 'resnet_time_scale_shift', 'num_attention_heads', 'reverse_transformer_layers_per_block', '_landmark_net', 'addition_time_embed_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight']
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'use_linear_projection', 'motion_module_decoder_only', 'motion_module_kwargs', 'only_cross_attention', 'class_embed_type', 'dual_cross_attention', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'motion_module_type', 'audio_attention_dim', 'motion_module_mid_block', 'num_class_embeds', 'resnet_time_scale_shift', 'upcast_attention', 'motion_module_resolutions', 'stack_enable_blocks_depth', 'use_audio_module', 'stack_enable_blocks_name'} was not found in config. Values will be initialized to default values.
11/01/2024 10:24:55 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
11/01/2024 10:24:56 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module

Hi, I have a question regarding training initialization. Specifically, I cannot properly load the Stable Diffusion checkpoints to initialize stage 1 training; there seems to be some mismatch in the configuration. I wonder if there is anything wrong with the pretrained weights.

Thank you!
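For context, the "... was not found in config" and "... are not expected and will be ignored" messages are diffusers' standard behavior when a checkpoint's config.json and the model class disagree on attributes; they are warnings, not load failures. A minimal sketch to confirm the base weights load cleanly on their own (the path is a hypothetical stand-in for the checkpoint directory referenced in stage1.yaml):

```python
# Minimal sketch, assuming a local diffusers-format Stable Diffusion checkpoint.
from diffusers import AutoencoderKL, UNet2DConditionModel

base = "./pretrained_models/stable-diffusion-v1-5"  # hypothetical local path

vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")

# If this succeeds, the config warnings above are benign: diffusers fills
# attributes missing from config.json with defaults and ignores unexpected ones.
print(f"UNet params: {sum(p.numel() for p in unet.parameters()) / 1e6:.1f}M")
```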

@cuijh26
Contributor

cuijh26 commented Nov 4, 2024

Stage 1 doesn't need to load the motion module, so it's 0.

@buggyyang
Author

I see. Thanks! One more question: is there anything else I need to pay attention to? The primary reason I raised this issue is that the model always yields a NaN loss, even at the beginning of training. Apart from the necessary path-related configs, the only thing I changed in stage1.yaml is the batch size ("train_bs: 4"), since we do not have enough GPU memory for training. Do you have any insight into this? Thank you!
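For what it's worth, fp16 master weights are a common source of NaN losses. A quick way to fail fast and locate the first non-finite value is a check like the sketch below (check_finite is a hypothetical helper, not part of the repo):

```python
import torch

def check_finite(loss, model, step):
    """Stop at the first NaN/Inf instead of training through it.
    Call after loss.backward() so gradients are populated."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite grad in {name} at step {step}")

# Optional: torch.autograd.set_detect_anomaly(True) also reports the backward
# op that produced the first NaN, at a large speed cost.
```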

@buggyyang
Author

@cuijh26 In the config file, weight_dtype is set to "fp16", which is effectively not trainable due to the low precision. However, even with weight_dtype: "fp32", mixed_precision: "fp16", and train_bs: 1, the model still exceeds the memory capacity of our V100 GPU (32 GB). Does this mean an A100 is required to train this model, or is there something else I need to change? Thank you!
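For reference, a few standard memory savers that are often combined with fp32 weights plus fp16 autocast; a minimal sketch, where `unet` stands in for the script's own UNet2DConditionModel and `params` for its trainable parameters. Whether this fits in 32 GB still depends on resolution and batch size:

```python
# Minimal sketch of common memory-reduction options (assumed objects:
# `unet` is the diffusers UNet, `params` its trainable parameters).
import bitsandbytes as bnb
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # fp32 master weights, fp16 compute

unet.enable_gradient_checkpointing()  # trade recompute for a large activation-memory cut
try:
    unet.enable_xformers_memory_efficient_attention()  # requires xformers installed
except Exception:
    pass

optimizer = bnb.optim.AdamW8bit(params, lr=1e-5)  # 8-bit optimizer states
```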
