
RuntimeError: Error(s) in loading state_dict for Mamba2DModel: size mismatch for additional_embed: copying a param with shape torch.Size([1, 1026, 1536]) from checkpoint, the shape in current model is torch.Size([1, 258, 1536]). #9

Closed
lihao-doc opened this issue Jun 27, 2024 · 19 comments

Comments

@lihao-doc

[2024-06-27 08:23:37,448] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
I0627 08:23:38.489170 136925989832512 eval_ldm_discrete.py:140] Process 0 using device: cuda
Counting ImageNet files from assets/datasets/ImageNet
Finish counting ImageNet files
Missing train samples: 1280444 < 1281167
1000 classes
cnt[:10]: tensor([1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300.])
frac[:10]: [tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010)]
prepare the dataset for classifier free guidance with p_uncond=0.1
2024-06-27 08:23:41,511 - _cpp_lib.py - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.1.1)
Python 3.9.19 (you have 3.9.19)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
2024-06-27 08:23:56,201 - eval_ldm_discrete.py - load nnet from workdir/imagenet256_H_DiM/default/ckpts/425000.ckpt/nnet_ema.pth
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:341 in <module> │
│ │
│ 338 │
│ 339 │
│ 340 if __name__ == "__main__": │
│ ❱ 341 │ app.run(main) │
│ 342 │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/absl/app.py:308 in run │
│ │
│ 305 │ callback = _init_callbacks.popleft() │
│ 306 │ callback() │
│ 307 │ try: │
│ ❱ 308 │ _run_main(main, args) │
│ 309 │ except UsageError as error: │
│ 310 │ usage(shorthelp=True, detailed_error=error, exitcode=error.exitcode) │
│ 311 │ except: │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/absl/app.py:254 in _run_main │
│ │
│ 251 │ atexit.register(profiler.print_stats) │
│ 252 │ sys.exit(profiler.runcall(main, argv)) │
│ 253 else: │
│ ❱ 254 │ sys.exit(main(argv)) │
│ 255 │
│ 256 │
│ 257 def call_exception_handlers(exception): │
│ │
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:337 in main │
│ │
│ 334 │ config = FLAGS.config │
│ 335 │ config.nnet_path = FLAGS.nnet_path │
│ 336 │ config.output_path = FLAGS.output_path │
│ ❱ 337 │ evaluate(config) │
│ 338 │
│ 339 │
│ 340 if __name__ == "__main__": │
│ │
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:156 in evaluate │
│ │
│ 153 │ nnet = accelerator.prepare(nnet) │
│ 154 │ logging.info(f'load nnet from {config.nnet_path}') │
│ 155 │ if (config.nnet_path is not None) and (config.sample.algorithm != 'dpm_solver_upsamp │
│ ❱ 156 │ │ accelerator.unwrap_model(nnet).load_state_dict(torch.load(config.nnet_path, map │
│ 157 │ else: │
│ 158 │ │ accelerator.unwrap_model(nnet) │
│ 159 │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/torch/nn/modules/module.py:215 │
│ 2 in load_state_dict │
│ │
│ 2149 │ │ │ │ │ │ ', '.join(f'"{k}"' for k in missing_keys))) │
│ 2150 │ │ │
│ 2151 │ │ if len(error_msgs) > 0: │
│ ❱ 2152 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( │
│ 2153 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(error_msgs))) │
│ 2154 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 2155 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for Mamba2DModel:
size mismatch for additional_embed: copying a param with shape torch.Size([1, 1026, 1536]) from checkpoint, the shape in current model is torch.Size([1, 258, 1536]).

@tyshiwo1
Owner

tyshiwo1 commented Jun 27, 2024

It seems that you are loading the weights of a model trained at $256 \times 256$ with the $512 \times 512$ config. Can you share the config you used?

Since you get the message load nnet from workdir/imagenet256_H_DiM/default/ckpts/425000.ckpt/nnet_ema.pth, have you downloaded the checkpoint with the correct resolution ($256 \times 256$)?
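
For reference, the two shapes in the error line up exactly with the token counts at the two resolutions. Here is a minimal sketch of the arithmetic, assuming an 8x VAE downsampling and 2 extra tokens (my reading of the error message, not code from this repo):

def expected_additional_embed_len(image_size, patch_size=2, vae_downsample=8, num_extra_tokens=2):
    # 256 -> 32x32 latent -> (32 // 2) ** 2 = 256 patch tokens
    # 512 -> 64x64 latent -> (64 // 2) ** 2 = 1024 patch tokens
    latent_size = image_size // vae_downsample
    num_patches = (latent_size // patch_size) ** 2
    return num_patches + num_extra_tokens

print(expected_additional_embed_len(256))  # 258, the model built from the 256 config
print(expected_additional_embed_len(512))  # 1026, the checkpoint being loaded

So the checkpoint on disk holds $512 \times 512$ weights while the config builds a $256 \times 256$ model.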

@lihao-doc
Author

lihao-doc commented Jun 27, 2024

imagenet256_H_DiM.py
import ml_collections


def d(**kwargs):
    """Helper for creating a config dict."""
    return ml_collections.ConfigDict(initial_dictionary=kwargs)


def get_config():
    config = ml_collections.ConfigDict()

    config.seed = 1234
    config.pred = 'noise_pred'
    config.z_shape = (4, 32, 32)

    config.autoencoder = d(
        pretrained_path='assets/stable-diffusion/autoencoder_kl_ema.pth'
    )

    # config.gradient_accumulation_steps = 2  # 1
    config.max_grad_norm = 1.0

    config.train = d(
        n_steps=750000,  # 300000
        batch_size=768,
        mode='cond',
        log_interval=10,
        eval_interval=5000,
        save_interval=25000,  # 50000
    )

    config.optimizer = d(
        name='adamw',
        lr=0.0002,
        weight_decay=0.03,
        betas=(0.99, 0.99),
        eps=1e-15,
    )

    config.lr_scheduler = d(
        name='customized',
        warmup_steps=5000,
    )

    learned_sigma = False
    latent_size = 32
    in_channels = 4  # 3
    config.nnet = d(
        name='Mamba_DiT_H_2',
        attention_head_dim=1536 // 1, num_attention_heads=1, num_layers=49,
        in_channels=in_channels,
        num_embeds_ada_norm=1000,
        sample_size=latent_size,
        activation_fn="gelu-approximate",
        attention_bias=True,
        norm_elementwise_affine=False,
        norm_type="ada_norm_single",  # "layer_norm"
        out_channels=in_channels * 2 if learned_sigma else in_channels,
        patch_size=2,
        mamba_d_state=16,
        mamba_d_conv=3,
        mamba_expand=2,
        use_bidirectional_rnn=False,
        mamba_type='enc',
        nested_order=0,
        is_uconnect=True,
        no_ff=True,
        use_conv1d=True,
        is_extra_tokens=True,
        rms=True,
        use_pad_token=True,
        use_a4m_adapter=True,
        drop_path_rate=0.0,
        encoder_start_blk_id=1,
        kv_as_one_token_idx=-1,
        num_2d_enc_dec_layers=6,
        pad_token_schedules=['dec_split', 'lateral'],
        is_absorb=False,
        use_adapter_modules=True,
        sequence_schedule='dilated',
        sub_sequence_schedule=['reverse_single', 'layerwise_cross'],
        pos_encoding_type='learnable',
        scan_pattern_len=4 - 1,
        is_align_exchange_q_kv=False,
        is_random_patterns=False,
    )
    config.gradient_checkpointing = False

    config.dataset = d(
        name='imagenet',
        path='assets/datasets/ImageNet',
        resolution=256,
        cfg=True,
        p_uncond=0.1,
    )

    config.sample = d(
        sample_steps=50,
        n_samples=50000,
        mini_batch_size=25,  # the decoder is large
        algorithm='dpm_solver',
        cfg=True,
        scale=0.4,
        path=''
    )

    return config

I downloaded the checkpoint from: https://drive.google.com/drive/folders/1TTEXKKhnJcEV9jeZbZYlXjiPyV87ZhE0?usp=sharing

@lihao-doc
Author

ImageNet 64x64: Put the standard ImageNet dataset (which contains the train and val directory) to assets/datasets/ImageNet.
ImageNet 256x256 and ImageNet 512x512: Extract ImageNet features according to scripts/extract_imagenet_feature.py.

Currently, I have downloaded the ImageNet dataset and placed it according to the prescribed path, but I have not processed it yet. Is it necessary to preprocess the dataset into a 256x256 format? Or does the program automatically handle the dataset formatting?

@tyshiwo1
Owner

It is not necessary to preprocess datasets whose images are smaller than $256 \times 256$. Although this adds some training time and GPU memory, it should not be too much.
For images larger than $512 \times 512$, you can preprocess the dataset like this, which saves a lot of training cost.

@tyshiwo1
Owner

tyshiwo1 commented Jun 28, 2024

The paths of the image samples in our ImageNet dataset look like assets/datasets/ImageNet/train/n07747607/n07747607_61484.JPEG

@tyshiwo1
Owner

tyshiwo1 commented Jun 28, 2024

I'm sorry, I accidentally hit the edit button on your reply with the uploaded config. 😂

After reading it, I think the config you provided is correct. However, your checkpoint should not contain additional_embed with shape torch.Size([1, 1026, 1536]). Have you really downloaded the correct checkpoint? You may try loading the nnet.pth to check whether the evaluation can be performed successfully (using this checkpoint for evaluation would give a worse FID).

Have you loaded the checkpoint correctly? Since others have succeeded, it may not be a problem on my side. #8 (comment)
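
A quick way to check which resolution a downloaded .pth corresponds to, without building the model (a sketch; the path is the one from your log):

import torch

sd = torch.load('workdir/imagenet256_H_DiM/default/ckpts/425000.ckpt/nnet_ema.pth', map_location='cpu')
print(sd['additional_embed'].shape)
# torch.Size([1, 258, 1536])  -> a 256x256 checkpoint
# torch.Size([1, 1026, 1536]) -> a 512x512 checkpoint (what your error reports)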

@lihao-doc
Author

CUDA_VISIBLE_DEVICES="0" python ./eval_ldm_discrete.py --config=configs/imagenet256_H_DiM.py --nnet_path='workdir/imagenet256_H_DiM/default/ckpts/425000.ckpt/nnet.pth'
[2024-06-28 10:25:06,226] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
I0628 10:25:07.460301 130068879759168 eval_ldm_discrete.py:140] Process 0 using device: cuda
Counting ImageNet files from assets/datasets/ImageNet
Finish counting ImageNet files
1000 classes
cnt[:10]: tensor([1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300., 1300.])
frac[:10]: [tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010), tensor(0.0010)]
prepare the dataset for classifier free guidance with p_uncond=0.1
2024-06-28 10:25:10,717 - _cpp_lib.py - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.1.1)
Python 3.9.19 (you have 3.9.19)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
2024-06-28 10:25:26,171 - eval_ldm_discrete.py - load nnet from workdir/imagenet256_H_DiM/default/ckpts/425000.ckpt/nnet.pth
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:341 in <module> │
│ │
│ 338 │
│ 339 │
│ 340 if __name__ == "__main__": │
│ ❱ 341 │ app.run(main) │
│ 342 │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/absl/app.py:308 in run │
│ │
│ 305 │ callback = _init_callbacks.popleft() │
│ 306 │ callback() │
│ 307 │ try: │
│ ❱ 308 │ _run_main(main, args) │
│ 309 │ except UsageError as error: │
│ 310 │ usage(shorthelp=True, detailed_error=error, exitcode=error.exitcode) │
│ 311 │ except: │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/absl/app.py:254 in _run_main │
│ │
│ 251 │ atexit.register(profiler.print_stats) │
│ 252 │ sys.exit(profiler.runcall(main, argv)) │
│ 253 else: │
│ ❱ 254 │ sys.exit(main(argv)) │
│ 255 │
│ 256 │
│ 257 def call_exception_handlers(exception): │
│ │
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:337 in main │
│ │
│ 334 │ config = FLAGS.config │
│ 335 │ config.nnet_path = FLAGS.nnet_path │
│ 336 │ config.output_path = FLAGS.output_path │
│ ❱ 337 │ evaluate(config) │
│ 338 │
│ 339 │
│ 340 if __name__ == "__main__": │
│ │
│ /home/lihao/DiM-DiffusionMamba/./eval_ldm_discrete.py:156 in evaluate │
│ │
│ 153 │ nnet = accelerator.prepare(nnet) │
│ 154 │ logging.info(f'load nnet from {config.nnet_path}') │
│ 155 │ if (config.nnet_path is not None) and (config.sample.algorithm != 'dpm_solver_upsamp │
│ ❱ 156 │ │ accelerator.unwrap_model(nnet).load_state_dict(torch.load(config.nnet_path, map │
│ 157 │ else: │
│ 158 │ │ accelerator.unwrap_model(nnet) │
│ 159 │
│ │
│ /home/lihao/anaconda3/envs/mamba-attn/lib/python3.9/site-packages/torch/nn/modules/module.py:215 │
│ 2 in load_state_dict │
│ │
│ 2149 │ │ │ │ │ │ ', '.join(f'"{k}"' for k in missing_keys))) │
│ 2150 │ │ │
│ 2151 │ │ if len(error_msgs) > 0: │
│ ❱ 2152 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( │
│ 2153 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(error_msgs))) │
│ 2154 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 2155 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for Mamba2DModel:
size mismatch for additional_embed: copying a param with shape torch.Size([1, 1026, 1536]) from checkpoint, the shape in current model is torch.Size([1, 258, 1536]).

Loading the nnet.pth still fails. Are you sure the model you uploaded is correct? I've noticed that the filenames for the 256-resolution and 512-resolution models are identical. The configuration provided by the other individual suggests they might have been using a model they trained themselves. Currently, I need to load the model that you trained.

@lihao-doc
Author

Could you please send me the trained model for 256 resolution?

@tyshiwo1
Owner

OK, I will upload my best 256 model later

@lihao-doc
Author

Once you have uploaded it, could you please provide me with a link, or privately send a copy to my email address HaiLi086@163.com? I am highly interested in your work and would greatly appreciate it!

@tyshiwo1
Owner

Thank you for your appreciation!

I will upload it to this repo and update this row:

ImageNet 256x256 (Huge/2) | 2.21 | 625K | 768

@lihao-doc
Author

I previously downloaded it from here: ImageNet 256x256 (Huge/2) 2.40 425K 768

@tyshiwo1
Owner

If you have not prepared your dataset well, you can modify this line of your config to

config.dataset = d(
    name='imagenet256_features',
    path='assets/datasets/imagenet256_features',
    cfg=True,
    p_uncond=0.1,
)

This setting requires NO prepared dataset for evaluation.

@tyshiwo1
Owner

I previously downloaded it from here: ImageNet 256x256 (Huge/2) 2.40 425K 768

I know. I will give you a new link.

@lihao-doc
Author

How do I prepare the dataset? I'm unable to properly run the script file scripts/extract_imagenet_feature.py.

python scripts/extract_imagenet_feature.py
usage: extract_imagenet_feature.py [-h] path
extract_imagenet_feature.py: error: the following arguments are required: path

My dataset path is: /home/lihao/DiM-DiffusionMamba/assets/datasets/ImageNet/train/n01440764/n01440764_18.JPEG. The images in my dataset have been downloaded but not processed further. How come there is an imagenet256_features folder?

@tyshiwo1
Owner

Here is the best 256 model: https://drive.google.com/drive/folders/1ETllUm8Dpd8-vDHefQEXEWF9whdbyhL5?usp=sharing

You can place the new checkpoint at the path ./workdir/imagenet256_H_mambaenc_pad_cross_conv_skip1_2scan_vaeema_ada_4scan/default/ckpts/625000.ckpt/.

Then, execute this (I just tested it, and it works well):

accelerate launch --multi_gpu --gpu_ids 0,1 --main_process_port 20039 --num_processes 2 --mixed_precision bf16 ./eval_ldm_discrete.py --config=configs/imagenet256_H_DiM.py --nnet_path='workdir/imagenet256_H_mambaenc_pad_cross_conv_skip1_2scan_vaeema_ada_4scan/default/ckpts/625000.ckpt/nnet_ema_256_625k.pth'
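
If you only have a single GPU, a single-process variant of the same command should also work (an untested assumption on my part, using the same flags as above minus the multi-GPU ones):

accelerate launch --num_processes 1 --mixed_precision bf16 ./eval_ldm_discrete.py --config=configs/imagenet256_H_DiM.py --nnet_path='workdir/imagenet256_H_mambaenc_pad_cross_conv_skip1_2scan_vaeema_ada_4scan/default/ckpts/625000.ckpt/nnet_ema_256_625k.pth'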

@tyshiwo1
Owner

How do I prepare the dataset? I'm unable to properly run the script file scripts/extract_imagenet_feature.py.

python scripts/extract_imagenet_feature.py
usage: extract_imagenet_feature.py [-h] path
extract_imagenet_feature.py: error: the following arguments are required: path

My dataset path is: /home/lihao/DiM-DiffusionMamba/assets/datasets/ImageNet/train/n01440764/n01440764_18.JPEG. The images in my dataset have been downloaded but not processed further. How come there is an imagenet256_features folder?

First, I do not use latent extraction for $256 \times 256$ features in the configs of this open-source code.
Second, extract_imagenet_feature.py: error: the following arguments are required: path means you need to pass a path, e.g. python scripts/extract_imagenet_feature.py /home/lihao/DiM-DiffusionMamba/assets/datasets/ImageNet
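
In case it helps, here is a minimal sketch of what latent pre-extraction typically looks like. This is NOT the repo's scripts/extract_imagenet_feature.py; it assumes the diffusers AutoencoderKL API and an ImageFolder-style layout, and the exact output format the training code expects may differ, so treat the repo script as authoritative:

import os
import numpy as np
import torch
from diffusers.models import AutoencoderKL
from torchvision import datasets, transforms

device = 'cuda'
vae = AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-ema').to(device).eval()

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # scale images to [-1, 1]
])
dataset = datasets.ImageFolder('assets/datasets/ImageNet/train', transform=transform)

out_dir = 'assets/datasets/imagenet256_features'
os.makedirs(out_dir, exist_ok=True)
with torch.no_grad():
    for i, (img, label) in enumerate(dataset):
        # Encode one image to a 4x32x32 latent and save it with its class label.
        latent = vae.encode(img[None].to(device)).latent_dist.sample()[0]
        np.save(os.path.join(out_dir, f'{i}.npy'), latent.cpu().numpy())
        np.save(os.path.join(out_dir, f'{i}_label.npy'), np.array(label))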

@lihao-doc
Author

Thank you for your meticulous guidance; I have resolved all of my issues.

@tyshiwo1
Owner

Thank you for your meticulous guidance; I have resolved all of my issues.

OK, I will close the issue.
