Error occurs in training diffusion #18
Comments
That's not a problem with the git error; I resolved the git error, but the error that follows still exists.
In other words, after running the VAE training command you gave, executing the diffusion training command you gave results in an error.
You might want to change the cut_ratio when training the diffusion model. The command you used trained the VAE with cut_ratio=16, but the default for diffusion training is 32; you should change that to 16 as well.
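For example, the diffusion command below would become (an illustrative guess, assuming train.py accepts the same --cut_ratio flag for diffusion configs as it does for VAE configs):
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32 --cut_ratio 16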
But the latent dimension given in the paper is 16. Will performance be significantly affected by a different latent dimension?
Hi, thanks for trying it out! 16 or 8 does not make a big difference. I will fix some of the instructions.
Thanks!
At first, I tried to train the coarse VAE using the given command:
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16
Due to the GPU difference (my GPU is a single A800, while the paper used 8 × V100), I changed the batch size to 16 and set gradient accumulation to 2.
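(Assuming the effective batch size is gpus × batch_size × accumulate_grad_batches, as is usual with gradient accumulation in PyTorch Lightning, these settings give 1 × 16 × 2 = 32 samples per optimizer step.)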
After successfully training the coarse VAE, I tried to train the coarse diffusion model using the given command (again, only the batch size and gradient accumulation were changed):
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32
But an error occurred:
2024-07-19 15:47:45.053 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:
wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240719_154747-rk4p0a77
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run chair_diffusion_dense/16x16x16_kld-0.03
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
2024-07-19 15:48:01.165 | INFO | xcube.modules.autoencoding.sunet:__init__:240 - latent dim: 16
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 380, in
net_model = net_module(model_args)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 84, in init
self.vae = self.load_first_stage_from_pretrained().eval()
File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 264, in load_first_stage_from_pretrained
return net_module.load_from_checkpoint(args_ckpt, hparams=model_args)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
return _load_from_checkpoint(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
return _load_state(cls, checkpoint, strict=strict, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 247, in _load_state
keys = obj.load_state_dict(checkpoint["state_dict"], strict=strict)
File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight: copying a param with shape torch.Size([64, 512, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 512, 3, 3, 3]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.Conv.weight: copying a param with shape torch.Size([64, 64, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3, 3]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight: copying a param with shape torch.Size([512, 32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 16, 3, 3, 3]).
wandb: 🚀 View run chair_diffusion_dense/16x16x16_kld-0.03 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240719_154747-rk4p0a77/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Also, there is no error when using the VAE checkpoint you provided for download.
Could you please help me? Thanks!
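For what it's worth, the shapes in the traceback look consistent with the checkpoint having been trained with latent dim 32 (pre-KL conv out_channels 64 = 2 × 32 for mean and log-variance; post-KL conv in_channels 32), while the diffusion config builds its first-stage VAE with latent dim 16, which matches the cut_ratio mismatch described above. Below is a minimal sketch for reading the latent dim out of a checkpoint; the key name comes from the traceback, the path is a placeholder, and the 2 × latent_dim relation is an inference rather than something confirmed by the XCube code:

import torch

# Hypothetical path; point this at the coarse-VAE checkpoint that
# load_first_stage_from_pretrained tries to load.
ckpt = torch.load("path/to/vae_checkpoint.ckpt", map_location="cpu")
state = ckpt["state_dict"]

# Assumption inferred from the traceback: the pre-KL conv stacks mean and
# log-variance, so its out_channels is 2 * latent_dim (here 64 -> latent dim 32).
key = "unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight"
print("checkpoint latent dim:", state[key].shape[0] // 2)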