training loss nan #9

Zhiyuan-R · 2023-01-25T06:09:42Z

Hi, I trained the VAE model following the README, but the training loss becomes NaN. I use 4 GPUs and a batch size of 40, and I kept everything else the same as in the repo.

ZENGXH · 2023-01-25T19:33:35Z

Are you using the ShapeNet dataset as well? Can you share the training log here?

Zhiyuan-R · 2023-01-25T19:51:27Z

Yes! I use ShapeNetCore.v2.PC15k (downloaded from PVD).

Zhiyuan-R · 2023-01-25T19:51:29Z

2023-01-25 00:17:56.473 | INFO | main:get_args:205 - EXP_ROOT: ./exp + exp name: 0125/car/3dbf3ah_hvae_lion_B40, save dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40
2023-01-25 00:17:56.490 | INFO | main:get_args:210 - save config at ./exp/0125/car/3dbf3ah_hvae_lion_B40/cfg.yml
2023-01-25 00:17:56.491 | INFO | main:get_args:213 - log dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40
2023-01-25 00:17:56.491 | INFO | main::227 - In Rank=0
2023-01-25 00:17:56.491 | INFO | main::233 - Node rank 0, local proc 0, global proc 0
2023-01-25 00:17:56.503 | INFO | main::227 - In Rank=1
2023-01-25 00:17:56.504 | INFO | main::233 - Node rank 0, local proc 1, global proc 1
2023-01-25 00:17:56.515 | INFO | main::227 - In Rank=2
2023-01-25 00:17:56.516 | INFO | main::233 - Node rank 0, local proc 2, global proc 2
2023-01-25 00:17:56.528 | INFO | main::227 - In Rank=3
2023-01-25 00:17:56.529 | INFO | main::233 - Node rank 0, local proc 3, global proc 3
2023-01-25 00:17:56.541 | INFO | main::241 - join 3
2023-01-25 00:17:56.651 | DEBUG | utils.utils:init_processes:1140 - set port as 6011
2023-01-25 00:17:56.652 | INFO | utils.utils:init_processes:1151 - init_process: rank=0, world_size=4
2023-01-25 00:17:56.663 | DEBUG | utils.utils:init_processes:1140 - set port as 6011
2023-01-25 00:17:56.664 | INFO | utils.utils:init_processes:1151 - init_process: rank=1, world_size=4
2023-01-25 00:17:56.679 | DEBUG | utils.utils:init_processes:1140 - set port as 6011
2023-01-25 00:17:56.680 | INFO | utils.utils:init_processes:1151 - init_process: rank=2, world_size=4
2023-01-25 00:17:56.715 | DEBUG | utils.utils:init_processes:1140 - set port as 6011
2023-01-25 00:17:56.716 | INFO | utils.utils:init_processes:1151 - init_process: rank=3, world_size=4
2023-01-25 00:17:57.827 | INFO | main:main:29 - use trainer: trainers.hvae_trainer
2023-01-25 00:17:57.831 | INFO | main:main:29 - use trainer: trainers.hvae_trainer
2023-01-25 00:17:57.832 | INFO | main:main:29 - use trainer: trainers.hvae_trainer
2023-01-25 00:17:57.836 | INFO | main:main:29 - use trainer: trainers.hvae_trainer
2023-01-25 00:18:01.625 | INFO | utils.utils:common_init:466 - [common-init] at rank=2, seed=1
2023-01-25 00:18:01.626 | INFO | utils.utils:init:339 - rank=2, init writer as a blackhole
2023-01-25 00:18:01.626 | INFO | utils.utils:common_init:510 - [common-init] DONE
2023-01-25 00:18:01.670 | INFO | utils.utils:common_init:466 - [common-init] at rank=3, seed=1
2023-01-25 00:18:01.671 | INFO | utils.utils:init:339 - rank=3, init writer as a blackhole
2023-01-25 00:18:01.671 | INFO | utils.utils:common_init:510 - [common-init] DONE
2023-01-25 00:18:01.691 | INFO | utils.utils:common_init:466 - [common-init] at rank=0, seed=1
2023-01-25 00:18:01.691 | INFO | utils.utils:common_init:466 - [common-init] at rank=1, seed=1
2023-01-25 00:18:01.692 | INFO | utils.utils:init:331 - Not init TFB
2023-01-25 00:18:01.692 | INFO | utils.utils:init:339 - rank=1, init writer as a blackhole
2023-01-25 00:18:01.692 | INFO | utils.utils:common_init:510 - [common-init] DONE
2023-01-25 00:18:01.693 | INFO | utils.utils:common_init:510 - [common-init] DONE
2023-01-25 00:18:06.292 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-01-25 00:18:06.292 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-01-25 00:18:06.293 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-01-25 00:18:06.298 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-01-25 00:18:06.308 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-01-25 00:18:06.309 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-01-25 00:18:06.311 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-01-25 00:18:06.313 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-01-25 00:18:06.314 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-01-25 00:18:06.317 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-01-25 00:18:06.318 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-01-25 00:18:06.318 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-01-25 00:18:06.321 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-01-25 00:18:06.329 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-01-25 00:18:06.329 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-01-25 00:18:06.332 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-01-25 00:18:06.457 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.458 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1
2023-01-25 00:18:06.458 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-01-25 00:18:06.473 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.474 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1
2023-01-25 00:18:06.474 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-01-25 00:18:06.478 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.479 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1
2023-01-25 00:18:06.479 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-01-25 00:18:06.505 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.505 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1
2023-01-25 00:18:06.505 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-01-25 00:18:06.594 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.610 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.613 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.640 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-01-25 00:18:06.655 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:2
2023-01-25 00:18:06.663 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:1
2023-01-25 00:18:06.669 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:3
2023-01-25 00:18:06.689 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build
2023-01-25 00:18:06.689 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:2
2023-01-25 00:18:06.696 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:0
2023-01-25 00:18:06.704 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build
2023-01-25 00:18:06.704 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:3
2023-01-25 00:18:06.705 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build
2023-01-25 00:18:06.705 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:1
2023-01-25 00:18:06.728 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build
2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:0
2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:0
2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:2
2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:1
2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:3
2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data
2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data
2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data
2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data
2023-01-25 00:18:09.476 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
2023-01-25 00:18:09.477 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:09.478 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
2023-01-25 00:18:09.478 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:09.487 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train
2023-01-25 00:18:09.487 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train
2023-01-25 00:18:09.619 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
2023-01-25 00:18:09.619 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:09.626 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train
2023-01-25 00:18:09.781 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
2023-01-25 00:18:09.781 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:09.787 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train
2023-01-25 00:18:11.014 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.5s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458
2023-01-25 00:18:11.125 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.5s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458
2023-01-25 00:18:11.149 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.7s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458
2023-01-25 00:18:11.199 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.4s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458
2023-01-25 00:18:12.484 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924]
2023-01-25 00:18:12.618 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924]
2023-01-25 00:18:12.801 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924]
2023-01-25 00:18:12.810 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924]
2023-01-25 00:18:13.351 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048
2023-01-25 00:18:13.375 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:13.376 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val
2023-01-25 00:18:13.450 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048
2023-01-25 00:18:13.479 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:13.480 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val
2023-01-25 00:18:13.534 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352
2023-01-25 00:18:13.615 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048
2023-01-25 00:18:13.623 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048
2023-01-25 00:18:13.639 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:13.640 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val
2023-01-25 00:18:13.646 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False
2023-01-25 00:18:13.647 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val
2023-01-25 00:18:13.648 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352
2023-01-25 00:18:13.676 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048
2023-01-25 00:18:13.677 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1
2023-01-25 00:18:13.683 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:2
2023-01-25 00:18:13.794 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048
2023-01-25 00:18:13.795 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1
2023-01-25 00:18:13.801 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:0
2023-01-25 00:18:13.842 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352
2023-01-25 00:18:13.863 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352
2023-01-25 00:18:14.004 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048
2023-01-25 00:18:14.005 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1
2023-01-25 00:18:14.010 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:3
2023-01-25 00:18:14.040 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048
2023-01-25 00:18:14.042 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1
2023-01-25 00:18:14.053 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:1
2023-01-25 00:18:14.394 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15
2023-01-25 00:18:14.655 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15
2023-01-25 00:18:14.924 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15
2023-01-25 00:18:14.959 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15
2023-01-25 00:18:15.220 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 2048, 3])
2023-01-25 00:18:15.247 | INFO | main:main:46 - param size = 22.402731M
2023-01-25 00:18:15.249 | INFO | main:main:68 - not find any checkpoint: ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints, (exist=False), or snapshot ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot, (exist=False)
2023-01-25 00:18:15.250 | INFO | trainers.base_trainer:train_epochs:173 - [rank=2] Start epoch: 0 End epoch: 800, batch-size=40 | Niter/epo=15 | log freq=15, viz freq 6000, val freq 200
2023-01-25 00:18:15.580 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 2048, 3])
2023-01-25 00:18:15.614 | INFO | main:main:46 - param size = 22.402731M
2023-01-25 00:18:15.615 | INFO | trainers.base_trainer:set_writer:57 -

./exp/0125/car/3dbf3ah_hvae_lion_B40
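For reference, the `Niter/epo=15` and `len of train_loader: 15` values in the log can be reproduced with a little arithmetic. The sketch below assumes the loader shards the 2458 training shapes with PyTorch's `DistributedSampler` in its default padding mode and then drops the last incomplete batch (`drop-last=1` in the log). The sampler choice is an assumption, since the log does not name it, and `iters_per_epoch` is a hypothetical helper, not a function from the repo.

```python
import math


def iters_per_epoch(num_samples, world_size, batch_size, drop_last=True):
    """Batches each rank sees per epoch, under the assumptions stated above."""
    # Assumed: the sampler pads so every rank gets ceil(num_samples / world_size).
    per_rank = math.ceil(num_samples / world_size)
    if drop_last:
        # An incomplete final batch is discarded.
        return per_rank // batch_size
    return math.ceil(per_rank / batch_size)


n_train = 2458   # "number of file [2458]" in the log
world_size = 4   # "world_size=4" in the log

print(iters_per_epoch(n_train, world_size, 40))  # 15, matches "Niter/epo=15"
print(iters_per_epoch(n_train, world_size, 32))  # 19
print(40 * world_size)                           # 160 shapes per optimizer step
```

Since 15 iterations per rank only fits a per-GPU batch of 40, the batch size in this run is per GPU, so each optimizer step covers 160 shapes across the 4 GPUs.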

Zhiyuan-R · 2023-01-25T19:51:57Z

And below is my config:

bash_name: ''
clipforge:
  clip_model: ViT-B/32
  enable: 0
  feat_dim: 512
cmt: lion
comet_key: ''
data:
  batch_size: 40
  batch_size_test: 10
  cates: car
  clip_forge_enable: 0
  clip_model: ViT-B/32
  cond_on_cat: 0
  cond_on_voxel: 0
  data_dir: data/ShapeNetCore.v2.PC15k
  dataset_scale: 1
  dataset_type: shapenet15k
  eval_test_split: 0
  input_dim: -1
  is_encode_whole_dataset_trainer: 0
  nclass: 55
  noise_std: 0.1
  noise_std_min: -1.0
  noise_type: normal
  normalize_global: true
  normalize_per_shape: false
  normalize_range: false
  normalize_shape_box: false
  normalize_std_per_axis: false
  num_workers: 4
  random_subsample: 1
  recenter_per_shape: false
  sample_with_replacement: 1
  te_max_sample_points: 2048
  tr_max_sample_points: 2048
  train_drop_last: 1
  type: datasets.pointflow_datasets
  voxel_size: 0.1
ddpm:
  add_point_feat: true
  attn:
  - 0
  - 1
  - 0
  - 0
  beta_1: 0.0001
  beta_T: 0.02
  clip_denoised: 0
  ddim_step: 200
  dropout: 0.1
  ema: 0
  input_dim: 3
  loss_type: l1_sum
  loss_type_0: ''
  loss_weight_cdnorm: 1.0
  loss_weight_emd: 1.0
  model_mean_type: eps
  model_var_type: fixedlarge
  ncenter:
  - 1024
  - 256
  - 64
  - 16
  num_layers_classifier: 3
  num_steps: 1
  p2_gamma: 1.0
  p2_k: 1.0
  sched_mode: linear
  time_dim: 64
  use_bn: true
  use_global_attn: 0
  use_gn: false
  use_new_timeemb: 0
  use_p2_weight: 0
  with_se: 0
dpm:
  train_encoder_only: 0
dpm_ckpt: ''
eval:
  load_other_vae_ckpt: 0
  need_denoise: 0
eval_ddim_step: 0
eval_trainnll: 0
exp_name: ''
has_shapelatent: 1
hash: 3dbf3ah
latent_pts:
  ada_mlp_init_scale: 0.1
  decoder_layer_out_dim: 32
  encoder_layer_out_dim: 32
  hid: 64
  latent_dim_ext:
  - 64
  mask_out_extra_latent: 0
  normalization: bn
  pts_sigma_offset: 0.0
  pvd_mse_loss: 0
  skip_weight: 0.01
  style_dim: 128
  style_encoder: models.shapelatent_modules.PointNetPlusEncoder
  style_mlp: ''
  style_prior: models.score_sde.resnet.PriorSEDrop
  use_linear_for_adagn: 0
  weight_kl_feat: 1.0
  weight_kl_glb: 1.0
  weight_kl_pt: 1.0
log_dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40
log_name: ./exp/0125/car/3dbf3ah_hvae_lion_B40
model_config: default
ngpu: 1
num_ref: 0
num_val_samples: 16
save_dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40
sde:
  attn_mhead: 0
  attn_mhead_local: -1
  autocast_train: false
  beta_end: 20.0
  beta_start: 0.1
  bound_mlogit: 0
  bound_mlogit_value: -5.42
  condition_add: 1
  condition_cat: 0
  cont_kl_anneal: true
  dae_checkpoint: ''
  dataset: shape
  ddim_kappa: 1.0
  ddim_skip_type: uniform
  denoising_stddevs: beta
  diffusion_steps: 1000
  drop_inactive_var: 0
  dropout: 0.2
  ema_decay: 0.9999
  embedding_dim: 128
  embedding_scale: 1.0
  embedding_type: positional
  epochs: 800
  fir: false
  global_prior_ckpt: ''
  grad_clip_max_norm: 0.0
  hier_prior: 0
  hypara_mixing_logit: 0
  init_t: 1.0
  is_continues: 0
  iw_sample_p: ll_iw
  iw_sample_q: reweight_p_samples
  iw_subvp_like_vp_sde: false
  jac_reg_coeff: 0
  jac_reg_freq: 1
  kin_reg_coeff: 0
  kl_anneal_portion_vada: 0.5
  kl_balance_vada: false
  kl_const_coeff_vada: 1.0e-07
  kl_const_portion_vada: 0.0
  kl_max_coeff_vada: 0.5
  learn_mixing_logit: 1
  learning_rate_dae: 0.0003
  learning_rate_dae_local: 0.0003
  learning_rate_min_dae: 0.0003
  learning_rate_min_dae_local: 0.0003
  learning_rate_min_vae: 1.0e-05
  learning_rate_mlogit: -1.0
  learning_rate_vae: 0.0001
  local_prior: same_as_global
  mixed_prediction: false
  mixing_logit_init: -6
  nhead: 4
  num_cell_per_scale_dae: 8
  num_cell_per_scale_dae_local: 0
  num_channels_dae: 256
  num_latent_scales: 1
  num_preprocess_blocks: 2
  num_scales_dae: 2
  ode_eps: 1.0e-05
  ode_sample: 0
  pool_feat_cat: 0
  pos_embed: none
  prior_model: models.latent_points_ada_localprior.PVCNN2Prior
  progressive: none
  progressive_combine: sum
  progressive_input: none
  regularize_mlogit: 0
  regularize_mlogit_margin: 0.0
  sde_type: vpsde
  share_mlogit: 0
  sigma2_0: 0.0
  sigma2_max: 0.99
  sigma2_min: 0.0001
  time_emb_scales: 1.0
  time_eps: 0.01
  train_dae: 1
  train_ode_solver_tol: 1.0e-05
  train_vae: true
  update_q_ema: false
  use_adam: true
  use_adamax: false
  vae_checkpoint: ''
  warmup_epochs: 20
  weight_decay: 0.0003
  weight_decay_norm_dae: 0.0
  weight_decay_norm_vae: 0.0
set_detect_anomaly: 0
shapelatent:
  decoder_num_points: 2048
  decoder_type: models.latent_points_ada.LatentPointDecPVC
  encoder_type: models.latent_points_ada.PointTransPVC
  eps_z_global_only: 1
  freeze_vae: 0
  kl_weight: 0.5
  latent_dim: 1
  local_emb_agg: mean
  log_sigma_offset: 6.0
  loss0_weight: 1.0
  model: models.vae_adain
  prior_type: normal
  residual: 1
snapshot_min: 30
test_size: 660
trainer:
  anneal_kl: 1
  apply_loss_weight_1_kl: 0
  epochs: 800
  kl_balance: 0
  kl_free:
  - 0
  - 0
  kl_ratio:
  - 1.0
  - 1.0
  kl_ratio_apply: 0
  loss1_weight_anneal_v: quad
  opt:
    beta1: 0.9
    beta2: 0.99
    ema_decay: 0.9999
    grad_clip: -1.0
    lr: 0.001
    lr_min: 0.0001
    momentum: 0.9
    scheduler: ''
    start_ratio: 0.6
    step_decay: 0.998
    type: adam
    vae_lr_warmup_epochs: 0
    weight_decay: 0.0
  rec_balance: 0
  seed: 1
  sn_reg_vae: 0
  sn_reg_vae_weight: 0.0
  type: trainers.hvae_trainer
  use_grad_scalar: 0
  use_kl_free: 0
  warmup_epochs: 0
use_checkpoint: 0
vis_latent_point: 0
viz:
  log_freq: -1
  save_freq: 2000
  val_freq: 200
  vis_sample_ddim_step: 0
  viz_freq: -400
  viz_order:
  - 2
  - 0
  - 1
voxel2pts:
  diffusion_steps:
  - 0
  init_weight: ''
weight_recont: 1.0

ZENGXH · 2023-01-25T21:50:14Z

Hi, I tried VAE training with batch size 40 on 4 GPUs and I also get a similar NaN issue. However, the same training code works with batch size 32. The reason is not clear to me; it seems the training somehow does not work with a batch size of 40 or larger.
While I am thinking about this, perhaps you can try a batch size of 32 for now? Sorry about that!
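Until the cause is known, one way to narrow it down is to stop at the first non-finite loss and report which loss term went bad. The sketch below is plain Python and is not code from this repo: `check_loss`, `NonFiniteLossError`, and the term names `recon` and `kl` are hypothetical. In a PyTorch loop one would pass `loss.item()` and the individual terms as floats.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the loss is NaN or infinite."""


def check_loss(loss_value, step, terms=None):
    """Raise at the first NaN/inf loss and name the non-finite terms."""
    if math.isfinite(loss_value):
        return
    # Collect the names of the individual terms that are NaN or inf.
    bad = [name for name, v in (terms or {}).items() if not math.isfinite(v)]
    raise NonFiniteLossError(
        f"non-finite loss {loss_value} at step {step}; "
        f"non-finite terms: {bad or 'unknown'}"
    )


# A finite loss passes silently.
check_loss(1.25, step=0, terms={"recon": 1.0, "kl": 0.25})

# A NaN loss stops the run and points at the offending term.
try:
    check_loss(float("nan"), step=1, terms={"recon": 1.0, "kl": float("nan")})
except NonFiniteLossError as err:
    print(err)
```

The config above also has a `set_detect_anomaly: 0` key. If it maps to `torch.autograd.set_detect_anomaly` (an assumption, not checked against the code), setting it to 1 might help locate the operation that first produces the NaN, at the cost of slower training.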

Zhiyuan-R · 2023-01-25T21:53:16Z

Thanks for your hard work! I cannot believe you ran it yourself! That is so nice of you! Have a good night!

ZENGXH closed this as completed Jan 25, 2023

supriya-gdptl mentioned this issue Jun 14, 2023

NaN loss while training stage 1 VAE #47

Open
