
assert(context.shape[1] == self.num_points*self.context_dim) shapes don't match #51

Open
kg571852741 opened this issue Aug 24, 2023 · 4 comments


kg571852741 commented Aug 24, 2023

Hi @ZENGXH, thanks for your hard work. I am testing a custom dataset of shape (1076, 200000, 3), i.e. 200k-point point clouds. I've adjusted a few lines in pointflow_datasets.py, but the final shapes don't match in models/latent_points_ada.py. Is there any way to solve this, or do you have suggestions?

```
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
```

and the train_vae script settings:

```
shapelatent.decoder_num_points 100000 \
data.tr_max_sample_points 100000 data.te_max_sample_points 100000 \
```

Revised a few lines of code:

```python
        # TODO: why do we need this??
        # self.train_points = self.all_points[:, :min(
        #     10000, self.all_points.shape[1])]  # subsample 15k points to 10k points per shape
        self.train_points = self.all_points[:, :min(
            200000, self.all_points.shape[1])]  # keep up to 200k points per shape
        self.tr_sample_size = min(10000, tr_sample_size)  # intended: 100k points per shape
        self.te_sample_size = min(5000, te_sample_size)
```
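The numbers in the failing assert line up with the `min(10000, tr_sample_size)` cap in the snippet above: the decoder is built for 100000 points with context_dim 4, but the dataset only yields 10000 points per shape, so the context has 10000 × 4 = 40000 features instead of the expected 400000. A minimal sketch of that arithmetic (values copied from the log; this is not the LION source, and the suggested cap change is an assumption, not a confirmed fix):

```python
# Shape bookkeeping behind the failing assert, with values from the log above.
decoder_num_points = 100_000           # shapelatent.decoder_num_points
context_dim = 4                        # self.context_dim printed in the log

tr_sample_size = min(10_000, 100_000)  # the min(10000, ...) cap keeps only 10k
context_width = tr_sample_size * decoder_num_points // decoder_num_points * context_dim
expected_width = decoder_num_points * context_dim

print(context_width, expected_width)   # 40000 vs 400000 -> assert fails
# Raising the cap, e.g. min(200_000, tr_sample_size), would make the two
# widths agree (hypothetical change, not verified against the repo).
```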
2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/data_t_npy/
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/data__npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/data__npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/data__npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/75ce6d1e28c3496c9b264a8567167fcc
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 347, in accept
    return self._accept(callback)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 384, in _accept
    callback(list_to_sent)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 511, in _send_stdout_messages_batch
    self._process_rest_api_send(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 591, in _process_rest_api_send
    sender(**kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 3231, in send_stdout_batch
    self.post_from_endpoint(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2031, in post_from_endpoint
    return self._result_from_http_method(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2053, in _result_from_http_method
    return method(url, payload, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2134, in post
    return super(RestApiClient, self).post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 1988, in post
    response = self.low_level_api_client.post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 536, in post
    return self.do(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 639, in do
    response = session.request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
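For readability, these are the shape-related overrides buried in the command above, i.e. the values that have to agree with the sampling caps in pointflow_datasets.py (extracted verbatim from the invocation, not a new command):

```shell
shapelatent.decoder_num_points 100000 \
data.tr_max_sample_points 100000 \
data.te_max_sample_points 100000 \
data.batch_size 1
```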
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-24 22:37:47.706 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0824/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:214 - save config at ../exp/0824/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:217 - log dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-24 22:37:47.737 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.118s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-24 22:37:49.185 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1


2023-08-24 22:37:55.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:55.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:55.501 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:55.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:55.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:55.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:55.557 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:55.557 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:55.558 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:55.609 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:56.937 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:37:57.509 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/transform_buildingnet_npy/house/train 
2023-08-24 22:37:58.454 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:38:02.066 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:38:04.353 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:38:04.396 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:38:04.398 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_buildingnet_npy/house/val 
2023-08-24 22:38:04.514 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:38:04.855 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:38:04.863 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:38:04.865 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:38:05.123 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f5f3a86d880>
tr_x[-1].shape:  torch.Size([1, 10000, 3])
2023-08-24 22:38:05.383 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 10000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 10000, 3])
2023-08-24 22:38:05.396 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:38:05.397 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:38:05.398 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:38:05.399 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
> /home/bim-group/Documents/GitHub/LION/models/latent_points_ada.py(279)forward()
-> assert(context.shape[1] == self.num_points*self.context_dim)
(Pdb) 
```2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/_npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/_npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/t_npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/_npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/75ce6d1e28c3496c9b264a8567167fcc
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-24 22:37:47.706 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0824/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:214 - save config at ../exp/0824/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:217 - log dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-24 22:37:47.737 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.118s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-24 22:37:49.185 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1
COMET INFO: Experiment is live on comet.com https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0

2023-08-24 22:37:55.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:55.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:55.501 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:55.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:55.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:55.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:55.557 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:55.557 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:55.558 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:55.609 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:56.937 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: datanpy/
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/_npy/; norm global=True, norm-box=False
2023-08-24 22:37:57.509 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/py/house/train 
2023-08-24 22:37:58.454 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:38:02.066 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:38:04.353 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/npy/
2023-08-24 22:38:04.396 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/npy/; norm global=True, norm-box=False
2023-08-24 22:38:04.398 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/_npy/house/val 
2023-08-24 22:38:04.514 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:38:04.855 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:38:04.863 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:38:04.865 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:38:05.123 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f5f3a86d880>
tr_x[-1].shape:  torch.Size([1, 10000, 3])
2023-08-24 22:38:05.383 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 10000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 10000, 3])
2023-08-24 22:38:05.396 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:38:05.397 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:38:05.398 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:38:05.399 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
> /home/bim-group/Documents/GitHub/LION/models/latent_points_ada.py(279)forward()
-> assert(context.shape[1] == self.num_points*self.context_dim)
(Pdb) 
@ZENGXH
Collaborator

ZENGXH commented Aug 24, 2023

Could you try changing these two lines

self.tr_sample_size = min(10000, tr_sample_size) # 100k points per shape
self.te_sample_size = min(5000, te_sample_size) 

to

self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size

I guess it's because the original code caps the maximum number of points at 10k (which is not necessary when using a non-PointFlow dataset): as a result, the input to the model is 10k points instead of 100k points.
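
The mismatch can be reproduced with plain arithmetic. This is a hypothetical sketch (not the actual repo code); the values of `num_points` and `context_dim` are taken from the log output above:

```python
# Hypothetical sketch of the shape check in LatentPointDecPVC.forward,
# using the values printed in the log above.
num_points = 100000          # shapelatent.decoder_num_points
context_dim = 4              # self.context_dim from the log
tr_sample_size = 100000      # requested points per shape

capped = min(10000, tr_sample_size)    # original dataset code: capped at 10k
context_len = capped * context_dim     # what the encoder actually produces
expected = num_points * context_dim    # what the decoder asserts against

assert context_len == 40000            # matches context.shape[1] in the log
assert expected == 400000              # so the assert fails: 40000 != 400000
assert tr_sample_size * context_dim == expected  # removing the cap fixes it
```

So the decoder was built for 100k points while the dataloader silently fed it 10k, which is exactly the 40000 vs 400000 gap in the assertion.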

Btw, I have never tried generating 100k points before (usually we use 2048 points per shape): just curious, are you able to fit it into GPU memory?

@kg571852741
Author

kg571852741 commented Aug 24, 2023

@ZENGXH Thanks for the prompt reply! 👍

These are the training screenshots for generating 2048 points (the default setting, but with a changed data root path). Since the decoder and latent points are set to 2048, the final results are unable to capture the shape's pattern.

step:0

recont-train (Step: 0)

step:134400

recont-train (Step: 134400)

Q: Are you able to fit it into GPU memory?

A: I ran all my tests (2048 points at batch size 20, and 100k points at batch size 1) on 2 4090Ti GPUs (48GB memory), and no out-of-memory issues occurred. :D

After changing to

self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size

x_list returns [] at the breakpoint() debugger set in trainers/base_trainer.py:

            for k, v in output.items():
                if 'vis/' in k:
                    if b < x_0_pred.size(0):
                        x_list.append(x_0_pred[b])
                        name_list.append('pred')
                        print("vis_recont: ", k, v.shape)
                        print("x_0_pred: ", x_0_pred.shape)
                        print("x_0: ", x_0.shape)
                        print("x_t: ", x_t.shape)
                        breakpoint()
                    x_list.append(v[b])
                    name_list.append(k)
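
A self-contained mini version of that loop, with plain lists standing in for tensors, shows why `p b.shape` fails in pdb: `b` is a plain int batch index, not a tensor. The names mirror the snippet above; all values here are made up for illustration:

```python
# Hypothetical mini version of the vis_recont loop: collect batch element b
# from every 'vis/' entry in the output dict.
x_0_pred = [[[0.1, 0.2, 0.3]] * 4]                    # shape (1, 4, 3): batch of 1
output = {"vis/latent_pts": [[[0.0, 0.0, 0.0]] * 4],  # shape (1, 4, 3)
          "loss": 0.0}                                 # non-'vis/' keys are skipped
b = 0  # batch index: a plain int, so `b.shape` raises AttributeError in pdb

x_list, name_list = [], []
for k, v in output.items():
    if "vis/" in k:
        if b < len(x_0_pred):
            x_list.append(x_0_pred[b])   # predicted points for shape b
            name_list.append("pred")
        x_list.append(v[b])              # vis points for shape b: (num_points, 3)
        name_list.append(k)
```

Running this collects `["pred", "vis/latent_pts"]`; if `x_list` comes back empty in the real trainer, no `'vis/'` keys reached the output dict for that step.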

Before, with:

self.tr_sample_size = min(10000, tr_sample_size) # 100k points per shape
self.te_sample_size = min(5000, te_sample_size) 
  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

Log file after the change:

![first-10- (Step: 5599)](https://github.com/nv-tlabs/LION/assets/39424493/bdddcdaa-9de9-471d-b74b-d2cd22c54034)

(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-25 05:10:49.602 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0825/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO     | __main__:get_args:214 - save config at ../exp/0825/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-25 05:10:49.609 | INFO     | __main__:get_args:217 - log dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-25 05:10:49.609 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-25 05:10:49.632 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.111s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-25 05:10:50.861 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1


2023-08-25 05:10:56.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-25 05:10:56.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-25 05:10:56.500 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-25 05:10:56.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-25 05:10:56.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-25 05:10:56.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-25 05:10:56.558 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:56.558 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-25 05:10:56.559 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-25 05:10:56.611 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:57.821 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-25 05:10:57.821 | INFO     | trainers.base_trainer:build_other_module:725 - no other module to build
2023-08-25 05:10:57.821 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-25 05:10:58.309 | INFO     | datasets.pointflow_datasets:get_datasets:400 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_data/
2023-08-25 05:10:58.309 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:10:58.311 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1146] under: data/transform_data/house/train 
2023-08-25 05:10:59.056 | INFO     | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.7s | dir: ['house'] | sample_with_replacement: 1; num points: 1146
2023-08-25 05:11:01.133 | INFO     | datasets.pointflow_datasets:__init__:272 - [DATA] normalize_global: mean=[-0.00376302 -0.07752005 -0.00340251], std=[0.2103262]
2023-08-25 05:11:02.244 | INFO     | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(1146, 100000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.746, min=-2.361; num-pts=100000
searching: pointflow, get: data/transform_data/
2023-08-25 05:11:02.306 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:11:02.307 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_data/house/val 
2023-08-25 05:11:02.423 | INFO     | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-25 05:11:02.763 | INFO     | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.454, min=-2.361; num-pts=100000
2023-08-25 05:11:02.772 | INFO     | datasets.pointflow_datasets:get_data_loaders:469 - [Batch Size] train=1, test=10; drop-last=1
2023-08-25 05:11:02.775 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-25 05:11:03.076 | INFO     | trainers.base_trainer:prepare_vis_data:685 - [prepare_vis_data] len of train_loader: 1146
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f6ea9aff430>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-25 05:11:03.385 | INFO     | trainers.base_trainer:prepare_vis_data:704 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-25 05:11:03.399 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-25 05:11:03.400 | INFO     | trainers.base_trainer:set_writer:57 - 
----------

----------
2023-08-25 05:11:03.403 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-25 05:11:03.403 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1146 | log freq=1146, viz freq 458400, val freq 200 
context.shape forward( torch.Size([1, 400000])
context.shape[1] forward( 400000
vis_recont:  vis/latent_pts torch.Size([1, 100000, 3])
x_0_pred:  torch.Size([1, 100000, 3])
x_0:  torch.Size([1, 100000, 3])
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(373)vis_recont()
-> x_list.append(v[b])
(Pdb) p b.shape
*** AttributeError: 'int' object has no attribute 'shape'
(Pdb) p b 
0

@aldinorizaldy

Hi @kg571852741, I see your dataset is some sort of outdoor scene. Can you elaborate on how you prepare your custom dataset?

Thanks in advance.

@kg571852741
Author

Hi @aldinorizaldy. Sorry for the late reply. The work was done a very long time ago and I really cannot remember the settings, but I think I followed the organized data folder structure used for 'cifar10'?
