Validation script error: Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight() #9

Open
owoshch opened this issue Aug 4, 2021 · 5 comments

@owoshch
owoshch commented Aug 4, 2021

Hi!

I'm trying to reproduce the validation results from your work using the PyTorch validation script. I changed the dataset path and ran the command sh ./scripts/release/dsnet/val_dsnet_pytorch_dist_custom.sh

It starts to execute correctly, but at the 7th of 4071 validation steps it freezes and outputs:

fname=08/velodyne/000007.bin, ins_num=29]> /home/fkitashov/Documents/repositories/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':
(Pdb)

Have you ever encountered such a problem? If so, how did you resolve it? Thank you

Terminal output:

/home/fkitashov/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : cfg_train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_c3imu6dz/none_n8t9sofz
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/fkitashov/anaconda3/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_c3imu6dz/none_n8t9sofz/attempt_0/0/error.json
2021-08-04 18:59:39,727 INFO Start logging
2021-08-04 18:59:39,727 INFO CUDA_VISIBLE_DEVICES=ALL
2021-08-04 18:59:39,727 INFO total_batch_size: 1
2021-08-04 18:59:39,727 INFO config cfgs/release/dsnet_custom.yaml
2021-08-04 18:59:39,727 INFO ckpt_name PolarOffset.pth
2021-08-04 18:59:39,727 INFO launcher pytorch
2021-08-04 18:59:39,727 INFO batch_size 1
2021-08-04 18:59:39,727 INFO tcp_port 12345
2021-08-04 18:59:39,727 INFO local_rank 0
2021-08-04 18:59:39,727 INFO sync_bn False
2021-08-04 18:59:39,727 INFO tag val_dsnet_pytorch_dist
2021-08-04 18:59:39,727 INFO onlyval True
2021-08-04 18:59:39,727 INFO saveval False
2021-08-04 18:59:39,727 INFO onlytest False
2021-08-04 18:59:39,727 INFO pretrained_ckpt pretrained_weight/dsnet_pretrain_pq_0.577.pth
2021-08-04 18:59:39,727 INFO nofix False
2021-08-04 18:59:39,727 INFO fix_semantic_instance True
2021-08-04 18:59:39,727 INFO cfg.ROOT_DIR: /home/fkitashov/Documents/repositories/DS-Net
2021-08-04 18:59:39,728 INFO cfg.LOCAL_RANK: 0
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATASET_NAME: SemanticKitti
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATASET_PATH: /datasets/KITTI/dataset/
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.NCLASS: 20
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.RETURN_REF: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.RETURN_INS_ID: True
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG.DATALOADER = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.VOXEL_TYPE: Spherical
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.GRID_SIZE: [480, 360, 32]
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG.DATALOADER.AUGMENTATION = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.ROTATE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.FLIP: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.TRANSFORM: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.TRANSFORM_STD: [0.1, 0.1, 0.1]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.SCALE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.IGNORE_LABEL: 255
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.CONVERT_IGNORE_LABEL: 0
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.FIXED_VOLUME_SPACE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.MAX_VOLUME_SPACE: [50, 3.141592653589793, 1.5]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.MIN_VOLUME_SPACE: [3, -3.141592653589793, -3]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.CENTER_TYPE: Axis_center
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.DATA_DIM: 9
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.NUM_WORKER: 1
2021-08-04 18:59:39,728 INFO
cfg.OPTIMIZE = edict()
2021-08-04 18:59:39,728 INFO cfg.OPTIMIZE.LR: 0.002
2021-08-04 18:59:39,728 INFO cfg.OPTIMIZE.MAX_EPOCH: 50
2021-08-04 18:59:39,728 INFO
cfg.MODEL = edict()
2021-08-04 18:59:39,728 INFO cfg.MODEL.NAME: PolarOffsetSpconvPytorchMeanshift
2021-08-04 18:59:39,728 INFO
cfg.MODEL.MODEL_FN = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.PT_POOLING: max
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.MAX_PT_PER_ENCODE: 256
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.PT_SELECTION: random
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.FEATURE_COMPRESSION: 16
2021-08-04 18:59:39,729 INFO
cfg.MODEL.VFE = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.VFE.NAME: PointNet
2021-08-04 18:59:39,729 INFO cfg.MODEL.VFE.OUT_CHANNEL: 64
2021-08-04 18:59:39,729 INFO
cfg.MODEL.BACKBONE = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.BACKBONE.NAME: Spconv_salsaNet_res_cfg
2021-08-04 18:59:39,729 INFO cfg.MODEL.BACKBONE.INIT_SIZE: 32
2021-08-04 18:59:39,729 INFO
cfg.MODEL.SEM_HEAD = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.SEM_HEAD.NAME: Spconv_sem_logits_head_cfg
2021-08-04 18:59:39,729 INFO
cfg.MODEL.INS_HEAD = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_HEAD.NAME: Spconv_ins_offset_concatxyz_threelayers_head_cfg
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_HEAD.EMBEDDING_CHANNEL: 3
2021-08-04 18:59:39,729 INFO
cfg.MODEL.MEANSHIFT = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.NAME: pytorch_meanshift
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.BANDWIDTH: [0.2, 1.7, 3.2]
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.ITERATION: 4
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.DATA_MODE: offset
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.SHIFT_MODE: matrix_flat_kernel_bandwidth_weight
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.DOWNSAMPLE_MODE: xyz
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.POINT_NUM_TH: 10000
2021-08-04 18:59:39,729 INFO cfg.MODEL.SEM_LOSS: Lovasz_loss
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_LOSS: offset_loss_regress_vec
2021-08-04 18:59:39,729 INFO
cfg.MODEL.POST_PROCESSING = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.CLUSTER_ALGO: MeanShift_embedding_cluster
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.BANDWIDTH: 0.65
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.MERGE_FUNC: merge_ins_sem
2021-08-04 18:59:39,729 INFO cfg.DIST_TRAIN: True
2021-08-04 18:59:39,731 INFO Building dataloader for val set.
2021-08-04 18:59:39,765 INFO Flip Augmentation: False
2021-08-04 18:59:39,765 INFO Scale Augmentation: False
2021-08-04 18:59:39,765 INFO Transform Augmentation: False
2021-08-04 18:59:39,765 INFO Rotate Augmentation: False
2021-08-04 18:59:39,765 INFO Shuffle: False
2021-08-04 18:59:41,624 INFO ==> Loading parameters from pre-trained checkpoint pretrained_weight/dsnet_pretrain_pq_0.577.pth to CPU
2021-08-04 18:59:42,101 INFO Freezing backbone, semantic and instance part of the model.
2021-08-04 18:59:42,101 INFO Not using lr scheduler
2021-08-04 18:59:42,135 INFO DistributedDataParallel(
(module): PolarOffsetSpconvPytorchMeanshift(
(fea_compression): Sequential(
(0): Linear(in_features=64, out_features=16, bias=True)
(1): ReLU()
)
(backbone): Spconv_salsaNet_res_cfg(
(downCntx): ResContextBlock(
(conv1): SubMConv3d()
(bn0): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01)
(conv1_2): SubMConv3d()
(bn0_2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_2): LeakyReLU(negative_slope=0.01)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(resBlock2): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock3): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock4): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock5): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(upBlock0): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock1): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock2): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock3): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(ReconNet): ReconBlock(
(conv1): SubMConv3d()
(bn0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): Sigmoid()
(conv1_2): SubMConv3d()
(bn0_2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_2): Sigmoid()
(conv1_3): SubMConv3d()
(bn0_3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_3): Sigmoid()
)
)
(sem_head): Spconv_sem_logits_head_cfg(
(logits): SubMConv3d()
)
(ins_head): Spconv_ins_offset_concatxyz_threelayers_head_cfg(
(conv1): SubMConv3d()
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01)
(conv2): SubMConv3d()
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01)
(conv3): SubMConv3d()
(bn3): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act3): LeakyReLU(negative_slope=0.01)
(offset): Sequential(
(0): Linear(in_features=35, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(offset_linear): Linear(in_features=32, out_features=3, bias=True)
)
(vfe_model): PointNet(
(PPmodel): Sequential(
(0): BatchNorm1d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=9, out_features=64, bias=True)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Linear(in_features=64, out_features=128, bias=True)
(5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): ReLU()
(7): Linear(in_features=128, out_features=256, bias=True)
(8): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(9): ReLU()
(10): Linear(in_features=256, out_features=64, bias=True)
)
)
(sem_loss): CrossEntropyLoss()
(pytorch_meanshift): PytorchMeanshift(
(learnable_bandwidth_weights_layer_list): ModuleList(
(0): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(1): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(2): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(3): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
)
)
)
)
2021-08-04 18:59:42.231665: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 18:59:42,821 INFO Start Training
2021-08-04 18:59:42,822 INFO ----EPOCH -1 Evaluating----
New evaluator with min_points of 50
New evaluator with min_points of 50
0%|▏ | 8/4071 [00:12<1:33:04, 1.37s/it, loss=3.99, fname=08/velodyne/000007.bin, ins_num=29]> /home/fkitashov/Documents/repositories/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':
(Pdb)
(Pdb)

@hongfz16
Owner

hongfz16 commented Aug 8, 2021

Hi. Sorry for the late reply. Thank you for your interest in our work.
I have not run into this error before, but I think some variable may be turning into NaN at this line. You could try tracing back to where the NaN first appears using the pdb breakpoint.
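One way to follow this suggestion is to check, at the pdb prompt, which tensors in scope are no longer finite. The sketch below is only an illustration: `report_non_finite` is a hypothetical helper, and `weights`, `shifted`, and `normalized` are toy stand-ins, not the actual variables in pytorch_meanshift.py.

```python
import torch


def report_non_finite(**tensors):
    """Return the names of the given tensors that contain NaN or Inf."""
    bad = []
    for name, value in tensors.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            bad.append(name)
    return bad


# Toy stand-ins (assumption: not the real DS-Net tensors).
weights = torch.tensor([[0.5, 0.5], [0.0, 0.0]])
shifted = torch.ones(2) / weights.sum(dim=1)              # second entry is 1/0 = inf
normalized = weights / weights.sum(dim=1, keepdim=True)   # second row is 0/0 = nan

bad_names = report_non_finite(weights=weights, shifted=shifted, normalized=normalized)
print(bad_names)  # ['shifted', 'normalized']
```

At the `(Pdb)` prompt, the same check can be run on the local variables of the current frame, then repeated after moving up the stack with the `up` command until the first non-finite tensor is found.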

@starnstar

Hi. I had the same error. When I set --pretrained_ckpt=dsnet_pretrain_pq_0.577.pth in the train*/val*/test*.sh scripts, it returned the error:

(Pdb) > /DS-Net/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':

I'd be very grateful if you can tell me how to solve it. Thanks a lot.

@hamin-song

Hello, I have the same problem as you.

0%| | 2/1018 [00:03<25:40, 1.52s/it, loss=4.09, fname=08/velodyne/000004.bin, ins_num=36]> /home/user/Desktop/panopticSeg/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':

If anyone has solved it, please let me know how...

@jasong-ovo

Hello, I ran into the same problem when I ran "bash scripts/release/dsnet/train_dsnet_slurm_dist_ii.sh".

File "/mnt/cache/gongjunchao/workdir/DS-Net/network/modules/pytorch_meanshift.py", line 92, in calc_shifted_matrix_flat_kernel_bandwidth_weight
new_X = torch.sum(torch.stack(new_X_list), dim=0) / torch.sum(weights, dim=1).view(-1)
(function _print_stack)

RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
0%| | 0/2392 [01:18<?, ?it/s]

I'd like to know how to fix this bug. Thanks!
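For context, a message of this form is what PyTorch's autograd anomaly detection raises when a backward function produces NaN gradients. The toy sketch below reproduces that kind of error with a 0/0 division. It is only an illustration of the mechanism: `backward_error_message` is a hypothetical helper, and a zero denominator is an assumption here, not a confirmed cause of the DS-Net failure.

```python
import torch


def backward_error_message(numerator, denominator):
    """Run backward through a division with anomaly detection enabled.

    Returns the error text if autograd reports NaN gradients, else None.
    """
    num = torch.tensor(numerator, requires_grad=True)
    den = torch.tensor(denominator, requires_grad=True)
    try:
        with torch.autograd.detect_anomaly():
            (num / den).sum().backward()
    except RuntimeError as err:
        return str(err)
    return None


# 0/0: the gradient with respect to the denominator is NaN, so autograd raises.
print(backward_error_message([0.0], [0.0]))
# A well-conditioned division passes silently.
print(backward_error_message([1.0], [2.0]))
```

The forward pass does not raise on its own. The error appears only during the backward pass with anomaly detection enabled, which is why the traceback points at the division rather than at the operation that first produced the bad value.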

@jasong-ovo

Hello, I found that in my case this bug is caused by a numerical error between the operator "**" and "torch.mm".
To fix it, I changed the function "pariwise_distance" in network/loss/instance_losses.py.

image

I hope this could help you!
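The modified function was posted as an image and is not reproduced in this text. As a rough illustration of the kind of numerical issue described, the sketch below assumes the distance is computed with the expansion ||x||^2 + ||y||^2 - 2 x.y, where rounding differences between the `**` terms and the `torch.mm` term can leave small negative values that become NaN under a later square root. All function names here are hypothetical, and neither variant is claimed to be the fix from the screenshot.

```python
import torch


def pairwise_sq_dist_expanded(x, y):
    """Squared distances via ||x||^2 + ||y||^2 - 2 x.y (may dip slightly below 0)."""
    x_norm = (x ** 2).sum(dim=1).view(-1, 1)
    y_norm = (y ** 2).sum(dim=1).view(1, -1)
    return x_norm + y_norm - 2.0 * torch.mm(x, y.t())


def pairwise_sq_dist_clamped(x, y):
    """Same expansion, with rounding-induced negatives clamped to zero."""
    return pairwise_sq_dist_expanded(x, y).clamp_min(0.0)


def pairwise_sq_dist_direct(x, y):
    """Squared distances from explicit differences; never negative, uses more memory."""
    diff = x.unsqueeze(1) - y.unsqueeze(0)   # shape (N, M, D)
    return (diff ** 2).sum(dim=2)


torch.manual_seed(0)
points = torch.randn(64, 3) * 100.0          # toy point cloud, not DS-Net data
clamped = pairwise_sq_dist_clamped(points, points)
direct = pairwise_sq_dist_direct(points, points)
print(torch.isfinite(clamped.sqrt()).all().item(), torch.isfinite(direct.sqrt()).all().item())
```

The clamped version keeps the memory footprint of the matrix product, while the direct version avoids the cancellation entirely at the cost of an N x M x D intermediate tensor.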
