Validation script error: Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight() #9

owoshch · 2021-08-04T16:04:08Z

Hi!

I'm trying to reproduce the validation results from your work using the validation script for pytorch. I changed the path to the dataset and ran the command sh ./scripts/release/dsnet/val_dsnet_pytorch_dist_custom.sh

It starts to execute correctly, but at about the 7th of 4071 validation steps it freezes and outputs:

fname=08/velodyne/000007.bin, ins_num=29]> /home/fkitashov/Documents/repositories/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':
(Pdb)

Have you ever encountered such a problem? If so, how did you resolve it? Thank you

Terminal output:

/home/fkitashov/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : cfg_train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_c3imu6dz/none_n8t9sofz
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/fkitashov/anaconda3/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_c3imu6dz/none_n8t9sofz/attempt_0/0/error.json
2021-08-04 18:59:39,727 INFO Start logging
2021-08-04 18:59:39,727 INFO CUDA_VISIBLE_DEVICES=ALL
2021-08-04 18:59:39,727 INFO total_batch_size: 1
2021-08-04 18:59:39,727 INFO config cfgs/release/dsnet_custom.yaml
2021-08-04 18:59:39,727 INFO ckpt_name PolarOffset.pth
2021-08-04 18:59:39,727 INFO launcher pytorch
2021-08-04 18:59:39,727 INFO batch_size 1
2021-08-04 18:59:39,727 INFO tcp_port 12345
2021-08-04 18:59:39,727 INFO local_rank 0
2021-08-04 18:59:39,727 INFO sync_bn False
2021-08-04 18:59:39,727 INFO tag val_dsnet_pytorch_dist
2021-08-04 18:59:39,727 INFO onlyval True
2021-08-04 18:59:39,727 INFO saveval False
2021-08-04 18:59:39,727 INFO onlytest False
2021-08-04 18:59:39,727 INFO pretrained_ckpt pretrained_weight/dsnet_pretrain_pq_0.577.pth
2021-08-04 18:59:39,727 INFO nofix False
2021-08-04 18:59:39,727 INFO fix_semantic_instance True
2021-08-04 18:59:39,727 INFO cfg.ROOT_DIR: /home/fkitashov/Documents/repositories/DS-Net
2021-08-04 18:59:39,728 INFO cfg.LOCAL_RANK: 0
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATASET_NAME: SemanticKitti
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATASET_PATH: /datasets/KITTI/dataset/
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.NCLASS: 20
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.RETURN_REF: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.RETURN_INS_ID: True
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG.DATALOADER = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.VOXEL_TYPE: Spherical
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.GRID_SIZE: [480, 360, 32]
2021-08-04 18:59:39,728 INFO
cfg.DATA_CONFIG.DATALOADER.AUGMENTATION = edict()
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.ROTATE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.FLIP: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.TRANSFORM: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.TRANSFORM_STD: [0.1, 0.1, 0.1]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.AUGMENTATION.SCALE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.IGNORE_LABEL: 255
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.CONVERT_IGNORE_LABEL: 0
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.FIXED_VOLUME_SPACE: True
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.MAX_VOLUME_SPACE: [50, 3.141592653589793, 1.5]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.MIN_VOLUME_SPACE: [3, -3.141592653589793, -3]
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.CENTER_TYPE: Axis_center
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.DATA_DIM: 9
2021-08-04 18:59:39,728 INFO cfg.DATA_CONFIG.DATALOADER.NUM_WORKER: 1
2021-08-04 18:59:39,728 INFO
cfg.OPTIMIZE = edict()
2021-08-04 18:59:39,728 INFO cfg.OPTIMIZE.LR: 0.002
2021-08-04 18:59:39,728 INFO cfg.OPTIMIZE.MAX_EPOCH: 50
2021-08-04 18:59:39,728 INFO
cfg.MODEL = edict()
2021-08-04 18:59:39,728 INFO cfg.MODEL.NAME: PolarOffsetSpconvPytorchMeanshift
2021-08-04 18:59:39,728 INFO
cfg.MODEL.MODEL_FN = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.PT_POOLING: max
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.MAX_PT_PER_ENCODE: 256
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.PT_SELECTION: random
2021-08-04 18:59:39,729 INFO cfg.MODEL.MODEL_FN.FEATURE_COMPRESSION: 16
2021-08-04 18:59:39,729 INFO
cfg.MODEL.VFE = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.VFE.NAME: PointNet
2021-08-04 18:59:39,729 INFO cfg.MODEL.VFE.OUT_CHANNEL: 64
2021-08-04 18:59:39,729 INFO
cfg.MODEL.BACKBONE = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.BACKBONE.NAME: Spconv_salsaNet_res_cfg
2021-08-04 18:59:39,729 INFO cfg.MODEL.BACKBONE.INIT_SIZE: 32
2021-08-04 18:59:39,729 INFO
cfg.MODEL.SEM_HEAD = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.SEM_HEAD.NAME: Spconv_sem_logits_head_cfg
2021-08-04 18:59:39,729 INFO
cfg.MODEL.INS_HEAD = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_HEAD.NAME: Spconv_ins_offset_concatxyz_threelayers_head_cfg
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_HEAD.EMBEDDING_CHANNEL: 3
2021-08-04 18:59:39,729 INFO
cfg.MODEL.MEANSHIFT = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.NAME: pytorch_meanshift
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.BANDWIDTH: [0.2, 1.7, 3.2]
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.ITERATION: 4
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.DATA_MODE: offset
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.SHIFT_MODE: matrix_flat_kernel_bandwidth_weight
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.DOWNSAMPLE_MODE: xyz
2021-08-04 18:59:39,729 INFO cfg.MODEL.MEANSHIFT.POINT_NUM_TH: 10000
2021-08-04 18:59:39,729 INFO cfg.MODEL.SEM_LOSS: Lovasz_loss
2021-08-04 18:59:39,729 INFO cfg.MODEL.INS_LOSS: offset_loss_regress_vec
2021-08-04 18:59:39,729 INFO
cfg.MODEL.POST_PROCESSING = edict()
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.CLUSTER_ALGO: MeanShift_embedding_cluster
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.BANDWIDTH: 0.65
2021-08-04 18:59:39,729 INFO cfg.MODEL.POST_PROCESSING.MERGE_FUNC: merge_ins_sem
2021-08-04 18:59:39,729 INFO cfg.DIST_TRAIN: True
2021-08-04 18:59:39,731 INFO Building dataloader for val set.
2021-08-04 18:59:39,765 INFO Flip Augmentation: False
2021-08-04 18:59:39,765 INFO Scale Augmentation: False
2021-08-04 18:59:39,765 INFO Transform Augmentation: False
2021-08-04 18:59:39,765 INFO Rotate Augmentation: False
2021-08-04 18:59:39,765 INFO Shuffle: False
2021-08-04 18:59:41,624 INFO ==> Loading parameters from pre-trained checkpoint pretrained_weight/dsnet_pretrain_pq_0.577.pth to CPU
2021-08-04 18:59:42,101 INFO Freezing backbone, semantic and instance part of the model.
2021-08-04 18:59:42,101 INFO Not using lr scheduler
2021-08-04 18:59:42,135 INFO DistributedDataParallel(
(module): PolarOffsetSpconvPytorchMeanshift(
(fea_compression): Sequential(
(0): Linear(in_features=64, out_features=16, bias=True)
(1): ReLU()
)
(backbone): Spconv_salsaNet_res_cfg(
(downCntx): ResContextBlock(
(conv1): SubMConv3d()
(bn0): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01)
(conv1_2): SubMConv3d()
(bn0_2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_2): LeakyReLU(negative_slope=0.01)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(resBlock2): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock3): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock4): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(resBlock5): ResBlock(
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1_2): SubMConv3d()
(act1_2): LeakyReLU(negative_slope=0.01)
(bn0_2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool): SparseConv3d()
)
(upBlock0): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock1): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock2): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(upBlock3): UpBlock(
(trans_dilao): SubMConv3d()
(trans_act): LeakyReLU(negative_slope=0.01)
(trans_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv1): SubMConv3d()
(act1): LeakyReLU(negative_slope=0.01)
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): SubMConv3d()
(act2): LeakyReLU(negative_slope=0.01)
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): SubMConv3d()
(act3): LeakyReLU(negative_slope=0.01)
(bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(up_subm): SparseInverseConv3d()
)
(ReconNet): ReconBlock(
(conv1): SubMConv3d()
(bn0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): Sigmoid()
(conv1_2): SubMConv3d()
(bn0_2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_2): Sigmoid()
(conv1_3): SubMConv3d()
(bn0_3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1_3): Sigmoid()
)
)
(sem_head): Spconv_sem_logits_head_cfg(
(logits): SubMConv3d()
)
(ins_head): Spconv_ins_offset_concatxyz_threelayers_head_cfg(
(conv1): SubMConv3d()
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01)
(conv2): SubMConv3d()
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01)
(conv3): SubMConv3d()
(bn3): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act3): LeakyReLU(negative_slope=0.01)
(offset): Sequential(
(0): Linear(in_features=35, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(offset_linear): Linear(in_features=32, out_features=3, bias=True)
)
(vfe_model): PointNet(
(PPmodel): Sequential(
(0): BatchNorm1d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=9, out_features=64, bias=True)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Linear(in_features=64, out_features=128, bias=True)
(5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): ReLU()
(7): Linear(in_features=128, out_features=256, bias=True)
(8): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(9): ReLU()
(10): Linear(in_features=256, out_features=64, bias=True)
)
)
(sem_loss): CrossEntropyLoss()
(pytorch_meanshift): PytorchMeanshift(
(learnable_bandwidth_weights_layer_list): ModuleList(
(0): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(1): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(2): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
(3): Sequential(
(0): Linear(in_features=32, out_features=32, bias=True)
(1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Linear(in_features=32, out_features=3, bias=True)
)
)
)
)
)
2021-08-04 18:59:42.231665: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 18:59:42,821 INFO Start Training
2021-08-04 18:59:42,822 INFO ----EPOCH -1 Evaluating----
New evaluator with min_points of 50
New evaluator with min_points of 50
0%|▏ | 8/4071 [00:12<1:33:04, 1.37s/it, loss=3.99, fname=08/velodyne/000007.bin, ins_num=29]> /home/fkitashov/Documents/repositories/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':
(Pdb)
(Pdb)


hongfz16 · 2021-08-08T09:29:42Z

Hi. Sorry for the late reply. Thank you for your interest in our work.
I have not run into this error before, but I think some variable may be turning into nan at this line. Since the script has stopped at a pdb breakpoint, you could use that prompt to trace back to where the nan first appears.
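
A minimal sketch of that kind of check, assuming the script has stopped at the (Pdb) prompt shown above. The helper names `contains_nan` and `nan_variables` are hypothetical and not part of DS-Net. They rely only on the `tolist()` method that torch tensors and NumPy arrays provide, so the sketch itself runs without PyTorch installed.

```python
import math


def contains_nan(value):
    """Return True if value (float, nested list/tuple, or tensor-like) holds a NaN."""
    # torch tensors and NumPy arrays expose tolist(); classes are skipped
    # because calling an unbound tolist would fail.
    if hasattr(value, "tolist") and not isinstance(value, type):
        value = value.tolist()
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, (list, tuple)):
        return any(contains_nan(v) for v in value)
    return False


def nan_variables(namespace):
    """Return the sorted names in a dict such as locals() whose values contain a NaN."""
    return sorted(name for name, value in namespace.items() if contains_nan(value))
```

If the helper is importable from the debugging session, `p nan_variables(locals())` at the prompt would list the local variables of the current frame that already hold a NaN, and moving up the stack with `up` and repeating the call narrows down where it first appears. For large tensors, `torch.isnan(t).any()` is the faster check.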

starnstar · 2022-02-14T08:58:37Z

Hi. I had the same error. When I set --pretrained_ckpt=dsnet_pretrain_pq_0.577.pth in each of the train*/val*/test*.sh scripts, it returned the error:

(Pdb) > /DS-Net/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':

I'd be very grateful if you could tell me how to solve it. Thanks a lot.

hamin-song · 2022-04-06T14:39:20Z

Hello, I have the same problem as you.

0%| | 2/1018 [00:03<25:40, 1.52s/it, loss=4.09, fname=08/velodyne/000004.bin, ins_num=36]> /home/user/Desktop/panopticSeg/DS-Net/network/modules/pytorch_meanshift.py(95)calc_shifted_matrix_flat_kernel_bandwidth_weight()
-> if self.data_mode == 'offset':

If anyone has solved it, please let me know how...

jasong-ovo · 2022-06-17T06:20:35Z

Hello, I ran into the same problem when I ran "bash scripts/release/dsnet/train_dsnet_slurm_dist_ii.sh".

File "/mnt/cache/gongjunchao/workdir/DS-Net/network/modules/pytorch_meanshift.py", line 92, in calc_shifted_matrix_flat_kernel_bandwidth_weight
new_X = torch.sum(torch.stack(new_X_list), dim=0) / torch.sum(weights, dim=1).view(-1)
(function _print_stack)

RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
0%| | 0/2392 [01:18<?, ?it/s]

I'd like to know how to fix this bug. Thanks!

jasong-ovo · 2022-06-18T04:07:54Z

Hello, I found that in my case this bug is caused by a numerical discrepancy between the "**" operator and "torch.mm".
To fix it, I changed the function "pariwise_distance" in network/loss/instance_losses.py.

I hope this could help you!
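
The comment does not show the change itself. A common way to guard this kind of computation is sketched below, as an assumption rather than the actual fix. When squared distances are expanded as ||a||^2 + ||b||^2 - 2 a.b, the squared norms and the dot products are computed along different paths, so the result for near-identical points can come out slightly negative and its square root becomes nan. The sketch is plain Python so that it runs without PyTorch, and the name `pairwise_distance_clamped` is hypothetical.

```python
import math


def pairwise_distance_clamped(points):
    """Euclidean distance matrix via ||a||^2 + ||b||^2 - 2 a.b, clamped at zero."""
    # Squared norms computed with "**", as in the comment above.
    sq_norms = [sum(v ** 2 for v in p) for p in points]
    dist = []
    for i, a in enumerate(points):
        row = []
        for j, b in enumerate(points):
            # Dot product computed separately; this stands in for torch.mm.
            dot = sum(x * y for x, y in zip(a, b))
            d2 = sq_norms[i] + sq_norms[j] - 2.0 * dot
            # Rounding can leave d2 slightly below zero. Without the clamp,
            # math.sqrt would raise ValueError (torch.sqrt would return nan).
            row.append(math.sqrt(max(d2, 0.0)))
        dist.append(row)
    return dist
```

With tensors, the equivalent guard would be `torch.clamp(d2, min=0)` before `torch.sqrt`.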
