Runtime failure with customized dataset #27

Open
robinren03 opened this issue Dec 19, 2023 · 1 comment
robinren03 commented Dec 19, 2023

I have built a customized dataset in COCO format with only 2 classes of labels: 9,999 images in the training set and 108 in the val set. When I run training with the following command,

bash tools/dist_train.sh ~/ViTDet/configs/ViTDet/ViTDet-ViTAE-Base-100e.py 2 --cfg-options model.pretrained=/home/robin/ViTDet/ViTAE-Base-GPU.pth

it fails with the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 256, 16, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

On further inspection, I captured the traceback; the very last part of it is as follows:

File "/home/robin/ViTDet/mmdet/models/detectors/two_stage.py", line 135, in forward_train                                                                                                            
    rpn_losses, proposal_list = self.rpn_head.forward_train(                                                                                                                                                       
File "/home/robin/ViTDet/mmdet/models/dense_heads/base_dense_head.py", line 321, in forward_train                                                                                                                 
outs = self(x)                                                                                                                                                                                                 
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/ViTDet/mmdet/models/dense_heads/anchor_head.py", line 171, in forward                                                                                                                           
return multi_apply(self.forward_single, feats)                                                                                                                                                                 
File "/home/robin/ViTDet/mmdet/core/utils/misc.py", line 30, in multi_apply                                                                                                                                       
return tuple(map(list, zip(*map_results)))                                                                                                                                                                     
File "/home/robin/ViTDet/mmdet/models/dense_heads/rpn_head.py", line 66, in forward_single                                                                                                                        
x = self.rpn_conv(x)                                                                                                                                                                                           
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward                                                                                             
input = module(input)                                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/ViTDet/mmdet/models/utils/convModule_norm.py", line 27, in forward                                                                                                                              
x = self.activate(x)                                                                                                                                                                                           
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 102, in forward                                                                                            
return F.relu(input, inplace=self.inplace)                                                                                                                                                                     
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/functional.py", line 1457, in relu                                                                                                      
result = torch.relu(input)                                                                                                                                                                                     
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack                                                                                                 
return traceback.format_stack()                                                                                                                                                                               
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.) 
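
For context on what this message means: an autograd node (ReluBackward0 here) saved its output tensor for the backward pass, and some later operation wrote into that tensor in place before backward() ran, bumping its version counter. A minimal sketch of the failure mode, using the same tensor shape purely for illustration (nothing below is taken from the ViTDet code):

import torch
import torch.nn as nn

# Anomaly mode is what produced the format_stack frames in the log above.
torch.autograd.set_detect_anomaly(True)

conv = nn.Conv2d(256, 256, 3, padding=1)
relu = nn.ReLU()  # ReluBackward0 saves its output to compute the gradient

x = torch.randn(2, 256, 16, 16, requires_grad=True)
y = relu(conv(x))   # y is saved at version 0
loss = y.sum()
y += 1              # a later in-place write bumps y's version to 1
loss.backward()     # RuntimeError: ... output 0 of ReluBackward0, is at
                    # version 1; expected version 0 instead

If that reading is right, the usual workaround is to find the later in-place write and make it out-of-place (e.g. x = x + residual instead of x += residual), or to avoid sharing the activation's output across branches; I have not pinned down which module performs the in-place write here.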

My config file is as follows:

_base_ = '../_base_/default_runtime.py'
# dataset settings
dataset_type = 'CocoDataset'
classes = ('device', 'block')
data_root = '/home/ryanyu/ViTDet/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
image_size = (1024, 1024)

file_client_args = dict(backend='disk')
# comment out the code below to use different file client
# file_client_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection/',
#         'data/': 's3://openmmlab/datasets/detection/'
#     }))

train_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(
        type='Resize',
        img_scale=image_size,
        ratio_range=(0.1, 2.0),
        multiscale_mode='range',
        keep_ratio=True),
    dict(
        type='RandomCrop',
        crop_type='absolute_range',
        crop_size=image_size,
        recompute_bbox=True,
        allow_negative_crop=True),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=image_size),  # padding to image_size gives a 0.5+ mAP gain
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=1024),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

# Use RepeatDataset to speed up training
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='RepeatDataset',
        times=4,  # simply change this from 2 to 16 for 50e - 400e training.
        dataset=dict(
            type=dataset_type,
            classes=classes,
            ann_file=data_root + 'annotations/instances_train.json',
            img_prefix=data_root + 'train/',
            pipeline=train_pipeline)),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/instances_test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/instances_test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric=['bbox'])

# optimizer assumes bs=64
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.00004)
optimizer_config = dict(grad_clip=None)

lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.067,
    step=[22, 24])
runner = dict(type='EpochBasedRunner', max_epochs=25)
norm_cfg = dict(type='LN', requires_grad=False)
# Use MMSyncBN that handles empty tensor in head. It can be changed to
# SyncBN after https://github.com/pytorch/pytorch/issues/36530 is fixed
# Requires MMCV-full after  https://github.com/open-mmlab/mmcv/pull/1205.
head_norm_cfg = dict(type='LN', requires_grad=False)

pretrained = None  # noqa
# model settings
model = dict(
    type='MaskRCNN',
    pretrained=pretrained,
    backbone=dict(
        type='ViTAE',
        img_size=1024,
        embed_dim=768,
        depth=12,
        num_heads=12,
        num_classes=2,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.1,
        use_abs_pos_emb=True,
        use_checkpoint=True
        ),
    neck=dict(
        type='FPN',
        in_channels=[768, 768, 768, 768],
        out_channels=256,
        norm_cfg=norm_cfg,
        use_residual=False,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        num_convs=2,
        norm_cfg=head_norm_cfg,
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared4Conv1FCBBoxHead',
            conv_out_channels=256,
            norm_cfg=head_norm_cfg,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=2,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0., 0., 0., 0.],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        mask_head=dict(
            type='FCNMaskHead',
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=2,
            norm_cfg=head_norm_cfg,
            loss_mask=dict(
                type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
    # model training and testing settings
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100,
            mask_thr_binary=0.5)))

optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.1,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.7,
        custom_keys={
            'bias': dict(decay_mult=0.),
            'pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.),
            'rel_pos_h': dict(decay_mult=0.),
            'rel_pos_w': dict(decay_mult=0.),
        }))
lr_config = dict(warmup_iters=250) # 16 * 1000 == 250 * 64
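
For what it's worth, since the failure only appears with the customized dataset, the annotations can be sanity-checked quickly with pycocotools. A minimal sketch, reusing the paths from the config above and the image counts from my description (both are assumptions, not values the config enforces):

from pycocotools.coco import COCO

data_root = '/home/ryanyu/ViTDet/coco/'  # same root as in the config above
for ann_file, expected_imgs in [('instances_train.json', 9999),
                                ('instances_test.json', 108)]:
    coco = COCO(data_root + 'annotations/' + ann_file)
    names = [c['name'] for c in coco.loadCats(coco.getCatIds())]
    print(f'{ann_file}: {len(coco.getImgIds())} images '
          f'(expected {expected_imgs}), categories: {names}')
    # every name in the config's `classes` tuple must exist in the annotations
    assert set(names) >= {'device', 'block'}, names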

Thanks for your assistance!

robinren03 (Author) commented:

My environment is as follows:
PyTorch: 1.13.1
TorchVision: 0.14.0+cu117
OpenCV: 4.8.1
MMCV: 1.3.18 (with careful modifications to remove #include <THC/THC.h> and cherry-picks to make it compatible with PyTorch 1.11+)
MMCV Compiler: GCC 11.4
MMCV CUDA Compiler: 11.7
MMDetection: 2.18.0+19dd30a
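
For completeness, MMDetection 2.x ships a collect_env helper that prints all of this in one report; a short sketch, assuming this ViTDet fork keeps it:

# Assumes this fork keeps mmdet.utils.collect_env from upstream MMDetection 2.x.
from mmdet.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')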
