Runtime failure with customized dataset #27

Open
robinren03 opened this issue Dec 19, 2023 · 1 comment
robinren03 commented Dec 19, 2023

I have built a customized dataset in COCO format with only 2 classes of labels: 9,999 images in the training set and 108 in the val set. When I run training with the following command,

bash tools/dist_train.sh ~/ViTDet/configs/ViTDet/ViTDet-ViTAE-Base-100e.py 2 --cfg-options model.pretrained=/home/robin/ViTDet/ViTAE-Base-GPU.pth

it fails with the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 256, 16, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

On further inspection, I captured the traceback; the very last part of it is as follows:

File "/home/robin/ViTDet/mmdet/models/detectors/two_stage.py", line 135, in forward_train                                                                                                            
    rpn_losses, proposal_list = self.rpn_head.forward_train(                                                                                                                                                       
File "/home/robin/ViTDet/mmdet/models/dense_heads/base_dense_head.py", line 321, in forward_train                                                                                                                 
outs = self(x)                                                                                                                                                                                                 
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/ViTDet/mmdet/models/dense_heads/anchor_head.py", line 171, in forward                                                                                                                           
return multi_apply(self.forward_single, feats)                                                                                                                                                                 
File "/home/robin/ViTDet/mmdet/core/utils/misc.py", line 30, in multi_apply                                                                                                                                       
return tuple(map(list, zip(*map_results)))                                                                                                                                                                     
File "/home/robin/ViTDet/mmdet/models/dense_heads/rpn_head.py", line 66, in forward_single                                                                                                                        
x = self.rpn_conv(x)                                                                                                                                                                                           
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward                                                                                             
input = module(input)                                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/ViTDet/mmdet/models/utils/convModule_norm.py", line 27, in forward                                                                                                                              
x = self.activate(x)                                                                                                                                                                                           
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl                                                                                            
return forward_call(*input, **kwargs)                                                                                                                                                                          
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 102, in forward                                                                                            
return F.relu(input, inplace=self.inplace)                                                                                                                                                                     
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/nn/functional.py", line 1457, in relu                                                                                                      
result = torch.relu(input)                                                                                                                                                                                     
File "/home/robin/anaconda3/envs/vit/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack                                                                                                 
return traceback.format_stack()                                                                                                                                                                               
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.) 
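
For context on what this message means: an autograd node (ReluBackward0 here) saved its output tensor for the backward pass, and some later operation wrote into that tensor in place before backward() ran, bumping its version counter. A minimal sketch of the failure mode, using the same tensor shape purely for illustration (nothing below is taken from the ViTDet code):

import torch
import torch.nn as nn

# Anomaly mode is what produced the format_stack frames in the log above.
torch.autograd.set_detect_anomaly(True)

conv = nn.Conv2d(256, 256, 3, padding=1)
relu = nn.ReLU()  # ReluBackward0 saves its output to compute the gradient

x = torch.randn(2, 256, 16, 16, requires_grad=True)
y = relu(conv(x))   # y is saved at version 0
loss = y.sum()
y += 1              # a later in-place write bumps y's version to 1
loss.backward()     # RuntimeError: ... output 0 of ReluBackward0, is at
                    # version 1; expected version 0 instead

If that reading is right, the usual workaround is to find the later in-place write and make it out-of-place (e.g. x = x + residual instead of x += residual), or to avoid sharing the activation's output across branches; I have not pinned down which module performs the in-place write here.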

My config file is as follows:

_base_ = '../_base_/default_runtime.py'
# dataset settings
dataset_type = 'CocoDataset'
classes = ('device', 'block')
data_root = '/home/ryanyu/ViTDet/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
image_size = (1024, 1024)

file_client_args = dict(backend='disk')
# comment out the code below to use different file client
# file_client_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection/',
#         'data/': 's3://openmmlab/datasets/detection/'
#     }))

train_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(
        type='Resize',
        img_scale=image_size,
        ratio_range=(0.1, 2.0),
        multiscale_mode='range',
        keep_ratio=True),
    dict(
        type='RandomCrop',
        crop_type='absolute_range',
        crop_size=image_size,
        recompute_bbox=True,
        allow_negative_crop=True),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=image_size),  # padding to image_size gives a 0.5+ mAP gain
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=1024),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

# Use RepeatDataset to speed up training
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='RepeatDataset',
        times=4,  # simply change this from 2 to 16 for 50e - 400e training.
        dataset=dict(
            type=dataset_type,
            classes=classes,
            ann_file=data_root + 'annotations/instances_train.json',
            img_prefix=data_root + 'train/',
            pipeline=train_pipeline)),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/instances_test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/instances_test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric=['bbox'])

# optimizer assumes bs=64
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.00004)
optimizer_config = dict(grad_clip=None)

lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.067,
    step=[22, 24])
runner = dict(type='EpochBasedRunner', max_epochs=25)
norm_cfg = dict(type='LN', requires_grad=False)
# Use MMSyncBN that handles empty tensor in head. It can be changed to
# SyncBN after https://github.com/pytorch/pytorch/issues/36530 is fixed
# Requires MMCV-full after  https://github.com/open-mmlab/mmcv/pull/1205.
head_norm_cfg = dict(type='LN', requires_grad=False)

pretrained = None  # noqa
# model settings
model = dict(
    type='MaskRCNN',
    pretrained=pretrained,
    backbone=dict(
        type='ViTAE',
        img_size=1024,
        embed_dim=768,
        depth=12,
        num_heads=12,
        num_classes=2,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.1,
        use_abs_pos_emb=True,
        use_checkpoint=True
        ),
    neck=dict(
        type='FPN',
        in_channels=[768, 768, 768, 768],
        out_channels=256,
        norm_cfg=norm_cfg,
        use_residual=False,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        num_convs=2,
        norm_cfg=head_norm_cfg,
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared4Conv1FCBBoxHead',
            conv_out_channels=256,
            norm_cfg=head_norm_cfg,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=2,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0., 0., 0., 0.],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        mask_head=dict(
            type='FCNMaskHead',
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=2,
            norm_cfg=head_norm_cfg,
            loss_mask=dict(
                type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
    # model training and testing settings
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100,
            mask_thr_binary=0.5)))

optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.1,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.7,
        custom_keys={
            'bias': dict(decay_mult=0.),
            'pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.),
            'rel_pos_h': dict(decay_mult=0.),
            'rel_pos_w': dict(decay_mult=0.),
        }))
lr_config = dict(warmup_iters=250) # 16 * 1000 == 250 * 64
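
For what it's worth, since the failure only appears with the customized dataset, the annotations can be sanity-checked quickly with pycocotools. A minimal sketch, reusing the paths from the config above and the image counts from my description (both are assumptions, not values the config enforces):

from pycocotools.coco import COCO

data_root = '/home/ryanyu/ViTDet/coco/'  # same root as in the config above
for ann_file, expected_imgs in [('instances_train.json', 9999),
                                ('instances_test.json', 108)]:
    coco = COCO(data_root + 'annotations/' + ann_file)
    names = [c['name'] for c in coco.loadCats(coco.getCatIds())]
    print(f'{ann_file}: {len(coco.getImgIds())} images '
          f'(expected {expected_imgs}), categories: {names}')
    # every name in the config's `classes` tuple must exist in the annotations
    assert set(names) >= {'device', 'block'}, names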

Thanks for your assistance!

robinren03 (Author) commented:

My environment is as follows:
PyTorch: 1.13.1
TorchVision: 0.14.0+cu117
OpenCV: 4.8.1
MMCV: 1.3.18 (with careful modifications to remove #include <THC/THC.h> and cherry-picks to make it compatible with PyTorch 1.11+)
MMCV Compiler: GCC 11.4
MMCV CUDA Compiler: 11.7
MMDetection: 2.18.0+19dd30a
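
For completeness, MMDetection 2.x ships a collect_env helper that prints all of this in one report; a short sketch, assuming this ViTDet fork keeps it:

# Assumes this fork keeps mmdet.utils.collect_env from upstream MMDetection 2.x.
from mmdet.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')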
