
🐛 [Bug] Segfault when trying to export detectron2 model (GeneralizedRCNN) to Torch-TensorRT #1932

Closed
@anandrajaraman

Description

Bug Description

I trained a detectron2 GeneralizedRCNN model, as found in the detectron2 repo, and keep running into a segfault when trying to export the trained weights with Torch-TensorRT, following the example instructions.
I used export_model.py in scripting mode to export a scripted GeneralizedRCNN model with a ResNet-50 backbone.

I built a Docker environment as described in the repo (#1852), with PyTorch 2.1.0, Torch-TensorRT 1.5.0, TensorRT 8.6, CUDA 11.8, and cuDNN 8.8.
I have also tried the export with the stable Torch-TensorRT 1.3.0 release and still get the segfault.

Can you provide any guidance on these errors, or have you tested Torch-TensorRT with any models from the detectron2 model zoo?

To Reproduce

Steps to reproduce the behavior:

  1. Get a GeneralizedRCNN model with a ResNet-50 backbone, as found in the detectron2 repo.
  2. Add the following snippet to export_model.py after L102 to call Torch-TensorRT compile on the scripted model (i.e. torch.jit.script):
        # Build a TorchScript-TensorRT module for export
        trt_ts_model = torchtrt.compile(ts_model,
                                        inputs=[input_tensor],
                                        enabled_precisions={torch.half},
                                        min_block_size=3,
                                        workspace_size=1 << 32)
        with PathManager.open(os.path.join(output, "model_torch_trt.ts"), "wb") as f:
            torch.jit.save(trt_ts_model, f)  # note: trt_ts_model.save(trt_ts_model, f) is incorrect; save is an instance method
  3. Run the export_model.py script to export the model in scripting mode and observe the following debug output before the crash:
DEBUG: [Torch-TensorRT] - Setting node %23878 : Tensor = aten::_convolution(%21058, %self.model.backbone.bottom_up.stages.2.5.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23876, %23877, %144, %23876, %23876, %23876, %23876) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21068 : Tensor = aten::add(%out.129, %21038, %144) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:208:8 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %3710 : Tensor = aten::relu(%21068) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:209:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23881 : Tensor = aten::_convolution(%3710, %self.model.backbone.bottom_up.stages.3.0.conv1.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %141, %132, %139, %23879, %23880, %144, %23879, %23879, %23879, %23879) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21156 : Tensor = aten::relu(%out.2) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23884 : Tensor = aten::_convolution(%21156, %self.model.backbone.bottom_up.stages.3.0.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23882, %23883, %144, %23882, %23882, %23882, %23882) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21166 : Tensor = aten::relu(%out.10) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23887 : Tensor = aten::_convolution(%21166, %self.model.backbone.bottom_up.stages.3.0.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23885, %23886, %144, %23885, %23885, %23885, %23885) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23890 : Tensor = aten::_convolution(%3710, %self.model.backbone.bottom_up.stages.3.0.shortcut.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %141, %132, %139, %23888, %23889, %144, %23888, %23888, %23888, %23888) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21196 : Tensor = aten::relu(%out.24) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23896 : Tensor = aten::_convolution(%21196, %self.model.backbone.bottom_up.stages.3.1.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23894, %23895, %144, %23894, %23894, %23894, %23894) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21206 : Tensor = aten::relu(%out.28) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23899 : Tensor = aten::_convolution(%21206, %self.model.backbone.bottom_up.stages.3.1.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23897, %23898, %144, %23897, %23897, %23897, %23897) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21227 : Tensor = aten::relu(%out.1) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23905 : Tensor = aten::_convolution(%21227, %self.model.backbone.bottom_up.stages.3.2.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23903, %23904, %144, %23903, %23903, %23903, %23903) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21237 : Tensor = aten::relu(%out.9) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23908 : Tensor = aten::_convolution(%21237, %self.model.backbone.bottom_up.stages.3.2.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23906, %23907, %144, %23906, %23906, %23906, %23906) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21247 : Tensor = aten::add(%out.17, %21217, %144) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:208:8 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %3722 : Tensor = aten::relu(%21247) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:209:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %1228 : Tensor = aten::max_pool2d(%top_block_in_feature, %139, %141, %132, %139, %182) # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:788:11 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
Segmentation fault (core dumped)
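For context, the repeated fallback messages above come from Torch-TensorRT's graph partitioner: contiguous runs of TensorRT-convertible ops shorter than min_block_size are demoted back to Torch execution. The toy function below is my own simplified illustration of that rule, not the actual partitioner:

```python
def partition(supported, min_block_size):
    """Group a per-op 'convertible to TensorRT' flag list into contiguous
    segments, then demote TRT segments shorter than min_block_size back to
    Torch -- a toy model of the behavior seen in the debug log above."""
    segments = []
    for flag in supported:
        if segments and segments[-1][0] == flag:
            segments[-1][1] += 1
        else:
            segments.append([flag, 1])
    # Demote short TRT segments, mirroring the "owning block not large
    # enough to exceed user specified min_block_size" message.
    return [("tensorrt" if flag and size >= min_block_size else "torch", size)
            for flag, size in segments]

print(partition([True, True, False, True, True, True, True], 3))
# [('torch', 2), ('torch', 1), ('tensorrt', 4)]
```

With min_block_size=3, the leading run of two convertible ops is sent back to Torch, just like the aten::_convolution/aten::relu nodes in the log.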


Expected behavior

As shown above, when the Torch-TensorRT compile module is called, the conversion errors out midway with a Segmentation fault (core dumped). I have tried different versions and the error persists.
The expected behavior would be that the model exports successfully and model_torch_trt.ts is generated for use.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

All the packages and versions listed here come from the Torch-TensorRT Docker container built per the instructions referenced above (#1852).

  • Torch-TensorRT Version (e.g. 1.0.0): 1.5.0.dev0+ac3ab77a
  • PyTorch Version (e.g. 1.0): 2.1.0+dev20230419
  • CPU Architecture:
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): docker installed from the build instructions
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version: 3.10
  • CUDA version: 11.8
  • GPU models and configuration: Nvidia RTX 3050 Ti
  • Any other relevant information:

Additional context

Additional Torch-TRT compile spec info

DEBUG: [Torch-TensorRT] - TensorRT Compile Spec: {
    "Inputs": [
Input(shape=(1,3,1344,1344,), dtype=Half, format=Contiguous/Linear/NCHW, tensor_domain=[0, 2))    ]
    "Enabled Precision": [Half, ]
    "TF32 Disabled": 0
    "Sparsity": 0
    "Refit": 0
    "Debug": 0
    "Device":  {
        "device_type": GPU
        "allow_gpu_fallback": False
        "gpu_id": 0
        "dla_core": -1
    }

    "Engine Capability": Default
    "Num Avg Timing Iters": 1
    "Workspace Size": 4294967296
    "DLA SRAM Size": 1048576
    "DLA Local DRAM Size": 1073741824
    "DLA Global DRAM Size": 536870912
    "Truncate long and double": 0
    "Torch Fallback":  {
        "enabled": True
        "min_block_size": 3
        "forced_fallback_operators": [
        ]
        "forced_fallback_modules": [
        ]
    }
}
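As a quick sanity check (my own arithmetic, not from the log), the byte counts echoed in the spec are exact powers of two, and the workspace value matches the 1 << 32 passed in the compile snippet above:

```python
# Powers of two behind the sizes printed in the compile spec.
workspace = 1 << 32        # workspace_size passed to torchtrt.compile
dla_sram = 1 << 20         # "DLA SRAM Size"
dla_local_dram = 1 << 30   # "DLA Local DRAM Size"
dla_global_dram = 1 << 29  # "DLA Global DRAM Size"

print(workspace)           # 4294967296 bytes, i.e. 4 GiB
print(dla_sram)            # 1048576
print(dla_local_dram)      # 1073741824
print(dla_global_dram)     # 536870912
```

So the spec confirms the requested 4 GiB workspace was picked up; the DLA sizes are Torch-TensorRT defaults and irrelevant here since dla_core is -1.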
