
🐛 [Bug] Segfault when trying to export detectron2 model (GeneralizedRCNN) to Torch-TensorRT #1932


Closed
anandrajaraman opened this issue May 18, 2023 · 2 comments
Labels: bug (Something isn't working), No Activity

Comments


anandrajaraman commented May 18, 2023

Bug Description

I trained a detectron2 GeneralizedRCNN model, as found in the detectron2 repo, and keep running into a segfault when exporting the trained weights with Torch-TensorRT following the example instructions.
I used export_model.py in scripting mode to export a scripted GeneralizedRCNN model with a ResNet-50 backbone.

I built a Docker environment as described in the repo (#1852) to use PyTorch 2.1.0, Torch-TensorRT 1.5.0, TensorRT 8.6, CUDA 11.8, and cuDNN 8.8.
I have also tried the same export with the stable release of Torch-TensorRT 1.3.0 and still get the segfault.

Can you provide any guidance or information on these errors, or confirm whether Torch-TensorRT has been tested against any models from the detectron2 model zoo?

To Reproduce

Steps to reproduce the behavior:

  1. Get a GeneralizedRCNN model with a ResNet-50 backbone, as found in the detectron2 repo.
  2. Add the following snippet to call the Torch-TensorRT compile API on the scripted model (i.e. the torch.jit.script output) in export_model.py after L102:
        # Build a TorchScript-TRT module for export.
        # Assumes `import torch_tensorrt as torchtrt`; ts_model, input_tensor,
        # output and PathManager are already defined in export_model.py.
        trt_ts_model = torchtrt.compile(
            ts_model,
            inputs=[input_tensor],
            enabled_precisions={torch.half},
            min_block_size=3,
            workspace_size=1 << 32,
        )
        with PathManager.open(os.path.join(output, "model_torch_trt.ts"), "wb") as f:
            torch.jit.save(trt_ts_model, f)
  3. Run the export_model.py script to export the model in scripting mode; compilation produces debug output like the following before crashing:
DEBUG: [Torch-TensorRT] - Setting node %23878 : Tensor = aten::_convolution(%21058, %self.model.backbone.bottom_up.stages.2.5.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23876, %23877, %144, %23876, %23876, %23876, %23876) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21068 : Tensor = aten::add(%out.129, %21038, %144) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:208:8 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %3710 : Tensor = aten::relu(%21068) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:209:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23881 : Tensor = aten::_convolution(%3710, %self.model.backbone.bottom_up.stages.3.0.conv1.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %141, %132, %139, %23879, %23880, %144, %23879, %23879, %23879, %23879) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21156 : Tensor = aten::relu(%out.2) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23884 : Tensor = aten::_convolution(%21156, %self.model.backbone.bottom_up.stages.3.0.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23882, %23883, %144, %23882, %23882, %23882, %23882) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21166 : Tensor = aten::relu(%out.10) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23887 : Tensor = aten::_convolution(%21166, %self.model.backbone.bottom_up.stages.3.0.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23885, %23886, %144, %23885, %23885, %23885, %23885) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23890 : Tensor = aten::_convolution(%3710, %self.model.backbone.bottom_up.stages.3.0.shortcut.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %141, %132, %139, %23888, %23889, %144, %23888, %23888, %23888, %23888) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21196 : Tensor = aten::relu(%out.24) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23896 : Tensor = aten::_convolution(%21196, %self.model.backbone.bottom_up.stages.3.1.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23894, %23895, %144, %23894, %23894, %23894, %23894) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21206 : Tensor = aten::relu(%out.28) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23899 : Tensor = aten::_convolution(%21206, %self.model.backbone.bottom_up.stages.3.1.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23897, %23898, %144, %23897, %23897, %23897, %23897) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21227 : Tensor = aten::relu(%out.1) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:196:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23905 : Tensor = aten::_convolution(%21227, %self.model.backbone.bottom_up.stages.3.2.conv2.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %139, %139, %23903, %23904, %144, %23903, %23903, %23903, %23903) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21237 : Tensor = aten::relu(%out.9) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:199:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %23908 : Tensor = aten::_convolution(%21237, %self.model.backbone.bottom_up.stages.3.2.conv3.weight, %self.model.backbone.bottom_up.stem.conv1.bias.443, %139, %132, %139, %23906, %23907, %144, %23906, %23906, %23906, %23906) to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %21247 : Tensor = aten::add(%out.17, %21217, %144) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:208:8 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %3722 : Tensor = aten::relu(%21247) # /usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/resnet.py:209:14 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
DEBUG: [Torch-TensorRT] - Setting node %1228 : Tensor = aten::max_pool2d(%top_block_in_feature, %139, %141, %132, %139, %182) # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:788:11 to run torch due owning block not large enough to exceed user specified min_block_size (previously was to run in tensorrt)
Segmentation fault (core dumped)


Expected behavior

As shown above, once the Torch-TRT compile step runs, the conversion errors out midway with a segmentation fault (core dumped). I have tried different versions and the error persists.
The expected behaviour is that the model is exported and model_torch_trt.ts is generated for use.
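
For reference, once compile and save succeed, the exported module would be consumed roughly as follows. This is a minimal sketch: the "output/model_torch_trt.ts" path and the (1, 3, 1344, 1344) half-precision input are assumptions taken from the snippet and compile spec in this issue, and the exact input format depends on the wrapper used in export_model.py.

    import torch

    # Load the saved Torch-TensorRT TorchScript module and run one dummy input.
    trt_model = torch.jit.load("output/model_torch_trt.ts").cuda().eval()
    dummy = torch.randn(1, 3, 1344, 1344, device="cuda").half()
    with torch.no_grad():
        outputs = trt_model(dummy)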

Environment

Build information about Torch-TensorRT can be found by turning on debug messages
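
The DEBUG partitioning messages quoted above come from running the compile with debug-level logging enabled. A minimal sketch of one way to do that with the torch_tensorrt logging API (assuming the 1.x Python package layout, and ts_model / input_tensor as in the snippet above):

    import torch
    import torch_tensorrt as torchtrt

    # Print partitioning and conversion details globally ...
    torchtrt.logging.set_reportable_log_level(torchtrt.logging.Level.Debug)

    # ... or scope the verbosity to the compile call only.
    with torchtrt.logging.debug():
        trt_ts_model = torchtrt.compile(ts_model,
                                        inputs=[input_tensor],
                                        enabled_precisions={torch.half})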

All the packages and versions listed here come from the Torch-TensorRT Docker container built from the instructions referenced above (#1852).

  • Torch-TensorRT Version (e.g. 1.0.0): 1.5.0.dev0+ac3ab77a
  • PyTorch Version (e.g. 1.0): 2.1.0+dev20230419
  • CPU Architecture:
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): docker installed from the build instructions
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version: 3.10
  • CUDA version: 11.8
  • GPU models and configuration: Nvidia RTX 3050 Ti
  • Any other relevant information:

Additional context

Additional Torch-TRT compile spec info

DEBUG: [Torch-TensorRT] - TensorRT Compile Spec: {
    "Inputs": [
        Input(shape=(1,3,1344,1344,), dtype=Half, format=Contiguous/Linear/NCHW, tensor_domain=[0, 2))
    ]
    "Enabled Precision": [Half, ]
    "TF32 Disabled": 0
    "Sparsity": 0
    "Refit": 0
    "Debug": 0
    "Device":  {
        "device_type": GPU
        "allow_gpu_fallback": False
        "gpu_id": 0
        "dla_core": -1
    }

    "Engine Capability": Default
    "Num Avg Timing Iters": 1
    "Workspace Size": 4294967296
    "DLA SRAM Size": 1048576
    "DLA Local DRAM Size": 1073741824
    "DLA Global DRAM Size": 536870912
    "Truncate long and double": 0
    "Torch Fallback":  {
        "enabled": True
        "min_block_size": 3
        "forced_fallback_operators": [
        ]
        "forced_fallback_modules": [
        ]
    }
}
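
For reference, the compile spec above corresponds to a Python call roughly along the following lines. This is a sketch reconstructed from the dump: the explicit Input object mirrors the shape and dtype of the "Inputs" entry, and torch_executed_modules is the compile argument behind the (empty) forced-fallback module list.

    import torch
    import torch_tensorrt as torchtrt

    # Sketch only: the spec dump above expressed as an explicit compile call.
    # ts_model is the scripted GeneralizedRCNN produced by export_model.py.
    trt_ts_model = torchtrt.compile(
        ts_model,
        inputs=[torchtrt.Input(shape=(1, 3, 1344, 1344), dtype=torch.half)],
        enabled_precisions={torch.half},   # "Enabled Precision": [Half]
        workspace_size=1 << 32,            # "Workspace Size": 4294967296
        min_block_size=3,                  # "Torch Fallback" -> "min_block_size": 3
        torch_executed_modules=[],         # "forced_fallback_modules": []
    )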

anandrajaraman added the bug (Something isn't working) label on May 18, 2023
github-actions commented

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

