[Coverage] Support for non-zero output_padding in IDeconvolutionLayer (RuntimeError: Target aten.convolution.default does not support transposed=True) #3344

Open
Tracked by #3179
chohk88 opened this issue Jan 3, 2025 · 1 comment · May be fixed by #3343

chohk88 commented Jan 3, 2025

Description:

The runtime error occurs because the converter does not support deconvolution layers with non-zero output_padding. A validator is already in place to flag these cases in the code here. The limitation arises because TensorRT's IDeconvolutionLayer cannot handle output_padding directly. Alternatives such as pre_padding and post_padding exist, but they cannot fully replicate output_padding.
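
As a rough illustration of the kind of check such a validator performs, here is a minimal sketch. It is not the actual Torch-TensorRT validator: the function name `supports_conv_args` and the placeholder strings standing in for tensors are hypothetical. The only thing taken as given is the positional argument order of `aten.convolution.default` (input, weight, bias, stride, padding, dilation, transposed, output_padding, groups).

```python
from typing import Any, Sequence

# Positional layout of aten.convolution.default:
# (input, weight, bias, stride, padding, dilation, transposed, output_padding, groups)
TRANSPOSED_IDX = 6
OUTPUT_PADDING_IDX = 7


def supports_conv_args(args: Sequence[Any]) -> bool:
    """Hypothetical validator: reject transposed convs with non-zero output_padding."""
    transposed = bool(args[TRANSPOSED_IDX])
    if not transposed:
        # Regular convolutions are unaffected by this limitation.
        return True
    output_padding = args[OUTPUT_PADDING_IDX]
    return all(p == 0 for p in output_padding)


# Placeholder strings stand in for the input, weight, and bias tensors.
regular_conv = ("x", "w", "b", [1, 1], [1, 1], [1, 1], False, [0, 0], 1)
deconv_supported = ("x", "w", "b", [2, 2], [1, 1], [1, 1], True, [0, 0], 1)
deconv_flagged = ("x", "w", "b", [2, 2], [1, 1], [1, 1], True, [1, 1], 1)

for name, args in [
    ("regular_conv", regular_conv),
    ("deconv_supported", deconv_supported),
    ("deconv_flagged", deconv_flagged),
]:
    print(name, supports_conv_args(args))
```

Only the last case, a transposed convolution with output_padding of 1 in each spatial dimension, is rejected by this sketch.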

Current Observations:

  1. API Limitations:
    Attempts to use pre_padding and post_padding (as documented in the TensorRT API) to handle output_padding have been unsuccessful. Most cases involve output_padding=1, and even in these cases, achieving the desired functionality is challenging.

  2. Exporting to ONNX:
    Exporting a deconvolution operation with non-zero output_padding to ONNX and running it with trtexec does not produce errors. The following log snippet from trtexec suggests that pre_padding and post_padding may be involved internally during the deconvolution operation:

    [01/02/2025-15:30:59] [V] [TRT] Running deconvolution with: 
    Padding mode: NOTSET
    Pre-padding: (1, 1)
    Post-padding: (0, 0)
    [01/02/2025-15:30:59] [V] [TRT] Registering layer: /deconv/ConvTranspose for ONNX node: /deconv/ConvTranspose
    
  3. Legacy FX Converter:
    The current CTX converter does not support the deconv ops flagged by the validator, so these ops end up in the legacy FX converter, where the runtime error is raised. Removing the FX converter avoids the runtime error but introduces graph breaks. Related PR #3343.
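
The pre/post padding values in the trtexec log from observation 2 can be checked with plain shape arithmetic. The sketch below assumes the test layer uses stride 2 and padding 1 (neither is shown in the log; kernel size 3, input size 16, and output size 32 are). It compares PyTorch's ConvTranspose output-size formula with the explicit-padding deconvolution formula from the TensorRT documentation, and assumes the mapping pre = padding, post = padding - output_padding, which is only inferred from the logged values. It checks shapes only and says nothing about numerical equivalence.

```python
def torch_deconv_out(size, kernel, stride, padding, output_padding, dilation=1):
    """Output size of a PyTorch ConvTranspose layer along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def trt_deconv_out(size, kernel, stride, pre_padding, post_padding, dilation=1):
    """Output size of a deconvolution with explicit pre/post padding."""
    dilated_kernel = dilation * (kernel - 1) + 1
    return (size - 1) * stride + dilated_kernel - (pre_padding + post_padding)


# Kernel and input size come from the log; stride and padding are assumptions.
size, kernel, stride, padding, output_padding = 16, 3, 2, 1, 1

expected = torch_deconv_out(size, kernel, stride, padding, output_padding)

# Assumed mapping, consistent with "Pre-padding: (1, 1)" and "Post-padding: (0, 0)".
pre, post = padding, padding - output_padding
actual = trt_deconv_out(size, kernel, stride, pre, post)

print("PyTorch output size:", expected)
print("pre/post padding:", (pre, post), "-> output size:", actual)

# With padding=0 and output_padding=1 the same mapping needs a negative
# post-padding, which is the case this arithmetic cannot cover.
print("post-padding for padding=0, output_padding=1:", 0 - 1)
```

Under these assumptions both formulas give 32, matching the (1, 3, 32, 32) output reported in the full logs.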

Suggested Steps for Resolution:

  1. Converter Improvements:
    Develop a Torch-TensorRT converter that specifically addresses cases where output_padding=1. This would mitigate the issue for a significant portion of affected models. However, achieving this might be technically challenging given the current API limitations.

  2. Temporary Workaround:
    Remove the legacy FX converter and rely on graph breaks. While this approach sacrifices some optimization, it avoids runtime errors and enables the model to execute correctly.
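
One possible direction for the converter in step 1 is sketched below in pure Python. It is a single-channel, 1D reference without bias or dilation, and it has not been validated against TensorRT. All function names are hypothetical. It compares a direct transposed convolution with padding and output_padding against an unpadded transposed convolution that is cropped asymmetrically (pre = padding, post = padding - output_padding). The crop is only possible when padding >= output_padding; the remaining case would need extra handling and is left open.

```python
def conv_transpose1d(x, w, stride=1, padding=0, output_padding=0):
    """Direct 1D transposed convolution (single channel, dilation 1, no bias)."""
    out_len = (len(x) - 1) * stride - 2 * padding + (len(w) - 1) + output_padding + 1
    out = [0.0] * out_len
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            o = i * stride + k - padding  # output position this product lands on
            if 0 <= o < out_len:
                out[o] += xi * wk
    return out


def conv_transpose1d_cropped(x, w, stride, pre, post):
    """Unpadded transposed convolution, then crop `pre` from the start and `post` from the end."""
    if pre < 0 or post < 0:
        raise ValueError("negative crop: padding is smaller than output_padding")
    full = conv_transpose1d(x, w, stride=stride)
    return full[pre:len(full) - post]


x = [1.0, 2.0, 3.0]
w = [1.0, 2.0, 3.0]
stride, padding, output_padding = 2, 1, 1

direct = conv_transpose1d(x, w, stride, padding, output_padding)
cropped = conv_transpose1d_cropped(x, w, stride, pre=padding, post=padding - output_padding)

print("direct: ", direct)
print("cropped:", cropped)
print("match:  ", direct == cropped)
```

With padding=0 and output_padding=1 the crop would have to be negative, so `conv_transpose1d_cropped` raises `ValueError` instead of silently producing a shorter output.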

How to Reproduce the Runtime Error:

The following code demonstrates the issue by compiling and running a U-Net model with non-zero output_padding using Torch-TensorRT:

import torch
import torch_tensorrt
from monai.networks.nets import UNet

device = "cuda:0"

# Define a 2D U-Net model
model = UNet(
    spatial_dims=2,
    in_channels=3,
    out_channels=2,
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
    num_res_units=2,
    act="relu",
    norm="batch",
    dropout=0.1,
).to(device).half().eval()

# Input tensor; the model's deconvolution (upsampling) layers use non-zero output_padding
input_tensor = torch.randn(1, 3, 256, 256, device=device).half()

# Export the model
exported_program = torch.export.export(model, (input_tensor,))

# Compile using Torch-TensorRT
trt_model = torch_tensorrt.dynamo.compile(
    exported_program,
    inputs=[input_tensor],
    enabled_precisions={torch.float16},
    use_python_runtime=False,
    truncate_double=True,
    debug=True,
    min_block_size=1,
    device=device,  
)

# Run inference
with torch.no_grad():
    output = trt_model(input_tensor)

print("TRT-compiled model output shape:", output.shape)

Full logs:

&&&& RUNNING TensorRT.trtexec [TensorRT v100600] [b18] # ./TensorRT-10.6.0.18/bin/trtexec --onnx=test_outpad.onnx --dumpLayerInfo --verbose
[01/02/2025-15:30:56] [I] === Model Options ===
[01/02/2025-15:30:56] [I] Format: ONNX
[01/02/2025-15:30:56] [I] Model: test_outpad.onnx
[01/02/2025-15:30:56] [I] Output:
[01/02/2025-15:30:56] [I] === Build Options ===
[01/02/2025-15:30:56] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[01/02/2025-15:30:56] [I] avgTiming: 8
[01/02/2025-15:30:56] [I] Precision: FP32
[01/02/2025-15:30:56] [I] LayerPrecisions: 
[01/02/2025-15:30:56] [I] Layer Device Types: 
[01/02/2025-15:30:56] [I] Calibration: 
[01/02/2025-15:30:56] [I] Refit: Disabled
[01/02/2025-15:30:56] [I] Strip weights: Disabled
[01/02/2025-15:30:56] [I] Version Compatible: Disabled
[01/02/2025-15:30:56] [I] ONNX Plugin InstanceNorm: Disabled
[01/02/2025-15:30:56] [I] TensorRT runtime: full
[01/02/2025-15:30:56] [I] Lean DLL Path: 
[01/02/2025-15:30:56] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[01/02/2025-15:30:56] [I] Exclude Lean Runtime: Disabled
[01/02/2025-15:30:56] [I] Sparsity: Disabled
[01/02/2025-15:30:56] [I] Safe mode: Disabled
[01/02/2025-15:30:56] [I] Build DLA standalone loadable: Disabled
[01/02/2025-15:30:56] [I] Allow GPU fallback for DLA: Disabled
[01/02/2025-15:30:56] [I] DirectIO mode: Disabled
[01/02/2025-15:30:56] [I] Restricted mode: Disabled
[01/02/2025-15:30:56] [I] Skip inference: Disabled
[01/02/2025-15:30:56] [I] Save engine: 
[01/02/2025-15:30:56] [I] Load engine: 
[01/02/2025-15:30:56] [I] Profiling verbosity: 0
[01/02/2025-15:30:56] [I] Tactic sources: Using default tactic sources
[01/02/2025-15:30:56] [I] timingCacheMode: local
[01/02/2025-15:30:56] [I] timingCacheFile: 
[01/02/2025-15:30:56] [I] Enable Compilation Cache: Enabled
[01/02/2025-15:30:56] [I] Enable Monitor Memory: Disabled
[01/02/2025-15:30:56] [I] errorOnTimingCacheMiss: Disabled
[01/02/2025-15:30:56] [I] Preview Features: Use default preview flags.
[01/02/2025-15:30:56] [I] MaxAuxStreams: -1
[01/02/2025-15:30:56] [I] BuilderOptimizationLevel: -1
[01/02/2025-15:30:56] [I] MaxTactics: -1
[01/02/2025-15:30:56] [I] Calibration Profile Index: 0
[01/02/2025-15:30:56] [I] Weight Streaming: Disabled
[01/02/2025-15:30:56] [I] Runtime Platform: Same As Build
[01/02/2025-15:30:56] [I] Debug Tensors: 
[01/02/2025-15:30:56] [I] Input(s)s format: fp32:CHW
[01/02/2025-15:30:56] [I] Output(s)s format: fp32:CHW
[01/02/2025-15:30:56] [I] Input build shapes: model
[01/02/2025-15:30:56] [I] Input calibration shapes: model
[01/02/2025-15:30:56] [I] === System Options ===
[01/02/2025-15:30:56] [I] Device: 0
[01/02/2025-15:30:56] [I] DLACore: 
[01/02/2025-15:30:56] [I] Plugins:
[01/02/2025-15:30:56] [I] setPluginsToSerialize:
[01/02/2025-15:30:56] [I] dynamicPlugins:
[01/02/2025-15:30:56] [I] ignoreParsedPluginLibs: 0
[01/02/2025-15:30:56] [I] 
[01/02/2025-15:30:56] [I] === Inference Options ===
[01/02/2025-15:30:56] [I] Batch: Explicit
[01/02/2025-15:30:56] [I] Input inference shapes: model
[01/02/2025-15:30:56] [I] Iterations: 10
[01/02/2025-15:30:56] [I] Duration: 3s (+ 200ms warm up)
[01/02/2025-15:30:56] [I] Sleep time: 0ms
[01/02/2025-15:30:56] [I] Idle time: 0ms
[01/02/2025-15:30:56] [I] Inference Streams: 1
[01/02/2025-15:30:56] [I] ExposeDMA: Disabled
[01/02/2025-15:30:56] [I] Data transfers: Enabled
[01/02/2025-15:30:56] [I] Spin-wait: Disabled
[01/02/2025-15:30:56] [I] Multithreading: Disabled
[01/02/2025-15:30:56] [I] CUDA Graph: Disabled
[01/02/2025-15:30:56] [I] Separate profiling: Disabled
[01/02/2025-15:30:56] [I] Time Deserialize: Disabled
[01/02/2025-15:30:56] [I] Time Refit: Disabled
[01/02/2025-15:30:56] [I] NVTX verbosity: 0
[01/02/2025-15:30:56] [I] Persistent Cache Ratio: 0
[01/02/2025-15:30:56] [I] Optimization Profile Index: 0
[01/02/2025-15:30:56] [I] Weight Streaming Budget: 100.000000%
[01/02/2025-15:30:56] [I] Inputs:
[01/02/2025-15:30:56] [I] Debug Tensor Save Destinations:
[01/02/2025-15:30:56] [I] === Reporting Options ===
[01/02/2025-15:30:56] [I] Verbose: Enabled
[01/02/2025-15:30:56] [I] Averages: 10 inferences
[01/02/2025-15:30:56] [I] Percentiles: 90,95,99
[01/02/2025-15:30:56] [I] Dump refittable layers:Disabled
[01/02/2025-15:30:56] [I] Dump output: Disabled
[01/02/2025-15:30:56] [I] Profile: Disabled
[01/02/2025-15:30:56] [I] Export timing to JSON file: 
[01/02/2025-15:30:56] [I] Export output to JSON file: 
[01/02/2025-15:30:56] [I] Export profile to JSON file: 
[01/02/2025-15:30:56] [I] 
[01/02/2025-15:30:56] [I] === Device Information ===
[01/02/2025-15:30:56] [I] Available Devices: 
[01/02/2025-15:30:56] [I]   Device 0: "NVIDIA A40" UUID: GPU-6bac0144-7e2a-fedf-d578-bbb04062a8dd
[01/02/2025-15:30:56] [I]   Device 1: "NVIDIA A40" UUID: GPU-35fc8ec9-442b-c030-29e6-fac81c22e0ca
[01/02/2025-15:30:56] [I] Selected Device: NVIDIA A40
[01/02/2025-15:30:56] [I] Selected Device ID: 0
[01/02/2025-15:30:56] [I] Selected Device UUID: GPU-6bac0144-7e2a-fedf-d578-bbb04062a8dd
[01/02/2025-15:30:56] [I] Compute Capability: 8.6
[01/02/2025-15:30:56] [I] SMs: 84
[01/02/2025-15:30:56] [I] Device Global Memory: 45416 MiB
[01/02/2025-15:30:56] [I] Shared Memory per SM: 100 KiB
[01/02/2025-15:30:56] [I] Memory Bus Width: 384 bits (ECC enabled)
[01/02/2025-15:30:56] [I] Application Compute Clock Rate: 1.74 GHz
[01/02/2025-15:30:56] [I] Application Memory Clock Rate: 7.251 GHz
[01/02/2025-15:30:56] [I] 
[01/02/2025-15:30:56] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[01/02/2025-15:30:56] [I] 
[01/02/2025-15:30:56] [I] TensorRT version: 10.6.0
[01/02/2025-15:30:56] [I] Loading standard plugins
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 2
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 3
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ModulatedDeformConv2d version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Proposal version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 2
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ScatterElements version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ScatterElements version 2
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::Split version 1
[01/02/2025-15:30:56] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[01/02/2025-15:30:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 20, GPU 268 (MiB)
[01/02/2025-15:30:57] [V] [TRT] Trying to load shared library libnvinfer_builder_resource.so.10.6.0
[01/02/2025-15:30:57] [V] [TRT] Loaded shared library libnvinfer_builder_resource.so.10.6.0
[01/02/2025-15:30:59] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +2166, GPU +406, now: CPU 2342, GPU 674 (MiB)
[01/02/2025-15:30:59] [V] [TRT] CUDA lazy loading is enabled.
[01/02/2025-15:30:59] [I] Start parsing network model.
[01/02/2025-15:30:59] [I] [TRT] ----------------------------------------------------------------
[01/02/2025-15:30:59] [I] [TRT] Input filename:   test_outpad.onnx
[01/02/2025-15:30:59] [I] [TRT] ONNX IR version:  0.0.7
[01/02/2025-15:30:59] [I] [TRT] Opset version:    13
[01/02/2025-15:30:59] [I] [TRT] Producer name:    pytorch
[01/02/2025-15:30:59] [I] [TRT] Producer version: 2.6.0
[01/02/2025-15:30:59] [I] [TRT] Domain:           
[01/02/2025-15:30:59] [I] [TRT] Model version:    0
[01/02/2025-15:30:59] [I] [TRT] Doc string:       
[01/02/2025-15:30:59] [I] [TRT] ----------------------------------------------------------------
[01/02/2025-15:30:59] [V] [TRT] Adding network input: onnx::ConvTranspose_0 with dtype: float32, dimensions: (1, 3, 16, 16)
[01/02/2025-15:30:59] [V] [TRT] Registering tensor: onnx::ConvTranspose_0 for ONNX tensor: onnx::ConvTranspose_0
[01/02/2025-15:30:59] [V] [TRT] Importing initializer: deconv.weight
[01/02/2025-15:30:59] [V] [TRT] Importing initializer: deconv.bias
[01/02/2025-15:30:59] [V] [TRT] Static check for parsing node: /deconv/ConvTranspose [ConvTranspose]
[01/02/2025-15:30:59] [V] [TRT] Parsing node: /deconv/ConvTranspose [ConvTranspose]
[01/02/2025-15:30:59] [V] [TRT] Searching for input: onnx::ConvTranspose_0
[01/02/2025-15:30:59] [V] [TRT] Searching for input: deconv.weight
[01/02/2025-15:30:59] [V] [TRT] Searching for input: deconv.bias
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose [ConvTranspose] inputs: [onnx::ConvTranspose_0 -> (1, 3, 16, 16)[FLOAT]], [deconv.weight -> (3, 3, 3, 3)[FLOAT]], [deconv.bias -> (3)[FLOAT]], 
[01/02/2025-15:30:59] [V] [TRT] Running deconvolution with: 
Padding mode: NOTSET
Pre-padding: (1, 1)
Post-padding: (0, 0)
[01/02/2025-15:30:59] [V] [TRT] Registering layer: /deconv/ConvTranspose for ONNX node: /deconv/ConvTranspose
[01/02/2025-15:30:59] [V] [TRT] Registering tensor: 3_0 for ONNX tensor: 3
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose [ConvTranspose] outputs: [3 -> (1, 3, 32, 32)[FLOAT]], 
[01/02/2025-15:30:59] [V] [TRT] Marking 3_0 as output: 3
[01/02/2025-15:30:59] [I] Finished parsing network model. Parse time: 0.00115429
[01/02/2025-15:30:59] [V] [TRT] Original: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After dead-layer removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] Graph construction completed in 0.000131549 seconds.
[01/02/2025-15:30:59] [V] [TRT] After adding DebugOutput nodes: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After Myelin optimization: 1 layers
[01/02/2025-15:30:59] [V] [TRT] Applying ScaleNodes fusions.
[01/02/2025-15:30:59] [V] [TRT] After scale fusion: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After dupe layer removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After final dead-layer removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After tensor merging: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After vertical fusions: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After dupe layer removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After final dead-layer removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After tensor merging: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After slice removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] After concat removal: 1 layers
[01/02/2025-15:30:59] [V] [TRT] Trying to split Reshape and strided tensor
[01/02/2025-15:30:59] [V] [TRT] Graph optimization time: 0.000249508 seconds.
[01/02/2025-15:30:59] [V] [TRT] Building graph using backend strategy 2
[01/02/2025-15:30:59] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/02/2025-15:30:59] [V] [TRT] Constructing optimization profile number 0 [1/1].
[01/02/2025-15:30:59] [V] [TRT] Applying generic optimizations to the graph for inference.
[01/02/2025-15:30:59] [V] [TRT] Reserving memory for host IO tensors. Host: 0 bytes
[01/02/2025-15:30:59] [V] [TRT] =============== Computing costs for /deconv/ConvTranspose
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning format combination: Float(768,256,16,1) -> Float(3072,1024,32,1) ***************
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose: 27 available tactics, 3 unparsable, 12 pruned, 15 remaining after tactic pruning.
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: /deconv/ConvTranspose (CaskGemmDeconvolution[0x80000037])
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm50_xmma_cublas_smallN_NN_f32f32_f32_f32_nn_n_thread_count256threads_per_row16b_elems_per_thread2bias_or_reluFalse numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020241 Time: 0.00985966
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm50_xmma_cublas_smallN_NN_f32f32_f32_f32_nn_n_thread_count256threads_per_row8b_elems_per_thread2bias_or_reluFalse numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020690 Time: 0.00956709
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm50_xmma_cublas_smallN_NN_f32f32_f32_f32_nn_n_thread_count128threads_per_row2b_elems_per_thread4bias_or_reluFalse numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x00000000000206c7 Time: 0.00959634
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020a52 Time: 0.00940743
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_32x32_sliced1x4_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x00000000000202ef Time: 0.00932571
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_32x32_sliced1x4_relu_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x000000000002086f Time: 0.00930743
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_tf32f32_f32_nn_n_tilesize32x32x64_stage3_warpsize2x2x1_tensor16x8x8_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x00000000000206d5 Time: 0.0096256
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x64x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x00000000000208f0 Time: 0.00941714
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize64x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020672 Time: 0.00962469
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_64x32_sliced1x4_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x000000000002036b Time: 0.00928914
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_64x32_sliced1x4_relu_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020819 Time: 0.00938972
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_tf32f32_f32_nn_n_tilesize32x64x64_stage4_warpsize2x2x1_tensor16x8x8_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x000000000002081f Time: 0.00949943
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x128x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020651 Time: 0.00934372
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_128x32_sliced1x4_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x0000000000020a9a Time: 0.0100638
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: ampere_sgemm_128x32_sliced1x4_relu_nn_v1 numSplitK: 1 numBuffers: 0 numKernels: 1 Tactic: 0x000000000002041e Time: 0.0105012
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose (CaskGemmDeconvolution[0x80000037]) profiling completed in 0.0894895 seconds. Fastest Tactic: 0x000000000002036b Time: 0.00928914
[01/02/2025-15:30:59] [V] [TRT] Skipping CaskDeconvolution: No valid tactics for /deconv/ConvTranspose
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d])
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm50_xmma_deconv_generic_f32f32_f32_f32_nchwkcrs_nchw Tactic: 0x0f630dccfe13bf53 Time: 10000
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d]) profiling completed in 0.000627884 seconds. Fastest Tactic: 0x0f630dccfe13bf53 Time: 10000
[01/02/2025-15:30:59] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskGemmDeconvolution Tactic: 0x000000000002036b
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning format combination: Float(768,1,48,3) -> Float(3072,1,96,3) ***************
[01/02/2025-15:30:59] [V] [TRT] Skipping CaskDeconvolution: No valid tactics for /deconv/ConvTranspose
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d])
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x128x8_stage3_warpsize2x2x1_g1_ffma_aligna4_alignc4 Tactic: 0x24bd5d7c8284eeec Time: 0.022673
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize32x32x8_stage3_warpsize1x2x1_g1_ffma_aligna4_alignc4 Tactic: 0x14499f757787b157 Time: 0.00630201
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize256x128x8_stage3_warpsize4x2x1_g1_ffma_aligna4_alignc4 Tactic: 0xb1d8242d50afdff0 Time: 0.0285257
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x128x8_stage3_warpsize1x4x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x1a82ab99d94518f2 Time: 0.0100157
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize32x32x8_stage3_warpsize1x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x043c81cea95f3c97 Time: 0.00630301
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x32x8_stage3_warpsize2x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x0af5b8971b78dc1f Time: 0.00824889
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x64x8_stage3_warpsize1x4x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x55c5f197e3b7e8aa Time: 0.00761456
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x64x8_stage3_warpsize2x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0xba791a7991b0361c Time: 0.0103252
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x32x8_stage3_warpsize1x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0xf7c9fe3fcf824969 Time: 0.00777504
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d]) profiling completed in 0.0353847 seconds. Fastest Tactic: 0x14499f757787b157 Time: 0.00630201
[01/02/2025-15:30:59] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskDeconvolutionV2 Tactic: 0x14499f757787b157
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning format combination: Float(256,1:4,16,1) -> Float(1024,1:4,32,1) ***************
[01/02/2025-15:30:59] [V] [TRT] Skipping CaskDeconvolution: No valid tactics for /deconv/ConvTranspose
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d])
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm86_xmma_deconv_implicit_gemm_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x256x32_stage2_warpsize2x4x1_g1_tensor16x8x8 Tactic: 0x4298039fe7d925bf Time: 0.0292901
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x128x8_stage3_warpsize2x2x1_g1_ffma_aligna4_alignc4 Tactic: 0x24bd5d7c8284eeec Time: 0.0225489
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize32x32x8_stage3_warpsize1x2x1_g1_ffma_aligna4_alignc4 Tactic: 0x14499f757787b157 Time: 0.0062245
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize256x128x8_stage3_warpsize4x2x1_g1_ffma_aligna4_alignc4 Tactic: 0xb1d8242d50afdff0 Time: 0.0285013
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x128x8_stage3_warpsize1x4x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x1a82ab99d94518f2 Time: 0.0100044
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x128x16_stage4_warpsize2x2x1_g1_tensor16x8x8_strided Tactic: 0xa17395d1f0a7b52b Time: 0.0109505
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize32x32x8_stage3_warpsize1x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x043c81cea95f3c97 Time: 0.00630221
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x128x16_stage4_warpsize2x2x1_g1_tensor16x8x8 Tactic: 0x6f63be3116a0cf3a Time: 0.0121303
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x32x8_stage3_warpsize2x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x0af5b8971b78dc1f Time: 0.00830553
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x64x8_stage3_warpsize1x4x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0x55c5f197e3b7e8aa Time: 0.00769131
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x64x8_stage3_warpsize2x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0xba791a7991b0361c Time: 0.0102397
[01/02/2025-15:30:59] [V] [TRT] Tactic Name: sm80_xmma_deconv_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x32x8_stage3_warpsize1x2x1_g1_ffma_strided_aligna4_alignc4 Tactic: 0xf7c9fe3fcf824969 Time: 0.00776854
[01/02/2025-15:30:59] [V] [TRT] /deconv/ConvTranspose (CaskDeconvolutionV2[0x8000002d]) profiling completed in 0.0430016 seconds. Fastest Tactic: 0x14499f757787b157 Time: 0.0062245
[01/02/2025-15:30:59] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskDeconvolutionV2 Tactic: 0x14499f757787b157
[01/02/2025-15:30:59] [V] [TRT] =============== Computing reformatting costs for available format set
[01/02/2025-15:30:59] [V] [TRT] =============== Computing reformatting costs: 
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning Reformat: Float(768,256,16,1) -> Float(768,1,48,3) ***************
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(onnx::ConvTranspose_0 -> <out>) (Reformat[0x80000006])
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003e8 Time: 0.00389449
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003ea Time: 0.0100169
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x0000000000000000 Time: 0.00353907
[01/02/2025-15:30:59] [V] [TRT] Optimizer Reformat(onnx::ConvTranspose_0 -> <out>) (Reformat[0x80000006]) profiling completed in 0.0193154 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.00353907
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning Reformat: Float(768,256,16,1) -> Float(256,1:4,16,1) ***************
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(onnx::ConvTranspose_0 -> <out>) (Reformat[0x80000006])
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003e8 Time: 0.00387805
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003ea Time: 0.00999619
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x0000000000000000 Time: 0.00359863
[01/02/2025-15:30:59] [V] [TRT] Optimizer Reformat(onnx::ConvTranspose_0 -> <out>) (Reformat[0x80000006]) profiling completed in 0.00585987 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.00359863
[01/02/2025-15:30:59] [V] [TRT] =============== Computing reformatting costs for available format set
[01/02/2025-15:30:59] [V] [TRT] =============== Computing reformatting costs: 
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning Reformat: Float(3072,1,96,3) -> Float(3072,1024,32,1) ***************
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(<in> -> 3) (Reformat[0x80000006])
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003e8 Time: 0.00376878
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003ea Time: 0.0100952
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x0000000000000000 Time: 0.00358
[01/02/2025-15:30:59] [V] [TRT] Optimizer Reformat(<in> -> 3) (Reformat[0x80000006]) profiling completed in 0.00598178 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.00358
[01/02/2025-15:30:59] [V] [TRT] *************** Autotuning Reformat: Float(1024,1:4,32,1) -> Float(3072,1024,32,1) ***************
[01/02/2025-15:30:59] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(<in> -> 3) (Reformat[0x80000006])
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003e8 Time: 0.00381065
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x00000000000003ea Time: 0.00991817
[01/02/2025-15:30:59] [V] [TRT] Tactic: 0x0000000000000000 Time: 0.00360651
[01/02/2025-15:30:59] [V] [TRT] Optimizer Reformat(<in> -> 3) (Reformat[0x80000006]) profiling completed in 0.00595198 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.00360651
[01/02/2025-15:30:59] [V] [TRT] Formats and tactics selection completed in 0.212726 seconds.
[01/02/2025-15:30:59] [V] [TRT] After reformat layers: 1 layers
[01/02/2025-15:30:59] [V] [TRT] Total number of blocks in pre-optimized block assignment: 1
[01/02/2025-15:30:59] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[01/02/2025-15:30:59] [V] [TRT] Layer: /deconv/ConvTranspose Host Persistent: 4512 bytes Device Persistent: 0 bytes Scratch Memory: 28672 bytes
[01/02/2025-15:30:59] [V] [TRT] Skipped printing memory information for 0 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
[01/02/2025-15:30:59] [I] [TRT] Total Host Persistent Memory: 4512 bytes
[01/02/2025-15:30:59] [I] [TRT] Total Device Persistent Memory: 0 bytes
[01/02/2025-15:30:59] [I] [TRT] Max Scratch Memory: 28672 bytes
[01/02/2025-15:30:59] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 1 steps to complete.
[01/02/2025-15:30:59] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.02405ms to assign 1 blocks to 1 nodes requiring 28672 bytes.
[01/02/2025-15:30:59] [V] [TRT] Total number of blocks in optimized block assignment: 1
[01/02/2025-15:30:59] [I] [TRT] Total Activation Memory: 28672 bytes
[01/02/2025-15:30:59] [I] [TRT] Total Weights Memory: 1036 bytes
[01/02/2025-15:30:59] [V] [TRT] Finalize: /deconv/ConvTranspose Set kernel index: 0
[01/02/2025-15:30:59] [V] [TRT] Total number of generated kernels selected for the engine: 1
[01/02/2025-15:30:59] [V] [TRT] Kernel: 0 CASK_STATIC
[01/02/2025-15:30:59] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/02/2025-15:30:59] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
[01/02/2025-15:30:59] [I] [TRT] Engine generation completed in 0.243262 seconds.
[01/02/2025-15:30:59] [V] [TRT] Engine Layer Information:
Layer(CaskGemmDeconvolution): /deconv/ConvTranspose, Tactic: 0x000000000002036b, onnx::ConvTranspose_0 (Float[1,3,16,16]) -> 3 (Float[1,3,32,32])
[01/02/2025-15:30:59] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1 MiB
[01/02/2025-15:30:59] [V] [TRT] Adding 1 engine(s) to plan file.
[01/02/2025-15:30:59] [V] [TRT] Adding 1 engine weights(s) to plan file.
[01/02/2025-15:30:59] [I] Engine built in 0.246413 sec.
[01/02/2025-15:30:59] [I] Created engine with size: 0.00515366 MiB
[01/02/2025-15:31:00] [I] [TRT] Loaded engine size: 0 MiB
[01/02/2025-15:31:00] [V] [TRT] Deserialization required 271 microseconds.
[01/02/2025-15:31:00] [I] Engine deserialized in 0.00610918 sec.
[01/02/2025-15:31:00] [V] [TRT] Total per-runner device persistent memory is 0
[01/02/2025-15:31:00] [V] [TRT] Total per-runner host persistent memory is 4512
[01/02/2025-15:31:00] [V] [TRT] Allocated device scratch memory of size 28672
[01/02/2025-15:31:00] [V] [TRT] - Runner scratch: 28672 bytes
[01/02/2025-15:31:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[01/02/2025-15:31:00] [V] [TRT] CUDA lazy loading is enabled.
[01/02/2025-15:31:00] [I] Setting persistentCacheLimit to 0 bytes.
[01/02/2025-15:31:00] [I] Created execution context with device memory size: 0.0273438 MiB
[01/02/2025-15:31:00] [I] Using random values for input onnx::ConvTranspose_0
[01/02/2025-15:31:00] [I] Input binding for onnx::ConvTranspose_0 with dimensions 1x3x16x16 is created.
[01/02/2025-15:31:00] [I] Output binding for 3 with dimensions 1x3x32x32 is created.
[01/02/2025-15:31:00] [I] Layer Information:
[01/02/2025-15:31:00] [I] [TRT] The profiling verbosity was set to ProfilingVerbosity::kLAYER_NAMES_ONLY when the engine was built, so only the layer names will be returned. Rebuild the engine with ProfilingVerbosity::kDETAILED to get more verbose layer information.
[01/02/2025-15:31:00] [I] Layers:
/deconv/ConvTranspose

Bindings:
onnx::ConvTranspose_0
3
[01/02/2025-15:31:00] [I] Starting inference
[01/02/2025-15:31:03] [I] Warmup completed 4851 queries over 200 ms
[01/02/2025-15:31:03] [I] Timing trace has 69669 queries over 3.00006 s
[01/02/2025-15:31:03] [I] 
[01/02/2025-15:31:03] [I] === Trace details ===
[01/02/2025-15:31:03] [I] Trace averages of 10 runs:
[01/02/2025-15:31:03] [I] Average on 10 runs - GPU latency: 0.0113556 ms - Host latency: 0.0203979 ms (enqueue 0.0102341 ms)
[01/02/2025-15:31:03] [I] === Performance summary ===
[01/02/2025-15:31:03] [I] Throughput: 23222.6 qps
[01/02/2025-15:31:03] [I] Latency: min = 0.0184937 ms, max = 2.60498 ms, mean = 0.0215183 ms, median = 0.0217285 ms, percentile(90%) = 0.0234375 ms, percentile(95%) = 0.0236816 ms, percentile(99%) = 0.0246582 ms
[01/02/2025-15:31:03] [I] Enqueue Time: min = 0.00976562 ms, max = 0.104248 ms, mean = 0.0110019 ms, median = 0.0109863 ms, percentile(90%) = 0.0114746 ms, percentile(95%) = 0.0115967 ms, percentile(99%) = 0.012207 ms
[01/02/2025-15:31:03] [I] H2D Latency: min = 0.00366211 ms, max = 0.0305176 ms, mean = 0.00459977 ms, median = 0.00454712 ms, percentile(90%) = 0.00476074 ms, percentile(95%) = 0.00488281 ms, percentile(99%) = 0.00537109 ms
[01/02/2025-15:31:03] [I] GPU Compute Time: min = 0.0100098 ms, max = 0.116699 ms, mean = 0.0124901 ms, median = 0.0124512 ms, percentile(90%) = 0.0144043 ms, percentile(95%) = 0.0144043 ms, percentile(99%) = 0.0153809 ms
[01/02/2025-15:31:03] [I] D2H Latency: min = 0.00366211 ms, max = 2.58923 ms, mean = 0.00442818 ms, median = 0.00439453 ms, percentile(90%) = 0.00463867 ms, percentile(95%) = 0.00476074 ms, percentile(99%) = 0.00491333 ms
[01/02/2025-15:31:03] [I] Total Host Walltime: 3.00006 s
[01/02/2025-15:31:03] [I] Total GPU Compute Time: 0.870173 s
[01/02/2025-15:31:03] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[01/02/2025-15:31:03] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[01/02/2025-15:31:03] [W] * GPU compute time is unstable, with coefficient of variance = 12.816%.
[01/02/2025-15:31:03] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[01/02/2025-15:31:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/02/2025-15:31:03] [V] 
[01/02/2025-15:31:03] [V] === Explanations of the performance metrics ===
[01/02/2025-15:31:03] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[01/02/2025-15:31:03] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[01/02/2025-15:31:03] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/02/2025-15:31:03] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/02/2025-15:31:03] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[01/02/2025-15:31:03] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[01/02/2025-15:31:03] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[01/02/2025-15:31:03] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[01/02/2025-15:31:03] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100600] [b18] # ./TensorRT-10.6.0.18/bin/trtexec --onnx=test_outpad.onnx --dumpLayerInfo --verbose
@chohk88
Collaborator Author

chohk88 commented Jan 16, 2025

I’ve implemented handling for output_padding using pre_padding and post_padding in IDeconvolutionLayer. To make it easier to understand, I’ve included examples below that show how it works in both Torch and TensorRT. The key idea is that by setting the right values for pre_padding and post_padding, we can replicate the behavior of Torch’s output_padding in TensorRT. You can check out the linked PR for the full code.

Torch example output

In Torch, the effect of output_padding can be seen as follows:

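import torch
import torch.nn as nn

def test_deconv_1d(input_vals, pad, outpad):
    """Run a 1-D transposed convolution and report its output (or the error raised)."""
    try:
        # Shape the input as (batch=1, channels=1, length)
        x = torch.tensor(input_vals, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
        deconv = nn.ConvTranspose1d(
            in_channels=1,
            out_channels=1,
            kernel_size=2,
            stride=2,
            padding=pad,
            output_padding=outpad,
            bias=False
        )
        # Use an all-ones kernel so each input value is simply repeated in the output
        with torch.no_grad():
            deconv.weight[:] = 1.0
        y = deconv(x)
        return f"pad={pad}, outpad={outpad} => Output: {y.squeeze().tolist()}"
    except RuntimeError as e:
        # Invalid combinations (e.g. output_padding >= stride) raise at call time
        return f"pad={pad}, outpad={outpad} => Error: {e}"

input_data = [1.0, 2.0, 3.0]
# (padding, output_padding) pairs to compare
cases = [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]

for p, outp in cases:
    print(test_deconv_1d(input_data, p, outp))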

For input=[1.0, 2.0, 3.0], kernel_size=2, stride=2:

| pad | output_padding | Output |
| --- | --- | --- |
| 0 | 0 | [1.0, 1.0, 2.0, 2.0, 3.0, 3.0] |
| 1 | 0 | [1.0, 2.0, 2.0, 3.0] |
| 1 | 1 | [1.0, 2.0, 2.0, 3.0, 3.0] |
| 2 | 0 | [2.0, 2.0] |
| 2 | 1 | [2.0, 2.0, 3.0] |
| 2 | 2 | Error: output padding must be smaller than either stride or dilation... |
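
The output lengths in the Torch results can be checked against the output-size formula given in the PyTorch ConvTranspose1d documentation. The sketch below is only a sanity check; the helper name conv_transpose1d_out_len is made up for illustration and is not part of Torch or of the linked PR.

```python
def conv_transpose1d_out_len(l_in, kernel_size, stride,
                             padding=0, output_padding=0, dilation=1):
    """Output length of a 1-D transposed convolution.

    Hypothetical helper; the formula is the one documented for
    torch.nn.ConvTranspose1d.
    """
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# Same setup as the Torch example: 3 input values, kernel_size=2, stride=2
for pad, outpad in [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1)]:
    n = conv_transpose1d_out_len(3, kernel_size=2, stride=2,
                                 padding=pad, output_padding=outpad)
    print(f"pad={pad}, outpad={outpad} => output length {n}")
```

Each length agrees with the number of elements in the corresponding Torch output: 6, 4, 5, 2 and 3.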

TensorRT IDeconvolutionLayer example output

Using TensorRT's IDeconvolutionLayer, we can achieve equivalent behavior by setting pre_padding and post_padding directly.

For example:

| pre_padding | post_padding | Output |
| --- | --- | --- |
| 0 | 0 | [1.0, 1.0, 2.0, 2.0, 3.0, 3.0] |
| 1 | 1 | [1.0, 2.0, 2.0, 3.0] |
| 1 | 0 | [1.0, 2.0, 2.0, 3.0, 3.0] |
| 2 | 2 | [2.0, 2.0] |
| 2 | 1 | [2.0, 2.0, 3.0] |
| 2 | 0 | [2.0, 2.0, 3.0, 3.0] |

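As seen above, the TensorRT output matches Torch when pre_padding is set to Torch's padding and post_padding is set to padding - output_padding. The last row (pre_padding=2, post_padding=0) would correspond to padding=2, output_padding=2, which Torch rejects because output_padding must be smaller than the stride or dilation.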
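
To make the correspondence between the two tables concrete, here is a small pure-Python emulation. It does not call TensorRT: it assumes that, for a deconvolution, pre_padding and post_padding trim elements from the start and end of the unpadded output, which is what the table above shows for this input. The function names are hypothetical, and the mapping only covers the case output_padding <= padding. See the linked PR for the actual converter code.

```python
def full_deconv1d(x, kernel_size=2, stride=2):
    """Unpadded 1-D transposed convolution with an all-ones kernel and no bias."""
    out = [0.0] * ((len(x) - 1) * stride + kernel_size)
    for i, v in enumerate(x):
        for k in range(kernel_size):
            out[i * stride + k] += v
    return out

def crop(full, pre_padding, post_padding):
    """Trim pre_padding elements from the front and post_padding from the back."""
    return full[pre_padding:len(full) - post_padding]

def torch_to_trt_padding(padding, output_padding):
    """Mapping read off the tables (assumption: output_padding <= padding)."""
    if output_padding > padding:
        raise ValueError("this sketch does not cover output_padding > padding")
    return padding, padding - output_padding

full = full_deconv1d([1.0, 2.0, 3.0])
for pad, outpad in [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1)]:
    pre, post = torch_to_trt_padding(pad, outpad)
    print(f"pad={pad}, outpad={outpad} -> pre={pre}, post={post}: {crop(full, pre, post)}")
```

Each printed list matches the Torch output for the same (pad, output_padding) pair.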
