Skip to content

Lab-of-AI-and-Robotics/LSeg_Image_Encoder_TensorRT

Repository files navigation

Real-time LSeg Image Encoder – ONNX & TensorRT

Author: Joonyeol Choi

Support: Dr. Jihoon Moon

End‑to‑end pipeline for converting the LSeg image encoder to ONNX / TensorRT, benchmarking PyTorch ↔ TRT speed, and verifying numerical fidelity.

This project supports two types of LSeg image encoder backbones, allowing experimental comparison of their performance:

  • ViT-L/16 (non-ZS)
    • Conversion script: conversion/model_to_onnx.py
    • Example ONNX filename: lseg_img_enc_vit_demo_e200.onnx
  • ResNet101 (ZS-Variant)
    • Conversion script: conversion/model_to_onnx_zs.py
    • Example ONNX filename: lseg_img_enc_rn101_fss_rn101.onnx

You can subsequently use the TRT conversion script (conversion/onnx_to_trt.py) to generate TensorRT engines and conduct comparison experiments.


0. TL;DR  (run everything)

# (1) install python deps
pip install -r requirements.txt  # CUDA / OpenCV must already be available

# (2) build both C++ projects in one shot
make             # or   make -j12

# (3) optional – run the latency benchmark
python3 inferenceTimeTester.py  \
  --weights models/weights/demo_e200.ckpt \
  --img_sizes 260 390 520 650 780 910

# (4) run the full feature‑comparison pipeline
bash python_trt_comp/run_feature_comparison.sh

  • make clean → removes all CPP_Project/**/build directories + temporary CMake artefacts.
  • The root‑level Makefile is just a thin wrapper that cmake --build’s each sub‑directory – it does not introduce any extra dependencies.

1. Installation

  1. System Package
sudo apt update && sudo apt install -y \
    python3-pip python3-dev \
    libopencv-dev \
    libprotobuf-dev protobuf-compiler \
    libtinfo5 \
    libopenmpi-dev \
    cuda-toolkit-##  # CUDA Version : At least Minimum Requirement for TensorRT 10.9
  1. Python ≥3.8   pip install -r requirements.txt
  2. CUDA + TensorRT 10.9 already installed (the repo never calls the TRT builder directly – it simply links against the headers/lib).
  3. Optional but recommended: opencv-dev for the C++ extractor.

2. Building

2‑a. One‑shot build (recommended)

# from project root
make -j$(nproc)        # builds …
                      #   • CPP_Project/Inference_Time_Tester
                      #   • CPP_Project/Feature_Extractor

The helper Makefile simply iterates through every CPP_Project/*/CMakeLists.txt, wipes the old build/ directory, configures with cmake -S . -B build, then invokes the native generator.

2‑b. Per‑project build (legacy)

cd CPP_Project/Inference_Time_Tester && cmake -B build -S . && cmake --build build -j
cd CPP_Project/Feature_Extractor     && cmake -B build -S . && cmake --build build -j

3. Project Structure

LSeg_Image_Encoder_TensorRT/
│
├── CPP_Project/                       # C++ projects
│   ├── Feature_Extractor/             # Feature extractor project
│   │   ├── CMakeLists.txt             # CMake configuration
│   │   └── main.cpp                   # Feature extractor main code
│   ├── Inference_Time_Tester/         # Inference benchmarking project
│   │   ├── CMakeLists.txt             # CMake configuration
│   │   └── main.cpp                   # Benchmarking main code
│   └── third_party/                   # Third-party libraries
│       └── cnpy/                      # CNpy submodule for numpy I/O
│
├── Visual_Demo/                       # Demo scripts and results
│   ├── demo.sh                        # demo.sh script
│   ├── demo.py                        # Python wrapper for demo.py
│   ├── demo_wordFree.sh               # demo_wordFree.sh script
│   ├── demo_wordFree.py               # Python wrapper for demo_wordFree.py
│   └── images/                        # Visualization results and input images
│       ├── Dog_grass_demo.png         # Example segmentation result
│       ├── Dog_grass_wordFree.png     # Example word-free segmentation result
│       └── dog_grass.jpeg             # Example input image
│
├── models/
│   ├── weights/
│   │   ├── ViT/
│   │   │   ├── demo_e200.ckpt         # ViT-L/16 CLIP checkpoint
│   │   │   └── fss_l16.ckpt           # FSS-trained ViT model
│   │   └── Resnet/
│   │       ├── coco_fold1.ckpt        # ResNet-ZS custom model
│   │       ├── fss_rn101.ckpt         # ResNet-ZS FSS variant
│   │       └── pascal_fold1.ckpt      # ResNet-ZS custom model
│   ├── onnx_engines/
│   │   ├── lseg_img_enc_vit_demo_e200.onnx
│   │   ├── lseg_img_enc_vit_fss_l16.onnx
│   │   ├── lseg_img_enc_rn101_coco_fold1.onnx
│   │   ├── lseg_img_enc_rn101_fss_rn101.onnx
│   │   └── lseg_img_enc_rn101_pascal_fold1.onnx
│   └── trt_engines/
│       └── <...>.trt                  # Auto-generated TensorRT engines
│
├── modules/                           # LSeg model-related source code
│   ├── lseg_module.py                 # LSegModule: wraps image encoder + head
│   ├── lseg_full.py                   # LSegFull: complete network (encoder + decoder)
│   ├── models/                        # Internal submodules
│       ├── lseg_blocks.py             # RefineNet blocks and skip-connections
│       ├── lseg_net.py                # Network assembly utilities
│       └── lseg_vit.py                # CLIP ViT layer partitioning and feature extraction
│
├── conversion/                        # Model conversion scripts
│   ├── model_to_onnx.py               # ViT‐L/16 → ONNX
│   ├── model_to_onnx_zs.py            # ResNet101-ZS → ONNX
│   └── onnx_to_trt.py                 # ONNX → TensorRT (common)
│
├── python_trt_comp/                   # Python-based feature comparison scripts
│   ├── compare_features.py            # Compare feature maps (Cosine/L2)
│   ├── compare_inputs.py              # Compare input tensors
│   ├── model_output.py                # PyTorch feature extraction script
│   └── run_feature_comparison.sh      # Run the complete comparison pipeline
│
├── inferenceTimeTester.py             # Main inference benchmarking script (root directory)
│
├── requirements.txt                   # Python package list
├── Makefile                           # One-shot builder wrapper
└── README.md                          # This file


4. Model Download

Download weights from the official LSeg repository: https://github.com/isl-org/lang-seg

pip install gdown
# Main ViT-L/16 model (demo_e200.ckpt)
gdown 'https://drive.google.com/uc?id=1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7'

# FSS-based models
# fss_rn101.ckpt (ResNet101)
gdown 'https://drive.google.com/uc?id=1UIj49Wp1mAopPub5M6O4WW-Z79VB1bhw'
# fss_l16.ckpt (ViT-L/16)
gdown 'https://drive.google.com/uc?id=1Nplkc_JsHIS55d--K2vonOOC3HrppzYy'

Save the downloaded checkpoints under models/weights/.


5. Building ONNX & TensorRT Engines

5-a. ONNX Model Conversion

  • ViT Backbone

      python3 conversion/model_to_onnx.py \
        --weights models/weights/ViT/demo_e200.ckpt
    • --weights: Checkpoint address

    models/onnx_engines/lseg_img_enc_vit_demo_e200.onnx

  • ResNet-ZS Backbone

      python3 conversion/model_to_onnx_zs.py \
        --weights models/weights/Resnet/fss_rn101.ckpt
    • --weights: Checkpoint address

    models/onnx_engines/lseg_img_enc_rn101_fss_rn101.onnx

5-b. TensorRT Engine Conversion

Caution: The TensorRT conversion depends on your GPU and environment. Therefore, it must be performed on the target device.

python3 conversion/onnx_to_trt.py \
  --onnx models/onnx_engines/<base>.onnx \
  --workspace 1073741824 \
  --fp16 \
  --sparse \
  --disable-timing-cache \
  --gpu-fallback \
  --debug
  • Both ViT and ResNet-ZS commonly uses onnx_to_trt.py.

  • Generated engines will be saved in models/trt_engines/.

TensorRT Options

Option Type Default Description
--onnx <PATH> required Path to input ONNX file
--workspace <BYTE> integer 1<<29 Builder workspace memory in bytes
--fp16 / --no-fp16 flag true Enable or disable FP16 precision
--sparse / --no-sparse flag true Enable or disable sparse weights
--disable-timing-cache flag false Disable timing cache (↑ stability, ↓ speed)
--gpu-fallback flag false Allow GPU fallback in INT8 mode
--debug flag false Enable debug logging

Engine filename auto-generation rule:
base__<option1>_<option2>_..._<wsXXMiB>.trt


6. Latency Benchmark

Run inference/inferenceTimeTester.py to benchmark the latency of PyTorch, ONNX, TensorRT.

python3 inferenceTimeTester.py \
  --weights_dir models/weights \
  --img_sizes 256 320 384 480 640 768 1024 \
  --iterations 1000 \
  --trt_fp16 --trt_sparse --trt_no_tc --trt_gpu_fb --trt_debug \
  --trt_workspace 1073741824
  • --weights_dir models/weights

    • ViT non-ZS Model : .ckpt in the ViT/

    • ResNet-ZS Model : .ckpt in the Resnet/

  • --img_sizes: List of input sizes for benchmarking
  • --iterations: Number of iterations
  • --trt_*: TRT Build Options (Automatically apply to ONNX→TRT)

Script Behavior:

  1. Automatically generates ONNX file if missing.
  2. Automatically generates TensorRT engine if missing.
  3. Performs inference benchmarking in the order: PyTorch → ONNX → TRT.

Example Result:

  • Results summarized by Backbone, Checkpoint, and Size in Avg(ms) ± Std(ms) table format:
    [RESULT] PyTorch Avg: 12.345 ms ± 0.123 ms
    [RESULT] ONNX   Avg: 10.567 ms ± 0.098 ms
    [RESULT] TRT    Avg:  5.432 ms ± 0.045 ms
    

6‑a. Test rig

AMD Ryzen 7 9700X  | 8C / 16T @ 5.0 GHz
NVIDIA RTX 4090    | 24 GB (Ada, 550 W limit)
64 GB DDR5‑6000    | dual‑rank
TensorRT 10.9 + CUDA 12.2, PyTorch 2.3 (cu118)
Ubuntu 22.04 LTS   | Linux 6.5

Hardware script (hardware_spec.sh) dumps the table automatically.

6‑b. Results

inferenceTimeTester.py --iterations 1000

ResNet101-ZS

Size PyTorch ms ± TRT-Python ms ± TRT-C++ ms ±
256 3.73 0.11 3.23 0.05 1.60 0.15
320 4.33 0.32 4.27 0.04 1.71 0.15
384 5.23 0.46 5.60 0.04 1.92 0.15
480 6.80 0.63 8.34 0.18 2.49 0.31
640 10.93 1.02 14.54 0.23 4.12 0.28
768 15.99 1.28 21.16 0.22 6.17 0.36
1024 27.23 2.32 37.34 0.32 10.49 0.34

ViT-L/16 (non-ZS)

Size PyTorch ms ± TRT-Python ms ± TRT-C++ ms ±
256 10.57 1.09 5.55 0.12 3.89 0.25
320 14.03 1.50 6.87 0.15 4.60 0.30
384 20.41 1.93 8.55 0.35 4.71 0.39
480 28.30 2.58 11.63 0.29 5.78 0.24
640 57.76 4.78 21.70 0.34 10.44 0.46
768 79.41 6.04 31.13 1.84 15.77 0.60
1024 173.31 12.30 58.65 1.42 30.85 0.66

Observations

[TensorRT Optimization]

  • TensorRT ( Python API ) already yields a 2 – 3× speed‑up over eager PyTorch.
  • The minimalist C++ runner shaves another ~40 % latency, dominated by
    • avoiding pycuda / DLPack marshalling overheads;
    • pre‑parsing I/O tensor indices at start‑up.
  • Slope ≈ O(N²) w.r.t spatial resolution (expected for ViT windowed attention).

[Backbone Image Encoder]

  • ResNet-ZS vs ViT (PyTorch eager)
    • ResNet-ZS is ~2.8× faster at 256² (3.7 ms vs 10.6 ms) and the gap widens to ~6.4× at 1024² (27.2 ms vs 173.3 ms).
  • ResNet-ZS vs ViT (TRT-Python)
    • Speed-up is milder (≈1.3–1.5×), e.g. 3.2 ms vs 5.6 ms at 256², and 37.3 ms vs 58.6 ms at 1024².
  • ResNet-ZS vs ViT (TRT-C++)
    • C++ runner further reduces latency by ~35–40 %; ResNet-ZS: 1.6 ms→ vs ViT: 3.9 ms at 256².
  • Overall
    • ResNet-ZS offers much lower absolute latency across all APIs, while ViT’s heavier computation makes its acceleration benefits more dramatic under TensorRT.

7. Demo Scripts

Visual_Demo/demo.sh

This script performs segmentation on a given image using the ONNX model and visualizes the results.

# Example usage (run from root directory)
python3 Visual_Demo/demo.py --image Visual_Demo/images/dog_grass.jpeg \
                            --labels "dog, grass, other" \
                            --onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
                            --size 384
  • --image: Path to input image
  • --labels: Comma-separated label list (e.g., "cat, sky, building")
  • --onnx: Path to ONNX model file
  • --size: Input size for the model (HxW)

Internally, the script calls demo.py, displaying the original image on the left and the segmentation result on the right.

Visual_Demo/demo_wordFree.sh

This script performs pixel-level classification using the full CLIP vocabulary and prints the identified words to the console while visualizing the results.

# Example usage (run from root directory)
python3 Visual_Demo/demo_wordFree.py --image Visual_Demo/images/dog_grass.jpeg \
                                     --onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
                                     --size 384
  • --image: Path to input image
  • --onnx: Path to ONNX model file
  • --size: Input size for the model (HxW)

Internally, the script calls demo_wordFree.py, selecting the most similar token from the full CLIP vocabulary for each pixel and visualizing the results while printing the identified words.

Visual Results

Below are example results saved in the Visual_demo/images/ directory:

Segmentation (demo.py) Word-free (demo_wordFree.py)
Dog Grass Segmentation Dog Grass WordFree

8 Feature-map Extraction & Comparison

This chapter validates that our FP16, sparse-kernel TensorRT engines remain numerically faithful to the original PyTorch checkpoints and that semantic relationships between feature maps are preserved across back-ends and weights.

Script / Binary Role Runtime
python_trt_comp/model_output.py Load an LSeg checkpoint, drop the decoder, run the encoder only, dump a (B, 512, H/2, W/2) feature tensor as *.npy. PyTorch + CUDA
CPP_Project/Feature_Extractor/build/trt_feature_extractor Deserialise the dynamic-shape TensorRT engine, feed a BGR image, emit an identical tensor. C++ / TensorRT
python_trt_comp/compare_features.py Flatten PyTorch vs TensorRT tensors and report cosine similarity + L2 norm. Python (CPU)
python_trt_comp/run_feature_comparison.sh Glue script that loops over images × checkpoints × resolutions. Bash
bash python_trt_comp/run_feature_comparison.sh
# ➜ results appear under outputs/ and as console logs

8-a Numerical Fidelity — PyTorch vs TensorRT

(RTX 4090 · TensorRT 10.9 · FP16 engines with sparse weights)

Backbone Weight µ Cosine ↑ σ Cosine µ L2 ↓ σ L2 min L2 / max L2
ResNet-50 coco_fold1 1.0027 0.0018 1.58 0.68 0.70 / 2.76
fss_rn101 1.0029 0.0023 6.51 2.97 2.77 / 12.23
pascal_fold1 1.0020 0.0014 2.93 0.97 1.60 / 5.14
ViT-B/16 demo_e200 1.0019 0.0014 2.16 1.80 0.38 / 5.66
fss_l16 1.0037 0.0021 2.79 1.00 1.79 / 4.96

Interpretation

  1. Cosine similarity is essentially unity (≥ 0.999) for all 40 image–size combinations we tested, meaning < 0.2 % angular error after FP16 quantisation, structural sparsity, kernel fusion and Winograd re-ordering.

  2. The ResNet coco_fold1 model gives the tightest L2 spread (median ≈ 1.5); fss_rn101 is deliberately trained on few-shot masks and therefore exhibits higher magnitude feature activations, which inflates L2 while leaving angle intact.

  3. ViT-based engines track PyTorch within ±0.002 cosine / ±0.05 σ — negligible for retrieval or segmentation tasks.


8-b Cross-Tag Feature-Map Similarity

(ade20k tag vs fss tag, averaged over cat, cat2, cat3)

Backbone Size (px) Cosine (PT) Cosine (TRT) Δ L2 (PT) L2 (TRT) Δ
ResNet-50 480 –0.0403 –0.0411 8 e-4 318.1 318.4 0.3
320 –0.0189 –0.0194 5 e-4 250.7 250.9 0.2
ViT-B/16 480 –0.0257 –0.0258 1 e-4 343.6 343.7 0.1
320 –0.0041 –0.0047 6 e-4 226.7 226.8 0.1

Interpretation
Negative cosine values confirm that the aggregate embeddings for ade20k and fss are near-orthogonal, signalling that the two label sets live in distinctly separated sub-spaces.
TensorRT reproduces PyTorch within |Δ cos| ≤ 0.0008 and |Δ L2| ≤ 0.3, well below intra-dataset variance.
ResNet shows slightly stronger orthogonality (–0.04 vs –0.026) because its convolutional filters are less text-conditioned than ViT’s global token mixer.


8-c ViT vs ResNet — Backbone-level Trends

Metric ViT-B/16 (demo + fss) ResNet-50 (3 weights) Observation
Mean Cosine PT ↔ TRT 1.0028 1.0025 Both back-ends < 0.2 % angular drift.
Worst-case L2 5.66 12.23 ResNet sees higher L2 due to fss_rn101’s large activations.
Cross-Tag Cosine –0.025 (±0.010) –0.033 (±0.012) ResNet features are slightly more orthogonal across tags.
Encoder FLOPs 12.4 G 9.7 G ViT costs more but benefits from parallel friendly GEMMs.
TensorRT FPS (224², BS = 1) 1030 1180 ResNet leverages sparsity better; ViT still exceeds 1 kfps.

Take-away — Choosing between ViT and ResNet is workload-dependent:
ViT delivers denser, more isotropic language–vision embeddings ideal for prompt-tuning, whereas ResNet provides leaner, more localized features that compress well and run faster on sparse tensors.


8-d Visual Inspection

PyTorch vs TensorRT (ViT demo_e200) Size × Weight Heat-Map Matrix

Left: every pixel overlay shows absolute difference < 0.015, matching the tabular metrics.
Right: heat-maps confirm that spatial saliency is preserved across (256–480 px) and all five checkpoints — brighter zones overlap exactly between PyTorch and TensorRT.


Concluding Remarks for Section 8

The combined quantitative (cosine, L2) and qualitative (heat-map) analyses demonstrate that our FP16, sparse TensorRT pipelines replicate PyTorch encoders with sub-percent error, regardless of backbone, resolution or training corpus. This guarantees drop-in replacement for downstream tasks such as zero-shot segmentation, CLIP-style retrieval and long-horizon robot planning.


9. Additional Notes

  • Using ONNX opset_version=14
  • Supports dynamic input size: Refer to torch.onnx.export(... dynamic_axes=...) setting
  • Requires onnxruntime-gpu for GPU benchmarking:
    pip install onnxruntime-gpu
  • Verify CUDAExecutionProvider:
import onnxruntime as ort
print(ort.get_available_providers())

10. License

MIT – see LICENSE for details.


11. Acknowledgements

Portions of the code are adapted from ISL‑org / lang‑seg (Apache‑2.0) and NVIDIA TensorRT samples.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •