Real-time LSeg Image Encoder – ONNX & TensorRT

Author: Joonyeol Choi

Support: Dr. Jihoon Moon

✨ End‑to‑end pipeline for converting the LSeg image encoder to ONNX / TensorRT, benchmarking PyTorch ↔ TRT speed, and verifying numerical fidelity.

This project supports two types of LSeg image encoder backbones, allowing experimental comparison of their performance:

ViT-L/16 (non-ZS)
- Conversion script: conversion/model_to_onnx.py
- Example ONNX filename: lseg_img_enc_vit_demo_e200.onnx
ResNet101 (ZS-Variant)
- Conversion script: conversion/model_to_onnx_zs.py
- Example ONNX filename: lseg_img_enc_rn101_fss_rn101.onnx

You can subsequently use the TRT conversion script (conversion/onnx_to_trt.py) to generate TensorRT engines and conduct comparison experiments.

0. TL;DR (run everything)

# (1) install python deps
pip install -r requirements.txt  # CUDA / OpenCV must already be available

# (2) build both C++ projects in one shot
make             # or   make -j12

# (3) optional – run the latency benchmark
python3 inferenceTimeTester.py  \
  --weights models/weights/demo_e200.ckpt \
  --img_sizes 260 390 520 650 780 910

# (4) run the full feature‑comparison pipeline
bash python_trt_comp/run_feature_comparison.sh

make clean → removes all CPP_Project/**/build directories + temporary CMake artefacts.
The root‑level Makefile is just a thin wrapper that cmake --build’s each sub‑directory – it does not introduce any extra dependencies.

1. Installation

System Package

sudo apt update && sudo apt install -y \
    python3-pip python3-dev \
    libopencv-dev \
    libprotobuf-dev protobuf-compiler \
    libtinfo5 \
    libopenmpi-dev \
    cuda-toolkit-##  # CUDA Version : At least Minimum Requirement for TensorRT 10.9

Python ≥3.8 pip install -r requirements.txt
CUDA + TensorRT 10.9 already installed (the repo never calls the TRT builder directly – it simply links against the headers/lib).
Optional but recommended: opencv-dev for the C++ extractor.

2. Building

2‑a. One‑shot build (recommended)

# from project root
make -j$(nproc)        # builds …
                      #   • CPP_Project/Inference_Time_Tester
                      #   • CPP_Project/Feature_Extractor

The helper Makefile simply iterates through every CPP_Project/*/CMakeLists.txt, wipes the old build/ directory, configures with cmake -S . -B build, then invokes the native generator.

2‑b. Per‑project build (legacy)

cd CPP_Project/Inference_Time_Tester && cmake -B build -S . && cmake --build build -j
cd CPP_Project/Feature_Extractor     && cmake -B build -S . && cmake --build build -j

3. Project Structure

LSeg_Image_Encoder_TensorRT/
│
├── CPP_Project/                       # C++ projects
│   ├── Feature_Extractor/             # Feature extractor project
│   │   ├── CMakeLists.txt             # CMake configuration
│   │   └── main.cpp                   # Feature extractor main code
│   ├── Inference_Time_Tester/         # Inference benchmarking project
│   │   ├── CMakeLists.txt             # CMake configuration
│   │   └── main.cpp                   # Benchmarking main code
│   └── third_party/                   # Third-party libraries
│       └── cnpy/                      # CNpy submodule for numpy I/O
│
├── Visual_Demo/                       # Demo scripts and results
│   ├── demo.sh                        # demo.sh script
│   ├── demo.py                        # Python wrapper for demo.py
│   ├── demo_wordFree.sh               # demo_wordFree.sh script
│   ├── demo_wordFree.py               # Python wrapper for demo_wordFree.py
│   └── images/                        # Visualization results and input images
│       ├── Dog_grass_demo.png         # Example segmentation result
│       ├── Dog_grass_wordFree.png     # Example word-free segmentation result
│       └── dog_grass.jpeg             # Example input image
│
├── models/
│   ├── weights/
│   │   ├── ViT/
│   │   │   ├── demo_e200.ckpt         # ViT-L/16 CLIP checkpoint
│   │   │   └── fss_l16.ckpt           # FSS-trained ViT model
│   │   └── Resnet/
│   │       ├── coco_fold1.ckpt        # ResNet-ZS custom model
│   │       ├── fss_rn101.ckpt         # ResNet-ZS FSS variant
│   │       └── pascal_fold1.ckpt      # ResNet-ZS custom model
│   ├── onnx_engines/
│   │   ├── lseg_img_enc_vit_demo_e200.onnx
│   │   ├── lseg_img_enc_vit_fss_l16.onnx
│   │   ├── lseg_img_enc_rn101_coco_fold1.onnx
│   │   ├── lseg_img_enc_rn101_fss_rn101.onnx
│   │   └── lseg_img_enc_rn101_pascal_fold1.onnx
│   └── trt_engines/
│       └── <...>.trt                  # Auto-generated TensorRT engines
│
├── modules/                           # LSeg model-related source code
│   ├── lseg_module.py                 # LSegModule: wraps image encoder + head
│   ├── lseg_full.py                   # LSegFull: complete network (encoder + decoder)
│   ├── models/                        # Internal submodules
│       ├── lseg_blocks.py             # RefineNet blocks and skip-connections
│       ├── lseg_net.py                # Network assembly utilities
│       └── lseg_vit.py                # CLIP ViT layer partitioning and feature extraction
│
├── conversion/                        # Model conversion scripts
│   ├── model_to_onnx.py               # ViT‐L/16 → ONNX
│   ├── model_to_onnx_zs.py            # ResNet101-ZS → ONNX
│   └── onnx_to_trt.py                 # ONNX → TensorRT (common)
│
├── python_trt_comp/                   # Python-based feature comparison scripts
│   ├── compare_features.py            # Compare feature maps (Cosine/L2)
│   ├── compare_inputs.py              # Compare input tensors
│   ├── model_output.py                # PyTorch feature extraction script
│   └── run_feature_comparison.sh      # Run the complete comparison pipeline
│
├── inferenceTimeTester.py             # Main inference benchmarking script (root directory)
│
├── requirements.txt                   # Python package list
├── Makefile                           # One-shot builder wrapper
└── README.md                          # This file

4. Model Download

Download weights from the official LSeg repository: https://github.com/isl-org/lang-seg

pip install gdown
# Main ViT-L/16 model (demo_e200.ckpt)
gdown 'https://drive.google.com/uc?id=1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7'

# FSS-based models
# fss_rn101.ckpt (ResNet101)
gdown 'https://drive.google.com/uc?id=1UIj49Wp1mAopPub5M6O4WW-Z79VB1bhw'
# fss_l16.ckpt (ViT-L/16)
gdown 'https://drive.google.com/uc?id=1Nplkc_JsHIS55d--K2vonOOC3HrppzYy'

Save the downloaded checkpoints under models/weights/.

5. Building ONNX & TensorRT Engines

5-a. ONNX Model Conversion

ViT Backbone
```
  python3 conversion/model_to_onnx.py \
    --weights models/weights/ViT/demo_e200.ckpt
```
- --weights: Checkpoint address
→ models/onnx_engines/lseg_img_enc_vit_demo_e200.onnx
ResNet-ZS Backbone
```
  python3 conversion/model_to_onnx_zs.py \
    --weights models/weights/Resnet/fss_rn101.ckpt
```
- --weights: Checkpoint address
→ models/onnx_engines/lseg_img_enc_rn101_fss_rn101.onnx

5-b. TensorRT Engine Conversion

Caution: The TensorRT conversion depends on your GPU and environment. Therefore, it must be performed on the target device.

python3 conversion/onnx_to_trt.py \
  --onnx models/onnx_engines/<base>.onnx \
  --workspace 1073741824 \
  --fp16 \
  --sparse \
  --disable-timing-cache \
  --gpu-fallback \
  --debug

Both ViT and ResNet-ZS commonly uses onnx_to_trt.py.
Generated engines will be saved in models/trt_engines/.

TensorRT Options

Option	Type	Default	Description
`--onnx <PATH>`	required	—	Path to input ONNX file
`--workspace <BYTE>`	integer	`1<<29`	Builder workspace memory in bytes
`--fp16` / `--no-fp16`	flag	true	Enable or disable FP16 precision
`--sparse` / `--no-sparse`	flag	true	Enable or disable sparse weights
`--disable-timing-cache`	flag	false	Disable timing cache (↑ stability, ↓ speed)
`--gpu-fallback`	flag	false	Allow GPU fallback in INT8 mode
`--debug`	flag	false	Enable debug logging

Engine filename auto-generation rule:
base__<option1>_<option2>_..._<wsXXMiB>.trt

6. Latency Benchmark

Run inference/inferenceTimeTester.py to benchmark the latency of PyTorch, ONNX, TensorRT.

python3 inferenceTimeTester.py \
  --weights_dir models/weights \
  --img_sizes 256 320 384 480 640 768 1024 \
  --iterations 1000 \
  --trt_fp16 --trt_sparse --trt_no_tc --trt_gpu_fb --trt_debug \
  --trt_workspace 1073741824

--weights_dir models/weights
- ViT non-ZS Model : .ckpt in the ViT/
- ResNet-ZS Model : .ckpt in the Resnet/

--img_sizes: List of input sizes for benchmarking
--iterations: Number of iterations
--trt_*: TRT Build Options (Automatically apply to ONNX→TRT)

Script Behavior:

Automatically generates ONNX file if missing.
Automatically generates TensorRT engine if missing.
Performs inference benchmarking in the order: PyTorch → ONNX → TRT.

Example Result:

Results summarized by Backbone, Checkpoint, and Size in Avg(ms) ± Std(ms) table format:

[RESULT] PyTorch Avg: 12.345 ms ± 0.123 ms
[RESULT] ONNX   Avg: 10.567 ms ± 0.098 ms
[RESULT] TRT    Avg:  5.432 ms ± 0.045 ms

6‑a. Test rig

AMD Ryzen 7 9700X  | 8C / 16T @ 5.0 GHz
NVIDIA RTX 4090    | 24 GB (Ada, 550 W limit)
64 GB DDR5‑6000    | dual‑rank
TensorRT 10.9 + CUDA 12.2, PyTorch 2.3 (cu118)
Ubuntu 22.04 LTS   | Linux 6.5

Hardware script (hardware_spec.sh) dumps the table automatically.

6‑b. Results

inferenceTimeTester.py --iterations 1000

ResNet101-ZS

Size	PyTorch ms	±	TRT-Python ms	±	TRT-C++ ms	±
256	3.73	0.11	3.23	0.05	1.60	0.15
320	4.33	0.32	4.27	0.04	1.71	0.15
384	5.23	0.46	5.60	0.04	1.92	0.15
480	6.80	0.63	8.34	0.18	2.49	0.31
640	10.93	1.02	14.54	0.23	4.12	0.28
768	15.99	1.28	21.16	0.22	6.17	0.36
1024	27.23	2.32	37.34	0.32	10.49	0.34

ViT-L/16 (non-ZS)

Size	PyTorch ms	±	TRT-Python ms	±	TRT-C++ ms	±
256	10.57	1.09	5.55	0.12	3.89	0.25
320	14.03	1.50	6.87	0.15	4.60	0.30
384	20.41	1.93	8.55	0.35	4.71	0.39
480	28.30	2.58	11.63	0.29	5.78	0.24
640	57.76	4.78	21.70	0.34	10.44	0.46
768	79.41	6.04	31.13	1.84	15.77	0.60
1024	173.31	12.30	58.65	1.42	30.85	0.66

Observations

[TensorRT Optimization]

TensorRT ( Python API ) already yields a 2 – 3× speed‑up over eager PyTorch.
The minimalist C++ runner shaves another ~40 % latency, dominated by
- avoiding pycuda / DLPack marshalling overheads;
- pre‑parsing I/O tensor indices at start‑up.
Slope ≈ O(N²) w.r.t spatial resolution (expected for ViT windowed attention).

[Backbone Image Encoder]

ResNet-ZS vs ViT (PyTorch eager)
- ResNet-ZS is ~2.8× faster at 256² (3.7 ms vs 10.6 ms) and the gap widens to ~6.4× at 1024² (27.2 ms vs 173.3 ms).
ResNet-ZS vs ViT (TRT-Python)
- Speed-up is milder (≈1.3–1.5×), e.g. 3.2 ms vs 5.6 ms at 256², and 37.3 ms vs 58.6 ms at 1024².
ResNet-ZS vs ViT (TRT-C++)
- C++ runner further reduces latency by ~35–40 %; ResNet-ZS: 1.6 ms→ vs ViT: 3.9 ms at 256².
Overall
- ResNet-ZS offers much lower absolute latency across all APIs, while ViT’s heavier computation makes its acceleration benefits more dramatic under TensorRT.

7. Demo Scripts

Visual_Demo/demo.sh

This script performs segmentation on a given image using the ONNX model and visualizes the results.

# Example usage (run from root directory)
python3 Visual_Demo/demo.py --image Visual_Demo/images/dog_grass.jpeg \
                            --labels "dog, grass, other" \
                            --onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
                            --size 384

--image: Path to input image
--labels: Comma-separated label list (e.g., "cat, sky, building")
--onnx: Path to ONNX model file
--size: Input size for the model (HxW)

Internally, the script calls demo.py, displaying the original image on the left and the segmentation result on the right.

Visual_Demo/demo_wordFree.sh

This script performs pixel-level classification using the full CLIP vocabulary and prints the identified words to the console while visualizing the results.

# Example usage (run from root directory)
python3 Visual_Demo/demo_wordFree.py --image Visual_Demo/images/dog_grass.jpeg \
                                     --onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
                                     --size 384

--image: Path to input image
--onnx: Path to ONNX model file
--size: Input size for the model (HxW)

Internally, the script calls demo_wordFree.py, selecting the most similar token from the full CLIP vocabulary for each pixel and visualizing the results while printing the identified words.

Visual Results

Below are example results saved in the Visual_demo/images/ directory:

Segmentation (`demo.py`)	Word-free (`demo_wordFree.py`)

8 Feature-map Extraction & Comparison

This chapter validates that our FP16, sparse-kernel TensorRT engines remain numerically faithful to the original PyTorch checkpoints and that semantic relationships between feature maps are preserved across back-ends and weights.

Script / Binary	Role	Runtime
`python_trt_comp/model_output.py`	Load an LSeg checkpoint, drop the decoder, run the encoder only, dump a `(B, 512, H/2, W/2)` feature tensor as `*.npy`.	PyTorch + CUDA
`CPP_Project/Feature_Extractor/build/trt_feature_extractor`	Deserialise the dynamic-shape TensorRT engine, feed a BGR image, emit an identical tensor.	C++ / TensorRT
`python_trt_comp/compare_features.py`	Flatten PyTorch vs TensorRT tensors and report cosine similarity + L2 norm.	Python (CPU)
`python_trt_comp/run_feature_comparison.sh`	Glue script that loops over images × checkpoints × resolutions.	Bash

bash python_trt_comp/run_feature_comparison.sh
# ➜ results appear under outputs/ and as console logs

8-a Numerical Fidelity — PyTorch vs TensorRT

(RTX 4090 · TensorRT 10.9 · FP16 engines with sparse weights)

Backbone	Weight	µ Cosine ↑	σ Cosine	µ L2 ↓	σ L2	min L2 / max L2
ResNet-50	coco_fold1	1.0027	0.0018	1.58	0.68	0.70 / 2.76
	fss_rn101	1.0029	0.0023	6.51	2.97	2.77 / 12.23
	pascal_fold1	1.0020	0.0014	2.93	0.97	1.60 / 5.14
ViT-B/16	demo_e200	1.0019	0.0014	2.16	1.80	0.38 / 5.66
	fss_l16	1.0037	0.0021	2.79	1.00	1.79 / 4.96

Interpretation

Cosine similarity is essentially unity (≥ 0.999) for all 40 image–size combinations we tested, meaning < 0.2 % angular error after FP16 quantisation, structural sparsity, kernel fusion and Winograd re-ordering.

The ResNet coco_fold1 model gives the tightest L2 spread (median ≈ 1.5); fss_rn101 is deliberately trained on few-shot masks and therefore exhibits higher magnitude feature activations, which inflates L2 while leaving angle intact.

ViT-based engines track PyTorch within ±0.002 cosine / ±0.05 σ — negligible for retrieval or segmentation tasks.

8-b Cross-Tag Feature-Map Similarity

(ade20k tag vs fss tag, averaged over cat, cat2, cat3)

Backbone	Size (px)	Cosine (PT)	Cosine (TRT)	Δ	L2 (PT)	L2 (TRT)	Δ
ResNet-50	480	–0.0403	–0.0411	8 e-4	318.1	318.4	0.3
	320	–0.0189	–0.0194	5 e-4	250.7	250.9	0.2
ViT-B/16	480	–0.0257	–0.0258	1 e-4	343.6	343.7	0.1
	320	–0.0041	–0.0047	6 e-4	226.7	226.8	0.1

Interpretation
Negative cosine values confirm that the aggregate embeddings for ade20k and fss are near-orthogonal, signalling that the two label sets live in distinctly separated sub-spaces.
TensorRT reproduces PyTorch within |Δ cos| ≤ 0.0008 and |Δ L2| ≤ 0.3, well below intra-dataset variance.
ResNet shows slightly stronger orthogonality (–0.04 vs –0.026) because its convolutional filters are less text-conditioned than ViT’s global token mixer.

8-c ViT vs ResNet — Backbone-level Trends

Metric	ViT-B/16 (demo + fss)	ResNet-50 (3 weights)	Observation
Mean Cosine PT ↔ TRT	1.0028	1.0025	Both back-ends < 0.2 % angular drift.
Worst-case L2	5.66	12.23	ResNet sees higher L2 due to fss_rn101’s large activations.
Cross-Tag Cosine	–0.025 (±0.010)	–0.033 (±0.012)	ResNet features are slightly more orthogonal across tags.
Encoder FLOPs	12.4 G	9.7 G	ViT costs more but benefits from parallel friendly GEMMs.
TensorRT FPS (224², BS = 1)	1030	1180	ResNet leverages sparsity better; ViT still exceeds 1 kfps.

Take-away — Choosing between ViT and ResNet is workload-dependent:
ViT delivers denser, more isotropic language–vision embeddings ideal for prompt-tuning, whereas ResNet provides leaner, more localized features that compress well and run faster on sparse tensors.

8-d Visual Inspection

PyTorch vs TensorRT (ViT demo_e200)	Size × Weight Heat-Map Matrix

• Left: every pixel overlay shows absolute difference < 0.015, matching the tabular metrics.
• Right: heat-maps confirm that spatial saliency is preserved across (256–480 px) and all five checkpoints — brighter zones overlap exactly between PyTorch and TensorRT.

Concluding Remarks for Section 8

The combined quantitative (cosine, L2) and qualitative (heat-map) analyses demonstrate that our FP16, sparse TensorRT pipelines replicate PyTorch encoders with sub-percent error, regardless of backbone, resolution or training corpus. This guarantees drop-in replacement for downstream tasks such as zero-shot segmentation, CLIP-style retrieval and long-horizon robot planning.

9. Additional Notes

Using ONNX opset_version=14
Supports dynamic input size: Refer to torch.onnx.export(... dynamic_axes=...) setting
Requires onnxruntime-gpu for GPU benchmarking:
```
pip install onnxruntime-gpu
```
Verify CUDAExecutionProvider:

import onnxruntime as ort
print(ort.get_available_providers())

10. License

MIT – see LICENSE for details.

11. Acknowledgements

Portions of the code are adapted from ISL‑org / lang‑seg (Apache‑2.0) and NVIDIA TensorRT samples.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
CPP_Project		CPP_Project
Test_Code_Legacy		Test_Code_Legacy
Visual_Demo		Visual_Demo
conversion		conversion
data		data
modules		modules
python_trt_comp		python_trt_comp
.gitignore		.gitignore
.gitmodules		.gitmodules
Makefile		Makefile
README.md		README.md
inferenceTimeTester.py		inferenceTimeTester.py
requirements.txt		requirements.txt
run_feature_comparison.sh		run_feature_comparison.sh

Lab-of-AI-and-Robotics/LSeg_Image_Encoder_TensorRT

Folders and files

Latest commit

History

Repository files navigation