✨ End‑to‑end pipeline for converting the LSeg image encoder to ONNX / TensorRT, benchmarking PyTorch ↔ TRT speed, and verifying numerical fidelity.
This project supports two types of LSeg image encoder backbones, allowing experimental comparison of their performance:
- ViT-L/16 (non-ZS)
- Conversion script:
conversion/model_to_onnx.py
- Example ONNX filename:
lseg_img_enc_vit_demo_e200.onnx
- Conversion script:
- ResNet101 (ZS-Variant)
- Conversion script:
conversion/model_to_onnx_zs.py
- Example ONNX filename:
lseg_img_enc_rn101_fss_rn101.onnx
- Conversion script:
You can subsequently use the TRT conversion script (conversion/onnx_to_trt.py
) to generate TensorRT engines and conduct comparison experiments.
# (1) install python deps
pip install -r requirements.txt # CUDA / OpenCV must already be available
# (2) build both C++ projects in one shot
make # or make -j12
# (3) optional – run the latency benchmark
python3 inferenceTimeTester.py \
--weights models/weights/demo_e200.ckpt \
--img_sizes 260 390 520 650 780 910
# (4) run the full feature‑comparison pipeline
bash python_trt_comp/run_feature_comparison.sh
make clean
→ removes allCPP_Project/**/build
directories + temporary CMake artefacts.- The root‑level
Makefile
is just a thin wrapper thatcmake --build
’s each sub‑directory – it does not introduce any extra dependencies.
- System Package
sudo apt update && sudo apt install -y \
python3-pip python3-dev \
libopencv-dev \
libprotobuf-dev protobuf-compiler \
libtinfo5 \
libopenmpi-dev \
cuda-toolkit-## # CUDA Version : At least Minimum Requirement for TensorRT 10.9
- Python ≥3.8
pip install -r requirements.txt
- CUDA + TensorRT 10.9 already installed (the repo never calls the TRT builder directly – it simply links against the headers/lib).
- Optional but recommended:
opencv-dev
for the C++ extractor.
# from project root
make -j$(nproc) # builds …
# • CPP_Project/Inference_Time_Tester
# • CPP_Project/Feature_Extractor
The helper Makefile
simply iterates through every CPP_Project/*/CMakeLists.txt
, wipes the old build/
directory, configures with cmake -S . -B build
, then invokes the native generator.
cd CPP_Project/Inference_Time_Tester && cmake -B build -S . && cmake --build build -j
cd CPP_Project/Feature_Extractor && cmake -B build -S . && cmake --build build -j
LSeg_Image_Encoder_TensorRT/
│
├── CPP_Project/ # C++ projects
│ ├── Feature_Extractor/ # Feature extractor project
│ │ ├── CMakeLists.txt # CMake configuration
│ │ └── main.cpp # Feature extractor main code
│ ├── Inference_Time_Tester/ # Inference benchmarking project
│ │ ├── CMakeLists.txt # CMake configuration
│ │ └── main.cpp # Benchmarking main code
│ └── third_party/ # Third-party libraries
│ └── cnpy/ # CNpy submodule for numpy I/O
│
├── Visual_Demo/ # Demo scripts and results
│ ├── demo.sh # demo.sh script
│ ├── demo.py # Python wrapper for demo.py
│ ├── demo_wordFree.sh # demo_wordFree.sh script
│ ├── demo_wordFree.py # Python wrapper for demo_wordFree.py
│ └── images/ # Visualization results and input images
│ ├── Dog_grass_demo.png # Example segmentation result
│ ├── Dog_grass_wordFree.png # Example word-free segmentation result
│ └── dog_grass.jpeg # Example input image
│
├── models/
│ ├── weights/
│ │ ├── ViT/
│ │ │ ├── demo_e200.ckpt # ViT-L/16 CLIP checkpoint
│ │ │ └── fss_l16.ckpt # FSS-trained ViT model
│ │ └── Resnet/
│ │ ├── coco_fold1.ckpt # ResNet-ZS custom model
│ │ ├── fss_rn101.ckpt # ResNet-ZS FSS variant
│ │ └── pascal_fold1.ckpt # ResNet-ZS custom model
│ ├── onnx_engines/
│ │ ├── lseg_img_enc_vit_demo_e200.onnx
│ │ ├── lseg_img_enc_vit_fss_l16.onnx
│ │ ├── lseg_img_enc_rn101_coco_fold1.onnx
│ │ ├── lseg_img_enc_rn101_fss_rn101.onnx
│ │ └── lseg_img_enc_rn101_pascal_fold1.onnx
│ └── trt_engines/
│ └── <...>.trt # Auto-generated TensorRT engines
│
├── modules/ # LSeg model-related source code
│ ├── lseg_module.py # LSegModule: wraps image encoder + head
│ ├── lseg_full.py # LSegFull: complete network (encoder + decoder)
│ ├── models/ # Internal submodules
│ ├── lseg_blocks.py # RefineNet blocks and skip-connections
│ ├── lseg_net.py # Network assembly utilities
│ └── lseg_vit.py # CLIP ViT layer partitioning and feature extraction
│
├── conversion/ # Model conversion scripts
│ ├── model_to_onnx.py # ViT‐L/16 → ONNX
│ ├── model_to_onnx_zs.py # ResNet101-ZS → ONNX
│ └── onnx_to_trt.py # ONNX → TensorRT (common)
│
├── python_trt_comp/ # Python-based feature comparison scripts
│ ├── compare_features.py # Compare feature maps (Cosine/L2)
│ ├── compare_inputs.py # Compare input tensors
│ ├── model_output.py # PyTorch feature extraction script
│ └── run_feature_comparison.sh # Run the complete comparison pipeline
│
├── inferenceTimeTester.py # Main inference benchmarking script (root directory)
│
├── requirements.txt # Python package list
├── Makefile # One-shot builder wrapper
└── README.md # This file
Download weights from the official LSeg repository: https://github.com/isl-org/lang-seg
pip install gdown
# Main ViT-L/16 model (demo_e200.ckpt)
gdown 'https://drive.google.com/uc?id=1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7'
# FSS-based models
# fss_rn101.ckpt (ResNet101)
gdown 'https://drive.google.com/uc?id=1UIj49Wp1mAopPub5M6O4WW-Z79VB1bhw'
# fss_l16.ckpt (ViT-L/16)
gdown 'https://drive.google.com/uc?id=1Nplkc_JsHIS55d--K2vonOOC3HrppzYy'
Save the downloaded checkpoints under models/weights/
.
-
ViT Backbone
python3 conversion/model_to_onnx.py \ --weights models/weights/ViT/demo_e200.ckpt
--weights
: Checkpoint address
→
models/onnx_engines/lseg_img_enc_vit_demo_e200.onnx
-
ResNet-ZS Backbone
python3 conversion/model_to_onnx_zs.py \ --weights models/weights/Resnet/fss_rn101.ckpt
--weights
: Checkpoint address
→
models/onnx_engines/lseg_img_enc_rn101_fss_rn101.onnx
Caution: The TensorRT conversion depends on your GPU and environment. Therefore, it must be performed on the target device.
python3 conversion/onnx_to_trt.py \
--onnx models/onnx_engines/<base>.onnx \
--workspace 1073741824 \
--fp16 \
--sparse \
--disable-timing-cache \
--gpu-fallback \
--debug
-
Both ViT and ResNet-ZS commonly uses
onnx_to_trt.py
. -
Generated engines will be saved in models/trt_engines/.
Option | Type | Default | Description |
---|---|---|---|
--onnx <PATH> |
required | — | Path to input ONNX file |
--workspace <BYTE> |
integer | 1<<29 |
Builder workspace memory in bytes |
--fp16 / --no-fp16 |
flag | true | Enable or disable FP16 precision |
--sparse / --no-sparse |
flag | true | Enable or disable sparse weights |
--disable-timing-cache |
flag | false | Disable timing cache (↑ stability, ↓ speed) |
--gpu-fallback |
flag | false | Allow GPU fallback in INT8 mode |
--debug |
flag | false | Enable debug logging |
Engine filename auto-generation rule:
base__<option1>_<option2>_..._<wsXXMiB>.trt
Run inference/inferenceTimeTester.py
to benchmark the latency of PyTorch, ONNX, TensorRT.
python3 inferenceTimeTester.py \
--weights_dir models/weights \
--img_sizes 256 320 384 480 640 768 1024 \
--iterations 1000 \
--trt_fp16 --trt_sparse --trt_no_tc --trt_gpu_fb --trt_debug \
--trt_workspace 1073741824
-
--weights_dir models/weights
-
ViT non-ZS Model :
.ckpt
in theViT/
-
ResNet-ZS Model :
.ckpt
in theResnet/
-
--img_sizes
: List of input sizes for benchmarking--iterations
: Number of iterations--trt_*
: TRT Build Options (Automatically apply to ONNX→TRT)
Script Behavior:
- Automatically generates ONNX file if missing.
- Automatically generates TensorRT engine if missing.
- Performs inference benchmarking in the order: PyTorch → ONNX → TRT.
Example Result:
- Results summarized by Backbone, Checkpoint, and Size in Avg(ms) ± Std(ms) table format:
[RESULT] PyTorch Avg: 12.345 ms ± 0.123 ms [RESULT] ONNX Avg: 10.567 ms ± 0.098 ms [RESULT] TRT Avg: 5.432 ms ± 0.045 ms
AMD Ryzen 7 9700X | 8C / 16T @ 5.0 GHz
NVIDIA RTX 4090 | 24 GB (Ada, 550 W limit)
64 GB DDR5‑6000 | dual‑rank
TensorRT 10.9 + CUDA 12.2, PyTorch 2.3 (cu118)
Ubuntu 22.04 LTS | Linux 6.5
Hardware script (hardware_spec.sh
) dumps the table automatically.
inferenceTimeTester.py --iterations 1000
Size | PyTorch ms | ± | TRT-Python ms | ± | TRT-C++ ms | ± |
---|---|---|---|---|---|---|
256 | 3.73 | 0.11 | 3.23 | 0.05 | 1.60 | 0.15 |
320 | 4.33 | 0.32 | 4.27 | 0.04 | 1.71 | 0.15 |
384 | 5.23 | 0.46 | 5.60 | 0.04 | 1.92 | 0.15 |
480 | 6.80 | 0.63 | 8.34 | 0.18 | 2.49 | 0.31 |
640 | 10.93 | 1.02 | 14.54 | 0.23 | 4.12 | 0.28 |
768 | 15.99 | 1.28 | 21.16 | 0.22 | 6.17 | 0.36 |
1024 | 27.23 | 2.32 | 37.34 | 0.32 | 10.49 | 0.34 |
Size | PyTorch ms | ± | TRT-Python ms | ± | TRT-C++ ms | ± |
---|---|---|---|---|---|---|
256 | 10.57 | 1.09 | 5.55 | 0.12 | 3.89 | 0.25 |
320 | 14.03 | 1.50 | 6.87 | 0.15 | 4.60 | 0.30 |
384 | 20.41 | 1.93 | 8.55 | 0.35 | 4.71 | 0.39 |
480 | 28.30 | 2.58 | 11.63 | 0.29 | 5.78 | 0.24 |
640 | 57.76 | 4.78 | 21.70 | 0.34 | 10.44 | 0.46 |
768 | 79.41 | 6.04 | 31.13 | 1.84 | 15.77 | 0.60 |
1024 | 173.31 | 12.30 | 58.65 | 1.42 | 30.85 | 0.66 |
Observations
[TensorRT Optimization]
- TensorRT ( Python API ) already yields a 2 – 3× speed‑up over eager PyTorch.
- The minimalist C++ runner shaves another ~40 % latency, dominated by
- avoiding
pycuda
/ DLPack marshalling overheads; - pre‑parsing I/O tensor indices at start‑up.
- avoiding
- Slope ≈ O(N²) w.r.t spatial resolution (expected for ViT windowed attention).
[Backbone Image Encoder]
- ResNet-ZS vs ViT (PyTorch eager)
- ResNet-ZS is ~2.8× faster at 256² (3.7 ms vs 10.6 ms) and the gap widens to ~6.4× at 1024² (27.2 ms vs 173.3 ms).
- ResNet-ZS vs ViT (TRT-Python)
- Speed-up is milder (≈1.3–1.5×), e.g. 3.2 ms vs 5.6 ms at 256², and 37.3 ms vs 58.6 ms at 1024².
- ResNet-ZS vs ViT (TRT-C++)
- C++ runner further reduces latency by ~35–40 %; ResNet-ZS: 1.6 ms→ vs ViT: 3.9 ms at 256².
- Overall
- ResNet-ZS offers much lower absolute latency across all APIs, while ViT’s heavier computation makes its acceleration benefits more dramatic under TensorRT.
This script performs segmentation on a given image using the ONNX model and visualizes the results.
# Example usage (run from root directory)
python3 Visual_Demo/demo.py --image Visual_Demo/images/dog_grass.jpeg \
--labels "dog, grass, other" \
--onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
--size 384
--image
: Path to input image--labels
: Comma-separated label list (e.g., "cat, sky, building")--onnx
: Path to ONNX model file--size
: Input size for the model (HxW)
Internally, the script calls demo.py, displaying the original image on the left and the segmentation result on the right.
This script performs pixel-level classification using the full CLIP vocabulary and prints the identified words to the console while visualizing the results.
# Example usage (run from root directory)
python3 Visual_Demo/demo_wordFree.py --image Visual_Demo/images/dog_grass.jpeg \
--onnx models/onnx_engines/lseg_img_enc_vit_ade20k.onnx \
--size 384
--image
: Path to input image--onnx
: Path to ONNX model file--size
: Input size for the model (HxW)
Internally, the script calls demo_wordFree.py
, selecting the most similar token from the full CLIP vocabulary for each pixel and visualizing the results while printing the identified words.
Below are example results saved in the Visual_demo/images/
directory:
Segmentation (demo.py ) |
Word-free (demo_wordFree.py ) |
|
---|---|---|
![]() |
![]() |
This chapter validates that our FP16, sparse-kernel TensorRT engines remain numerically faithful to the original PyTorch checkpoints and that semantic relationships between feature maps are preserved across back-ends and weights.
Script / Binary | Role | Runtime |
---|---|---|
python_trt_comp/model_output.py |
Load an LSeg checkpoint, drop the decoder, run the encoder only, dump a (B, 512, H/2, W/2) feature tensor as *.npy . |
PyTorch + CUDA |
CPP_Project/Feature_Extractor/build/trt_feature_extractor |
Deserialise the dynamic-shape TensorRT engine, feed a BGR image, emit an identical tensor. | C++ / TensorRT |
python_trt_comp/compare_features.py |
Flatten PyTorch vs TensorRT tensors and report cosine similarity + L2 norm. | Python (CPU) |
python_trt_comp/run_feature_comparison.sh |
Glue script that loops over images × checkpoints × resolutions. | Bash |
bash python_trt_comp/run_feature_comparison.sh
# ➜ results appear under outputs/ and as console logs
(RTX 4090 · TensorRT 10.9 · FP16 engines with sparse weights)
Backbone | Weight | µ Cosine ↑ | σ Cosine | µ L2 ↓ | σ L2 | min L2 / max L2 |
---|---|---|---|---|---|---|
ResNet-50 | coco_fold1 | 1.0027 | 0.0018 | 1.58 | 0.68 | 0.70 / 2.76 |
fss_rn101 | 1.0029 | 0.0023 | 6.51 | 2.97 | 2.77 / 12.23 | |
pascal_fold1 | 1.0020 | 0.0014 | 2.93 | 0.97 | 1.60 / 5.14 | |
ViT-B/16 | demo_e200 | 1.0019 | 0.0014 | 2.16 | 1.80 | 0.38 / 5.66 |
fss_l16 | 1.0037 | 0.0021 | 2.79 | 1.00 | 1.79 / 4.96 |
Interpretation
Cosine similarity is essentially unity (≥ 0.999) for all 40 image–size combinations we tested, meaning < 0.2 % angular error after FP16 quantisation, structural sparsity, kernel fusion and Winograd re-ordering.
The ResNet coco_fold1 model gives the tightest L2 spread (median ≈ 1.5); fss_rn101 is deliberately trained on few-shot masks and therefore exhibits higher magnitude feature activations, which inflates L2 while leaving angle intact.
ViT-based engines track PyTorch within ±0.002 cosine / ±0.05 σ — negligible for retrieval or segmentation tasks.
(ade20k tag vs fss tag, averaged over cat, cat2, cat3)
Backbone | Size (px) | Cosine (PT) | Cosine (TRT) | Δ | L2 (PT) | L2 (TRT) | Δ |
---|---|---|---|---|---|---|---|
ResNet-50 | 480 | –0.0403 | –0.0411 | 8 e-4 | 318.1 | 318.4 | 0.3 |
320 | –0.0189 | –0.0194 | 5 e-4 | 250.7 | 250.9 | 0.2 | |
ViT-B/16 | 480 | –0.0257 | –0.0258 | 1 e-4 | 343.6 | 343.7 | 0.1 |
320 | –0.0041 | –0.0047 | 6 e-4 | 226.7 | 226.8 | 0.1 |
Interpretation
Negative cosine values confirm that the aggregate embeddings for ade20k and fss are near-orthogonal, signalling that the two label sets live in distinctly separated sub-spaces.
TensorRT reproduces PyTorch within |Δ cos| ≤ 0.0008 and |Δ L2| ≤ 0.3, well below intra-dataset variance.
ResNet shows slightly stronger orthogonality (–0.04 vs –0.026) because its convolutional filters are less text-conditioned than ViT’s global token mixer.
Metric | ViT-B/16 (demo + fss) | ResNet-50 (3 weights) | Observation |
---|---|---|---|
Mean Cosine PT ↔ TRT | 1.0028 | 1.0025 | Both back-ends < 0.2 % angular drift. |
Worst-case L2 | 5.66 | 12.23 | ResNet sees higher L2 due to fss_rn101’s large activations. |
Cross-Tag Cosine | –0.025 (±0.010) | –0.033 (±0.012) | ResNet features are slightly more orthogonal across tags. |
Encoder FLOPs | 12.4 G | 9.7 G | ViT costs more but benefits from parallel friendly GEMMs. |
TensorRT FPS (224², BS = 1) | 1030 | 1180 | ResNet leverages sparsity better; ViT still exceeds 1 kfps. |
Take-away — Choosing between ViT and ResNet is workload-dependent:
ViT delivers denser, more isotropic language–vision embeddings ideal for prompt-tuning, whereas ResNet provides leaner, more localized features that compress well and run faster on sparse tensors.
• Left: every pixel overlay shows absolute difference < 0.015, matching the tabular metrics.
• Right: heat-maps confirm that spatial saliency is preserved across (256–480 px) and all five checkpoints — brighter zones overlap exactly between PyTorch and TensorRT.
The combined quantitative (cosine, L2) and qualitative (heat-map) analyses demonstrate that our FP16, sparse TensorRT pipelines replicate PyTorch encoders with sub-percent error, regardless of backbone, resolution or training corpus. This guarantees drop-in replacement for downstream tasks such as zero-shot segmentation, CLIP-style retrieval and long-horizon robot planning.
- Using ONNX opset_version=14
- Supports dynamic input size: Refer to
torch.onnx.export(... dynamic_axes=...)
setting - Requires
onnxruntime-gpu
for GPU benchmarking:pip install onnxruntime-gpu
- Verify CUDAExecutionProvider:
import onnxruntime as ort
print(ort.get_available_providers())
MIT – see LICENSE
for details.
Portions of the code are adapted from ISL‑org / lang‑seg (Apache‑2.0) and NVIDIA TensorRT samples.