This repo contains a Vision Transformer video model based on the VideoMAE V2 paper and code, as well as examples for compiling the model using TensorRT and running inference using the built engine.
It accompanies a blog post describing an issue with compiling this model using TensorRT. To get a working engine, find the `Uncomment and change this to use the desired attention module` line in the model code and switch to one of the working Attention layers.
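As a point of reference, the variants such a comment switches between usually differ only in how the core scaled dot-product attention step is written. The sketch below is illustrative only; the actual class names and module layout are in this repo's model file:

```python
# Illustrative attention variants of the kind the comment refers to; the real
# classes in this repo may be named and structured differently.
import torch.nn as nn
import torch.nn.functional as F

class ExplicitAttention(nn.Module):
    """Attention written out as matmul + softmax."""
    def __init__(self, dim: int, num_heads: int = 6):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, heads, N, head_dim)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class SDPAAttention(ExplicitAttention):
    """Same layer, but delegating the core op to F.scaled_dot_product_attention."""
    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```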
Download the distilled checkpoint by running:
```bash
wget https://huggingface.co/OpenGVLab/VideoMAE2/resolve/main/distill/vit_s_k710_dl_from_giant.pth
```
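If you want a quick sanity check of the downloaded file, it can be opened with `torch.load`; the key layout assumed below (a `"module"`/`"model"` wrapper or a flat state dict) is a guess, not something this repo guarantees:

```python
import torch

# Peek at the downloaded checkpoint; the wrapper-key handling is an assumption.
ckpt = torch.load("vit_s_k710_dl_from_giant.pth", map_location="cpu")
state_dict = ckpt.get("module", ckpt.get("model", ckpt)) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} entries, e.g. {list(state_dict)[:3]}")
```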
After downloading the weights, you can run inference on the included video:

```bash
uv run python main.py infer
```

This should print something like:
```
making tea: 0.81
setting table: 0.01
opening door: 0.01
```
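The scores read as class probabilities over the action labels (710 of them, going by the `k710` checkpoint name). A top-3 printout like the one above is typically produced along these lines; this is a generic sketch, not the code from `main.py`:

```python
import torch

# Generic sketch of turning logits into a top-3 printout; main.py's own
# post-processing and label names may differ.
logits = torch.randn(1, 710)                  # one clip, 710 action classes
labels = [f"class_{i}" for i in range(710)]   # stand-in for the real label list
probs = logits.softmax(dim=-1)[0]
top = probs.topk(3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{labels[idx]}: {p:.2f}")
```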
Note: be sure to use an Attention layer that works with TensorRT.
```bash
# Use faster settings for torch inference (half precision, torch.compile, ..)
uv run python main.py infer --fast
```
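In isolation, the ingredients behind `--fast` (half precision and `torch.compile`) look roughly like this; the tiny model below is a stand-in, not the code from `main.py`:

```python
import torch
import torch.nn as nn

# Stand-in model to illustrate the --fast ingredients: fp16 weights/inputs on
# GPU plus torch.compile; main.py applies the same ideas to the real video ViT.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.half if device == "cuda" else torch.float32
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 710))
model = model.to(device=device, dtype=dtype).eval()
model = torch.compile(model)                  # JIT-compile and fuse the forward
x = torch.randn(1, 768, device=device, dtype=dtype)
with torch.inference_mode():
    print(model(x).shape)                     # torch.Size([1, 710])
```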
```bash
# Export the model to an ONNX file
uv run python main.py export_onnx
# and run inference using ONNX runtime
uv run python main.py infer_onnx
```
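For orientation, this kind of export usually comes down to a single `torch.onnx.export` call on a dummy clip, after which the ONNX file can be run with ONNX Runtime. Everything below (the stand-in module, the `(1, 3, 16, 224, 224)` input shape, the tensor names) is an assumption, not necessarily what `export_onnx` uses:

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Generic ONNX export + ONNX Runtime check with a stand-in module.
class TinyVideoNet(nn.Module):
    def forward(self, x):                 # x: (B, C, T, H, W)
        return x.mean(dim=(2, 3, 4))      # placeholder for the real ViT forward

model = TinyVideoNet().eval()
dummy = torch.randn(1, 3, 16, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["video"], output_names=["logits"],
                  opset_version=17)

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print(sess.run(None, {"video": dummy.numpy()})[0].shape)
```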
```bash
# Build a TensorRT engine from the ONNX
uv run python build_trt.py
# and run inference using the built engine
uv run python main.py infer_trt
```
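At its core, a `build_trt.py`-style script parses the ONNX file and asks the TensorRT builder for a serialized engine. The sketch below shows that generic flow with the TensorRT Python API; the actual script's configuration (precision flags, optimization profiles, file names) may differ:

```python
import tensorrt as trt

# Generic ONNX -> TensorRT engine build.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is the default (and deprecated as a flag) on TensorRT 10+,
# but passing it keeps this snippet working on 8.x as well.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
with open("model.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

Running the engine afterwards (what `infer_trt` does) amounts to deserializing it with a `trt.Runtime`, creating an execution context, and shuttling input/output buffers to and from the GPU.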
Checking with different TensorRT versions can be done using Docker and NVIDIA's PyTorch images:
```bash
$ docker run --gpus all --rm -it -v $(pwd):/code -w /code nvcr.io/nvidia/pytorch:24.12-py3 bash
root@cd60802e9604:/code# pip install "onnxruntime>=1.17.1" "pyav<14.0.0" "timm>=1.0.12"
root@cd60802e9604:/code# python ./main.py export_onnx && python ./build_trt.py && python ./main.py infer_trt
```
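To confirm which TensorRT build a given image ships (useful when comparing versions), a quick check inside the container:

```python
import tensorrt

print(tensorrt.__version__)  # TensorRT version bundled with the chosen NGC image
```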