
INT64 clamping to INT32 creates overhead while using TensorRT #627

Closed

JingyaHuang opened this issue Dec 21, 2022 · 2 comments · Fixed by #655
JingyaHuang commented Dec 21, 2022

System Info

optimum: dev
CUDA: 11.3
cuDNN: 8.3.2
TensorRT: 8.4.1.5

Who can help?

@JingyaHuang

Reproduction

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

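# Export the model to ONNX on the fly and load it with the TensorRT execution provider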
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "philschmid/tiny-bert-sst2-distilled",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
)

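# The tokenizer returns int64 (torch.long) input_ids and attention_mask by default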
tokenizer = AutoTokenizer.from_pretrained("philschmid/tiny-bert-sst2-distilled")
inp = tokenizer("expectations were low, actual enjoyment was high", return_tensors="pt", padding=True)

result = ort_model(**inp)
assert ort_model.providers == ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

Log

2022-12-21 09:40:48.208932701 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 09:40:48 WARNING] external/onnx-tensorrt/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32

Expected behavior

Clamping should be done ahead of inference, not left to TensorRT at session creation time.

I will work on a convert_int64_to_int32() function to solve it.
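
Roughly, such a helper could look like the following (a minimal sketch only, handling just the graph initializers; the final implementation may differ):

import numpy as np
import onnx
from onnx import numpy_helper

def convert_int64_to_int32(model: onnx.ModelProto) -> onnx.ModelProto:
    # Clamp every INT64 initializer to the INT32 range and downcast it to INT32
    int32_min = np.iinfo(np.int32).min
    int32_max = np.iinfo(np.int32).max
    for initializer in model.graph.initializer:
        if initializer.data_type == onnx.TensorProto.INT64:
            weights = numpy_helper.to_array(initializer)
            clamped = np.clip(weights, int32_min, int32_max).astype(np.int32)
            initializer.CopyFrom(numpy_helper.from_array(clamped, initializer.name))
    return model

The converted model could then be saved with onnx.save() before creating the TensorRT session.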

@JingyaHuang JingyaHuang added the bug Something isn't working label Dec 21, 2022
@JingyaHuang JingyaHuang self-assigned this Dec 21, 2022

fxmarty commented Dec 21, 2022

@JingyaHuang the warning

2022-12-21 11:03:15.553857281 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 10:03:15 WARNING] external/onnx-tensorrt/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

is raised at model initialization, so I'm not sure there is much more we can do. The inputs themselves are int64, since that's what the tokenizer returns.
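
For example, a quick check with the same tokenizer as in the reproduction above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/tiny-bert-sst2-distilled")
inp = tokenizer("expectations were low, actual enjoyment was high", return_tensors="pt")

# Tokenizers return torch.int64 tensors by default
print(inp["input_ids"].dtype)       # torch.int64
print(inp["attention_mask"].dtype)  # torch.int64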


fxmarty commented Dec 30, 2022

I have a working prototype for this. trtexec comes in very handy for testing. The warnings come from Slice operators whose end indices are set to the maximum representable int64 value; clamping them to np.iinfo(np.int32).max and casting with astype(np.int32) removes the warnings.
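
To illustrate the idea (sketch only; the exact handling in the fix may differ, and this assumes the Slice ends come from graph initializers rather than Constant nodes):

import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")

# The "ends" input of a Slice node is its third input (data, starts, ends, axes, steps)
slice_ends = {
    node.input[2]
    for node in model.graph.node
    if node.op_type == "Slice" and len(node.input) > 2
}

for initializer in model.graph.initializer:
    if initializer.name in slice_ends and initializer.data_type == onnx.TensorProto.INT64:
        ends = numpy_helper.to_array(initializer)
        # INT64_MAX sentinels become INT32_MAX so the downcast to INT32 is safe
        clamped = np.clip(ends, np.iinfo(np.int32).min, np.iinfo(np.int32).max).astype(np.int32)
        initializer.CopyFrom(numpy_helper.from_array(clamped, initializer.name))

onnx.save(model, "model_clamped.onnx")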

