vit executorch inference speed much slower than onnx #6961
Comments
@salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.
Yeah, I know; ONNX is also using the CPU. I ran a 300M ViT model on Android with an 8 * 3 * 448 * 448 input, and the inference latency is quite high, about 200 s — much slower than Llama 2B on Android (TTFT 0.5 s + 30 tokens/s), and also much slower than ViT ONNX inference on Android. Both Llama 2B and the ViT run with the XNNPACK backend.
Some things to check: make sure ExecuTorch is built in release mode, and check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend). Another thing to call out: because ExecuTorch is focused on mobile, we usually have better performance on ARM CPUs than on x86. cc @digantdesai for ExecuTorch vs. ONNX perf issues with XNNPACK
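For reference, a minimal sketch of one way to inspect the post-delegation graph. Module paths follow recent ExecuTorch releases and may differ in 0.3, and the model class and example input are placeholders, not the exact export script used in this issue:

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = InternViT300M().eval()                  # placeholder for the actual ViT module
example_inputs = (torch.randn(8, 3, 448, 448),)

# Export to the edge dialect, then partition/lower supported subgraphs to XNNPACK.
edge = to_edge(export(model, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())

# Anything still printed as an aten/edge op here was NOT delegated and will run
# on ExecuTorch's portable (or optimized) kernels at runtime.
print(edge.exported_program())
```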
I've made sure ExecuTorch is built in release mode. My main concern is that inference speed is good for ExecuTorch Llama2-2B on Android, but quite slow for the ViT under a similar export method and settings. Is this expected behavior, or has something gone wrong? @metascroy @digantdesai
Thanks @salvadog for trying this out. And I am glad Llama is running with decent perf for you on the Android phone.
This is not what I would expect. I guess some operators could be running on the reference (also slow) implementation and not on XNNPACK.
As @metascroy suggested, can we try this?
Thanks for helping out! My export commands are Llama: VIT:
I've attached the Llama and VIT export logs; the VIT log is quite long, so I only attached the beginning and ending parts. I didn't see information about the model graph in the VIT log. Could you tell me how to modify the code to print the graph after to_backend?
And let me know if any other information is needed.
Thanks a ton for sharing the output. Looking at the ViT text file and the graph post-delegation,
a bunch of operators from the ViT graph are running outside XNNPACK. In ET they can run either from the optimized library or from the portable library, and the portable implementations of bmm or gelu can be slow. You can validate this by doing something like the check sketched below. And skimming the CMake file, it seems like we may not be using the optimized library.
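The exact validation snippet isn't captured in this thread; one hedged way to count the ops that were left un-delegated, assuming the `edge` program from the sketch earlier in the thread, could be:

```python
from collections import Counter

def non_delegated_ops(exported_program):
    """Tally call_function nodes that are not XNNPACK delegate calls,
    i.e. ops that will fall back to portable/optimized kernels."""
    return Counter(
        str(node.target)
        for node in exported_program.graph.nodes
        if node.op == "call_function"
        and "executorch_call_delegate" not in str(node.target)
        and "getitem" not in str(node.target)
    )

# Expect to see aten.bmm / aten.gelu here if they were not lowered to XNNPACK.
print(non_delegated_ops(edge.exported_program()))
```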
Thank you so much for your invaluable help, @digantdesai! I've included the output from the Android ET runner and the simpleperf report. It currently takes 55 seconds to process a tensor of shape [8, 3, 448, 448] with the 300M ViT model. The simpleperf report indicates that the BMM operation is not leveraging XNNPACK and is responsible for 70% of the total time. Does this imply that if we optimized the BMM operation with XNNPACK, we could potentially reduce the total time to 55 s * 0.3 ≈ 16 s? Even so, that would still be a significant amount of time. I'm curious about the expected performance for running a ViT model of this scale with ET and whether there are any benchmarks or examples I could use for reference. Additionally, I'm eager to explore optimization strategies to reach my target latency of 1 second. Is this goal attainable, and if so, what steps should I take to optimize the performance further?
🐛 Describe the bug
I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX, on both a Linux PC and an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.
Environment:
onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15
Linux pc hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1
Reproduction Steps:
The ViT is an InternVIT-300M model with a 7 * 3 * 448 * 448 input size.
I export the ViT model with:
python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
And run inference on the Linux PC with:
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte
And run inference on Android with:
adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte
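For context, a minimal sketch of the kind of ONNX Runtime CPU baseline being compared against; the file name, input shape, and timing harness are assumptions, not the exact script used:

```python
import time
import numpy as np
import onnxruntime as ort

# Force the CPU execution provider so the comparison with ExecuTorch is CPU vs. CPU.
sess = ort.InferenceSession("internvit.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(7, 3, 448, 448).astype(np.float32)

t0 = time.time()
sess.run(None, {sess.get_inputs()[0].name: x})
print(f"ONNX Runtime CPU latency: {time.time() - t0:.1f}s")
```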
Expected Behavior:
I'm not intimately familiar with inference times for either ONNX or ExecuTorch, but I expected them to be within an acceptable performance margin of each other. I've already exported a Llama2-2B model with reasonable speed (TTFT 0.5 s + 30 tokens/s) on my Android phone, so I expected ViT-300M inference speed to be somewhat similar.
Actual Behavior:
onnx inference time on linux pc: 12s
vit executorch inference time on linux pc: 450s
vit executorch inference time on Android: 200s
Questions:
Is there any known performance regression in executorch compared to ONNX?
Are there any optimization techniques or configurations that can improve ViT ExecuTorch's performance?
I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.
Versions
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31
Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY