vit executorch inference speed much slower than onnx #6961
Comments
@salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.
Yeah, I know; ONNX is also using the CPU. I ran a 300M ViT model on Android with an 8 * 3 * 448 * 448 input, and the inference latency is quite high, about 200 s — much slower than Llama 2B on Android (TTFT 0.5 s + 30 tokens/s), and also much slower than ViT ONNX inference on Android. Both Llama 2B and the ViT run with the XNNPACK backend.
Some things to check: make sure ExecuTorch is built in release mode, and check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend). Another thing to call out: because ExecuTorch is focused on mobile, we usually have better performance on ARM CPUs than on x86. cc @digantdesai for ExecuTorch vs. ONNX perf issues with XNNPACK
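For reference, a minimal sketch of one way to inspect the post-delegation graph. Module paths follow recent ExecuTorch releases and may differ in 0.3, and the model class and example input are placeholders, not the exact export script used in this issue:

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = InternViT300M().eval()                  # placeholder for the actual ViT module
example_inputs = (torch.randn(8, 3, 448, 448),)

# Export to the edge dialect, then partition/lower supported subgraphs to XNNPACK.
edge = to_edge(export(model, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())

# Anything still printed as an aten/edge op here was NOT delegated and will run
# on ExecuTorch's portable (or optimized) kernels at runtime.
print(edge.exported_program())
```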
I've made sure ExecuTorch is built in release mode. My main concern is that inference speed is good for ExecuTorch Llama2-2B on Android, but quite slow for the ViT under a similar export method and settings. Is this expected behavior, or has something gone wrong? @metascroy @digantdesai
Thanks @salvadog for trying this out. And I am glad Llama is running with decent perf for you on the Android phone.
This is not what I would expect. I guess some operators could be running on the reference (also slow) implementation and not on XNNPACK.
As @metascroy suggested, can we try this?
Thanks for helping out! My export commands are Llama: VIT:
I've attached the Llama and VIT export logs; the VIT log is quite long, so I only attached the beginning and ending parts. I didn't see information about the model graph in the VIT log. Could you tell me how to modify the code to print the graph after to_backend?
And let me know if any other information is needed.
Thanks a ton for sharing the output. Looking at the ViT text file and the graph post-delegation,
a bunch of operators from the ViT graph are running outside XNNPACK. In ET they can run either from the optimized library or from the portable library, and the portable implementations of bmm or gelu can be slow. You can validate this by doing something like the check sketched below. And skimming the CMake file, it seems like we may not be using the optimized library.
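The exact validation snippet isn't captured in this thread; one hedged way to count the ops that were left un-delegated, assuming the `edge` program from the sketch earlier in the thread, could be:

```python
from collections import Counter

def non_delegated_ops(exported_program):
    """Tally call_function nodes that are not XNNPACK delegate calls,
    i.e. ops that will fall back to portable/optimized kernels."""
    return Counter(
        str(node.target)
        for node in exported_program.graph.nodes
        if node.op == "call_function"
        and "executorch_call_delegate" not in str(node.target)
        and "getitem" not in str(node.target)
    )

# Expect to see aten.bmm / aten.gelu here if they were not lowered to XNNPACK.
print(non_delegated_ops(edge.exported_program()))
```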
Thank you so much for your invaluable help, @digantdesai! I've included the output from the Android ET runner and the simpleperf report. It currently takes 55 seconds to process a tensor of shape [8, 3, 448, 448] with the 300M ViT model. The simpleperf report indicates that the BMM operation is not leveraging XNNPACK and is responsible for 70% of the total time. Does this imply that if we optimized the BMM operation with XNNPACK, we could potentially reduce the total time to 55 s * 0.3 ≈ 16 s? Even so, that would still be a significant amount of time. I'm curious about the expected performance for running a ViT model of this scale with ET and whether there are any benchmarks or examples I could use for reference. Additionally, I'm eager to explore optimization strategies to reach my target latency of 1 second. Is this goal attainable, and if so, what steps should I take to optimize the performance further?
🐛 Describe the bug
I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX, on both a Linux PC and an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.
Environment:
onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15
Linux pc hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1
Reproduction Steps:
The ViT is an InternVIT-300M model with a 7 * 3 * 448 * 448 input size.
I export the ViT model with:
python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
And run inference on the Linux PC with:
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte
And run inference on Android with:
adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte
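For context, a minimal sketch of the kind of ONNX Runtime CPU baseline being compared against; the file name, input shape, and timing harness are assumptions, not the exact script used:

```python
import time
import numpy as np
import onnxruntime as ort

# Force the CPU execution provider so the comparison with ExecuTorch is CPU vs. CPU.
sess = ort.InferenceSession("internvit.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(7, 3, 448, 448).astype(np.float32)

t0 = time.time()
sess.run(None, {sess.get_inputs()[0].name: x})
print(f"ONNX Runtime CPU latency: {time.time() - t0:.1f}s")
```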
Expected Behavior:
I'm not intimately familiar with inference times for either ONNX or ExecuTorch, but I expected them to be within an acceptable performance margin of each other. I've already exported a Llama2-2B model with reasonable speed (TTFT 0.5 s + 30 tokens/s) on my Android phone, so I expected ViT-300M inference speed to be somewhat similar.
Actual Behavior:
onnx inference time on linux pc: 12s
vit executorch inference time on linux pc: 450s
vit executorch inference time on Android: 200s
Questions:
Is there any known performance regression in executorch compared to ONNX?
Are there any optimization techniques or configurations that can improve ViT ExecuTorch's performance?
I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.
Versions
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31
Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY