opencl: improve profiling #12442
lhez commented on Mar 18, 2025:
- Wait for profiling events and collect profiling data when model execution is done, so the displayed performance numbers are closer to the true performance.
- Generate a Chrome trace in addition to the CSV output.
- Populate profiling timing info at the end rather than after each kernel run.
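The deferred-collection idea above can be sketched roughly as follows. This is a hypothetical illustration, not the actual llama.cpp code: per-kernel timestamps are appended cheaply while the graph runs, and the Chrome trace (the `traceEvents` JSON format understood by `chrome://tracing`) is only generated once execution is done.

```python
# Hypothetical sketch of deferred profiling collection plus Chrome trace output.
# Names (ProfilingCollector, record, write_chrome_trace) are illustrative only.
import json

class ProfilingCollector:
    def __init__(self):
        self.events = []  # (kernel_name, start_us, end_us), filled as kernels run

    def record(self, name, start_us, end_us):
        # Cheap append during execution; no formatting work on the hot path.
        self.events.append((name, start_us, end_us))

    def write_chrome_trace(self, path):
        # Chrome's trace viewer expects "complete" events: ph="X", ts/dur in µs.
        trace = {
            "traceEvents": [
                {"name": n, "ph": "X", "ts": s, "dur": e - s, "pid": 0, "tid": 0}
                for (n, s, e) in self.events
            ]
        }
        with open(path, "w") as f:
            json.dump(trace, f)

collector = ProfilingCollector()
collector.record("mul_mat_f16", 100, 450)   # timestamps in microseconds
collector.record("add_f32", 460, 480)
collector.write_chrome_trace("trace.json")  # emitted once, after execution
```

In the real backend the timestamps would come from the OpenCL profiling events mentioned in the bullets above; the point of the sketch is only the structure: collect during execution, format at the end.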
Sorry to bother you: how can I mark a specific PR as ready for review? Thanks.
Looks good, and reminds me that I wanted to integrate the graph-profiler branch with the OpenCL and CPU backends.
If you start a draft PR, there is a way to mark it ready for review; the qnn-backend PR is not marked as a draft.
I've been keeping an eye on it. In general, I'd say QNN is not the right solution here, but I'll take another look.
@max-krasnyansky, thanks so much for your valuable guidance and course correction. I think I understand the third technical approach of "utilizing the Hexagon NPU maximally": the Hexagon DSP SDK should be used directly, which is similar to what your engineering team did with ggml-opencl, and also similar to what I did with video decoding hardware acceleration many years ago (that was also a DSP chip). My guess might not be correct, so I'd greatly appreciate it if you could give me and the llama.cpp community a clear explanation, or at least a rough confirmation, of the third approach.
Yes, you are absolutely correct, thanks so much. I'll try to test it on a rooted 8gen3 phone to get better NPU performance (closer to, or better than, the default ggml backend).

[Updated 03/21/2025, 23:54] Offloading to the cDSP directly is much faster than the QNN-CPU/QNN-GPU/QNN-NPU backends. GGML_OP_ADD now works fine on the cDSP and has been verified with llama-cli: the NPU performance is good. (video: ggml-hexagon-offload-to-cdsp-directly.mp4) I think this approach is very similar to ggml-opencl: we need to write some "kernels" on the cDSP and then offload specific ggml ops to the cDSP directly. One difficulty with this approach is that building the entire source tree is not easy, because one part runs on the AP (Arm CPU) and another part runs on the cDSP (NPU), whereas OpenCL provides a mature mechanism to manage and compile kernels. I hope mulmat will also work fine on the cDSP with good NPU performance tomorrow; then that PR can be reviewed formally.

[Updated 03/22/2025, 23:22] The source code has been submitted in that PR. There is an unknown issue with mulmat on the cDSP. Another issue in that PR: I haven't yet found a general approach to compile/build the Hexagon kernel functions, although a manual method works fine in my local dev environment.
[Updated 03/24/2025, 15:32] mulmat performance comparison between QNN-NPU and cDSP (screenshots of the individual test cases omitted): several cases were measured with the naive cDSP implementation, plus one case using the algorithm from ggml-dsp (a tiny customized ggml on the cDSP), whose compute result is not correct yet.

@max-krasnyansky, I think this test case verifies that what you said ("QNN is not the right solution here") is absolutely correct, assuming there is no misunderstanding or mistake in the experiment: we can clearly see that mulmat on the cDSP is much faster than mulmat with QNN-NPU. Unfortunately, this hand-written mulmat implementation on the cDSP is very naive (without any optimization) and only passes the unit-test cases in a self-made command-line program; it doesn't work with llama-cli yet. I think AI experts will have a chance to improve the mulmat algorithm on the cDSP if that PR is approved: this approach is exactly what you pointed out, and developers can do a lot on the cDSP rather than relying on the black box in the QNN SDK. I'd like to make a small contribution to Qualcomm even though I'm an independent programmer. Thanks so much!
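For readers unfamiliar with the term, the "naive algorithm" mentioned above presumably refers to the textbook triple-loop matrix multiply, with no tiling, vectorization, or HVX intrinsics. A minimal Python illustration (hypothetical; the real cDSP kernel would be written in C against the Hexagon SDK):

```python
# Hypothetical illustration of a "naive" (unoptimized triple-loop) mulmat.
# This is the baseline that optimized kernels (tiling, SIMD/HVX) improve on.
def naive_mulmat(a, b):
    # a: M x K, b: K x N, result: M x N, all row-major lists of lists
    M, K, N = len(a), len(b), len(b[0])
    c = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):        # innermost loop: K multiply-adds per cell
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c

result = naive_mulmat([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# result is [[19.0, 22.0], [43.0, 50.0]]
```

Even this baseline can outperform a heavyweight framework path when it avoids graph-construction and dispatch overhead, which is consistent with the cDSP-vs-QNN numbers reported above.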