Using AI hardware accelerators on modern computers and possibly smartphones #1658
Dampfinchen started this conversation in Ideas
Replies: 1 comment

- What's your opinion about SoC-specific accelerators from Qualcomm, Huawei, etc.?

Original post:
Hello,
I've been thinking about this for a while. Starting with AMD's Ryzen AI mobile processors and Intel's upcoming Meteor Lake VPUs, consumers will get broader access to built-in AI hardware that handles matrix multiplication faster and more efficiently than general-purpose CPU and GPU compute. This is especially useful on mobile devices: smartphones have shipped with NPUs of various kinds for quite a while. Nvidia brought AI acceleration to consumers back in 2018 with the Turing architecture, and AMD introduced it with RDNA 3.
I think it's worth pondering how these AI accelerators could be used for efficiency and speedups in llama.cpp inference, and possibly in training when the time comes.
Microsoft and Nvidia recently introduced Olive-optimized ONNX models for Stable Diffusion, which roughly double performance by using Tensor Cores (see https://blogs.nvidia.com/blog/2023/05/23/microsoft-build-nvidia-ai-windows-rtx/ and https://devblogs.microsoft.com/directx/dml-stable-diffusion/ for reference).
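For reference (this is not llama.cpp code), here is a minimal sketch of running an ONNX-exported model through ONNX Runtime's DirectML execution provider, which is the path the Olive-optimized pipelines build on. The "model.onnx" file name and the zero-filled input are placeholders, and it assumes the onnxruntime-directml package is installed:

```python
# Minimal sketch: ONNX Runtime inference via the DirectML execution provider.
# "model.onnx" is a hypothetical export; inputs here are dummy data.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # hypothetical ONNX export of the model
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DirectML first, CPU fallback
)

# Look up the first declared input and feed zeros of a matching shape
# (dynamic dimensions are filled with 1 for this sketch).
meta = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in meta.shape]
dummy = np.zeros(shape, dtype=np.float32)

outputs = session.run(None, {meta.name: dummy})
print(outputs[0].shape)
```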
Since GGML is heading towards a more GPU-accelerated approach, I wonder whether incorporating some of these optimizations into the GGML format could lead to nice speedups when using GPU layer offloading.
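For context, layer offloading is already exposed as a runtime option; a minimal sketch using the llama-cpp-python bindings follows (the model path and layer count are placeholders, and the parameter names are those of the bindings rather than the C API):

```python
# Minimal sketch of GPU layer offloading via the llama-cpp-python bindings.
# Model path and n_gpu_layers are illustrative; the right values depend on
# the model file and available VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # hypothetical GGML model file
    n_gpu_layers=32,  # number of transformer layers to offload to the GPU
    n_ctx=2048,       # context window
)

out = llm("Q: What is an NPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```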
But as far as I understand, llama.cpp is currently bottlenecked more by memory than by compute, so dedicated AI accelerators like Tensor Cores do not provide a speedup right now. This can be observed by comparing the performance of different architectures and by profiling with Nsight Systems: the Tensor Cores are active, but they do little because most of the calculations are done in FP32.
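A back-of-the-envelope way to see why generation is memory-bound: at batch size 1, each weight is read once per token but only used for two floating-point operations, so the arithmetic intensity is around 1 FLOP per byte for FP16 weights (half that for FP32), far below what a GPU needs to be compute-bound. A sketch with made-up but representative layer sizes:

```python
# Back-of-the-envelope sketch (illustrative numbers, not measurements):
# arithmetic intensity of a single-token matrix-vector multiply, where
# reading the weights dominates memory traffic.
def arithmetic_intensity(n_rows: int, n_cols: int, bytes_per_weight: int) -> float:
    flops = 2 * n_rows * n_cols                       # one multiply + one add per weight
    bytes_moved = n_rows * n_cols * bytes_per_weight  # the weight matrix is read once
    return flops / bytes_moved

# A 4096x4096 projection layer, roughly 7B-class.
print(f"FP16: ~{arithmetic_intensity(4096, 4096, 2):.1f} FLOPs per byte")
print(f"FP32: ~{arithmetic_intensity(4096, 4096, 4):.1f} FLOPs per byte")
# Modern GPUs need dozens of FLOPs per byte to become compute-bound, so at
# these ratios the tensor cores mostly sit waiting on memory.
```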
Since FP32 is also very memory-bandwidth-intensive compared to FP16, I wonder whether such high precision is really necessary for inference and whether it would be a good idea to run inference at half precision in general. That could reduce the memory bottleneck, shifting the limit from memory to compute, where AI accelerators like Tensor Cores could provide a noticeable speedup.
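To make the bandwidth argument concrete, here is a rough, assumption-laden estimate of the bandwidth-bound ceiling on token rate for a 7B model at FP32 versus FP16; the 500 GB/s figure is a hypothetical GPU, not a measurement:

```python
# Rough estimate: if every weight must be read once per generated token,
# tokens/s is capped by memory bandwidth divided by the model's size in
# the chosen precision. All numbers below are illustrative.
PARAMS = 7e9          # 7B-parameter model
BANDWIDTH = 500e9     # 500 GB/s memory bandwidth (hypothetical GPU)

for name, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
    model_bytes = PARAMS * bytes_per_param
    ceiling = BANDWIDTH / model_bytes
    print(f"{name}: ~{ceiling:.0f} tokens/s upper bound")
# Halving the precision halves the bytes moved per token and doubles the
# ceiling; only once that ceiling is high enough does raw matmul throughput
# (e.g. tensor cores) become the limiting factor.
```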
What are your thoughts on this? I am very interested to hear your ideas for leveraging modern hardware to its full extent!