Rockchip RK3588 perf #722
Thanks for the info.
I wonder if it could be made faster by making sure the model is in RAM. Maybe see if subsequent runs are faster once the model is cached?
Thanks for posting this. Just as a heads up, the RK3588 does have NPU units on it, but these are not leveraged by the llama.cpp codebase (at time of writing). If other devs are interested, the NPU API can be found in this file: https://github.com/rockchip-linux/rknpu2/blob/master/runtime/RK3588/Linux/librknn_api/include/rknn_api.h Note: I'm sure I've read somewhere that INT4 tensors should be supported, but I cannot see them in that API. Also, I believe the model might have to be converted to a specific RK3588 format (toolkit link in the root README.md)? I did actually expect far better performance with a 7B model even on the CPUs only, though. I notice this is an 8GB RK3588, so maybe a lot of memory swapping was happening that slowed it down. I don't have any boards with an RK3588 yet, but if I manage to get one, I'll try to do some testing on my side. They might make great little units for running a dedicated assistant if this can be optimized well.
If there is a specific test you want me to run, let me know! I don't have any swap configured, regrettably. But what could easily have happened is that, because this was running literally alongside my homeserver stuff, memory management on the kernel side got quite hectic. :) Also, llama.cpp has improved a lot since last time - so I might just rerun the test to see what happens. Also, Vicuna and StableLM are a thing now. Might as well give it a shot... that said, I'd have to think of a good way to gather the output into a nice table structure, because I don't want to flood this ticket, or anyone else, with a crapton of redundant output. xD That all said, there is one more thing:
Thanks to RockChip's - at least in my experience - rather spotty documentation, I couldn't figure out whether these messages were relevant or not. It'd actually be interesting to see INT4 on this, though.
I did a quick test with this on an Orange Pi 5 16GB using a 7B Q5_1 model. My setup is a bit clunky, so I don't have a proper benchmark (will re-run and edit in next week when I'm set up better), but I'd estimate performance at around 1 token/sec. This was using 7 threads. The heatsink became pretty hot to the touch - I suspect the slower performance above might've been due to either a) memory constraints or b) thermal throttling. Would love to see how well this could run if leveraging the NPU, but I don't think the RK SDK supports INT4 quantization yet. Basically, the RK process is that the models have to be converted into an RK-compatible format using their SDKs, so the quantization probably won't be great using that approach. I haven't looked into whether the RK API is low-level enough that it might be able to support running GGML models yet, but that'd probably work better than using whatever quantization process the RK SDK may eventually support.
I tinkered around a bit more with this last night. I was able to get around 500ms/token using 4 threads on a 7B Q5_1. I also played around with the new OpenCL implementation (using CLBlast), but this was significantly slower if I transfer all layers to the GPU (> 1s/token). I don't have time to thoroughly investigate, but looking at the GGML OpenCL implementation, I suspect a lot of the slowdown might be how memory is handled. In the OpenCL implementation, it looks like the tensors are copied to the GPU rather than passed as a pointer to host memory (I noticed some loops in there that do this). This makes sense for discrete GPUs (as they have their own VRAM), but probably results in unnecessary copy ops for devices with shared RAM/VRAM like the RK3588 (and AMD APUs, for that matter). I believe there are flags that can be used to simply point OpenCL at host memory, but I'm unsure whether that would be compatible with the GGML tensor format. Might be a worthy optimization to consider though, if it would also speed up inference on AMD APUs. Side note: I have tiny heatsinks on my Orange Pi 5. These get quite hot, and I notice inference time slows down quite a bit as they heat up, so I assume the device gets underclocked to maintain safe temperatures.
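For reference, and purely as a sketch of the idea rather than a claim about how the current GGML OpenCL backend is structured: on a shared-memory device, the zero-copy route in plain OpenCL is to create the buffer with CL_MEM_USE_HOST_PTR instead of copying into a fresh device allocation. Many drivers only avoid the copy when the host pointer meets their alignment requirements, so whether this actually helps on Mali would need testing.

```c
/* Sketch: wrap an existing host-side tensor in an OpenCL buffer without an
 * explicit copy. CL_MEM_USE_HOST_PTR asks the driver to use the host
 * allocation directly, which on iGPUs with shared RAM can avoid a transfer. */
#include <CL/cl.h>
#include <stddef.h>

cl_mem wrap_host_tensor(cl_context ctx, float *host_data, size_t n_floats) {
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                n_floats * sizeof(float),
                                host_data, /* must stay valid and unmodified while the buffer is in use */
                                &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}
```

Whether this plays nicely with GGML's tensor layout (and its quantized blocks) is exactly the open question above.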
if i had a 3588 i'd totally be down to fuck around with this, can anyone point me to a relatively-cheap 3588 dev board? edit: 8GB if possible
I probably can't recommend a specific board, sorry - I haven't priced them out. Just want to add to this though: the guy who's been doing a lot of the work on the llama.cpp GPU implementations isn't sure whether optimizations to the OpenCL code will yield much benefit for boards like this. He posted the following graph yesterday indicating that the big bottleneck appears to be memory.
@spv420 Here are some links - I have an Orange Pi 5 & 5B, and plan on purchasing a NanoPC-T6 & Orange Pi 5 Plus as well: NanoPC T6
Not sure if this helps the discussion. I made a fork that supports the RK3588 NPU via the matrix multiplication API. Unfortunately it is not faster than just using the CPU, and it generates questionable output due to running in INT8 mode (FP16 is too slow). Feel free to contribute and see if anyone can work around the accuracy issue. I have a prototype that gets up to 10% faster by chunking operations, but it's complicated, and I feel it's not worth the work if all I'm able to get is hallucinating output. I'd love to upstream the code. Please contribute if you are also interested in the subject: https://github.com/marty1885/llama.cpp/tree/rknpu2-backend
Thanks for this! I looked into it at one point too, but I think the bottleneck will be the RAM speed on the Pi 5? This approach might still be able to speed up prompt ingestion substantially, though. Do you know if using the NPU reduces power consumption? I'm an idiot and installed a tiny heatsink on my Pi 5, so it throttles very quickly. Will try and give your fork a go next week when I get some time.
No, the NPU on the RK3588 is really, really bad at matrix multiplication. It's designed for vision models and thus focused on convolution; it has pretty low FLOPS when doing matrix multiplication.
Maybe, but the inaccuracy is quite significant. I am not sure what'll happen.
I think it can, but not with my backend in its current state. My backend only uses 1 thread out of all the threads given by GGML, and GGML will spin the non-working threads. It's a design flaw in GGML itself and needs a major refactor. I can't just use 1 thread either: some matrices are too large to fit on the NPU. It's possible to split the work and distribute it across the NPU cores, but it's too much work for little gain (as the model is hallucinating constantly). As for compiling and running my fork: I don't recommend running more than 13 layers of a 7B model on the NPU - it starts going crazy afterwards. I develop with 10.
Also you need a
Great thread! I have a Firefly RK3588S board, so it would be great to try this out. I don't have much hope for the NPU, but am wondering if offloading matrix multiplications to the Arm Mali GPU via the Arm Compute Library might be worthwhile? Any thoughts?
@prusnak I tried something similar with GGML's OpenCL backend a while back. I modified it enough to get RWKV (not llama) running on the Mali GPU. It has many problems, mainly:
ACL can work, but I question whether it'll be helpful. GGML pre-transposes matrix B in its matrix multiplication. Good luck - I'd love to see more LLMs on the edge.
====
For anyone interested, a progress update on my side: with RKNPU2 1.6.0, it almost makes sense to use the NPU. I'm less than 10% away from being faster than the CPU in INT8 mode with just 1 NPU core. Next step is to debug non-square matrix multiplication - something somewhere is wrong. I won't update every step here; please either follow my fork or check my blog from time to time.
@marty1885 Your work is very interesting. Have you considered running Whisper models on the NPU? They could be better suited, as the models are much smaller compared to 7B LLMs, and it would immediately have various real-world applications.
@ggerganov Thanks - already done by other people. https://github.com/usefulsensors/useful-transformers runs Whisper on the NPU. They are able to do much more extensive optimizations compared to GGML, though. The NPU demands a custom matrix layout for maximal performance, and they are able to eliminate the majority of layout conversions by abstracting them away. Actually, good idea - I can try targeting my work against whisper.cpp. Do you know any use cases for it? And what would be the process to upstream an entire new backend?
From a quick look at this repo, it looks like they use the NPU just for the matrix multiplications. All other operations, such as convolutions, softmax, layernorm, etc. are on the CPU. Does the NPU API allow implementing all the other ops, or is it limited just to matrix multiplications? The reason I'm wondering is that Still, if it is not possible for the NPU to do general computations, then we can perform just the heavy matrix operations in the Whisper Encoder in a similar way to how we currently use BLAS. I think you've already prototyped this to a good extent in your fork. Some of the smaller matrix multiplications should probably remain on the CPU - needs experimentation. I don't see a way around reshuffling the tensor data to fit the NPU layout. This will be some overhead that the NPU backend implementation would have to perform on the input and output data. As long as the changes are contained as much as possible in
For now it is limited to only matrix multiplications. Softmax, convolution, etc. are locked behind their ONNX compiler, which is not open source. Yeah, reordering is a major performance bottleneck right now. I hope the vendor can solve this or at least mitigate it largely. I hope future chip designers can make data layout easy and expose more low-level APIs. I'll submit a PR if I make it useful or a new SDK solves the current problems.
@marty1885 I'm in the midst of trying to reverse engineer parts of the RK3588 NPU, as I'm keen to understand how the matrix multiplication is handled by the NPU to see if it could be optimised/open sourced. From your testing with fp16, do you have any insight into how large the matrices get for llama 7b? I'm assuming they can't be larger than [512x512] x [512x512], as that would already require 0.5MB of memory for the output of a single operation.
@mtx512 The regular matrix multiplications on encoder/decoder weights are more like GEMV instead of GEMM. They basically have the following shapes (note that in GGML's source code
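To make the GEMV point concrete: during single-token generation the activation side is just one vector, so each weight element is loaded once and used once per token. A plain-C sketch, with purely illustrative dimensions:

```c
#include <stddef.h>

/* y[M] = W[M x K] * x[K]. Every element of W is read exactly once per call,
 * so when this runs once per generated token, throughput is bounded by how
 * fast W can be streamed from RAM rather than by arithmetic. */
void gemv_f32(int M, int K, const float *W, const float *x, float *y) {
    for (int m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += W[(size_t)m * K + k] * x[k];
        }
        y[m] = acc;
    }
}
```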
Good luck! Hope you find success.
I doubt the NPU can actually run MatMul "natively" with matrix sizes >= 256x256 (for ONNX models, a MatMul with size equal to or larger than 256x256 cannot run on the NPU!).
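If a similar per-dimension limit applies to the matmul API (an assumption on my part), large multiplications would have to be tiled into sub-256 pieces and the partial results accumulated - presumably related to the "chunking" prototype mentioned earlier in the thread. A plain-C illustration of the splitting, ignoring the NPU's custom layouts entirely:

```c
#include <stddef.h>
#include <string.h>

#define TILE 256 /* assumed per-dimension limit */

/* C[M x N] = A[M x K] * B[K x N], computed as a sum over K-tiles so that no
 * single sub-multiplication is wider than TILE along K. M and N could be
 * split the same way; each tile would then be small enough for the NPU, with
 * the partial products accumulated afterwards. */
void matmul_k_tiled(int M, int K, int N, const float *A, const float *B, float *C) {
    memset(C, 0, (size_t)M * N * sizeof(float));
    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int k1 = (k0 + TILE < K) ? k0 + TILE : K;
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (int k = k0; k < k1; ++k) {
                    acc += A[(size_t)m * K + k] * B[(size_t)k * N + n];
                }
                C[(size_t)m * N + n] += acc;
            }
        }
    }
}
```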
TVM has better support for the Mali GPU with OpenCL - see the MLC-LLM project. I have also tried running some other small models (ones that cannot run effectively on the NPU) on the GPU, and it performs pretty well.
The RKNPU2 memory allocation size limit issue has been resolved in my fork by happyme531@eaf7a15
@happyme531 Looks like you are right. The 1.6.0 SDK does state that the product of the channels cannot be >= 65532. Maybe this is the reason? They forgot to document this limitation for the matmul API? (For the people in this thread who can't read Chinese: trust me.) I've merged your fix into my fork.
RK3588 NPU data pointers are limited to bits 31:0 (based on the TRM), hence the 4GB limit. Curious why you think it can be larger?
Honestly, I did not know about this limit when writing the fix. No document ever mentioned it, and the resulting code runs smoothly without a single error (except the output quality issue, which has many potential causes).
The RKNN docs mention zero-copy APIs; for these, the memory has to be compatible with the NPU, so for the RK3588 this would be a 32-bit address in physical memory. If you're providing a physical address over 4GB, I'd suspect it just truncates it to 32 bits, so it ends up using a random location. If you provide a virtual address, then it has to copy the data to a physical location in the 32-bit range, hence the performance drop.
Are we certain there is a constraint on a 32-bit PHYSICAL memory address? Looking at the RK NPU API here: ... the physical address is defined as a ... Also, regarding the FP16 constraint: is this a hardware limitation? In theory, it looks like it should be able to support 8-bit. I've yet to play with any of this though, so take the above with a grain of salt. EDIT: Looking at that structure a bit deeper, it looks like there is a 32-bit constraint on the tensors themselves. But, if these do not have to sit in (or be copied to) the first 4GB of physical memory, might it be possible - given that memory is shared - to take an approach where we process with the NPU a layer at a time?
It's both. GGML doesn't natively do quantized inference. "Quantization" to GGML means compressing the weights, decompressing them on the fly, and keeping them in cache. The decompressed result is still floating point, and GGML does all its math in floating point (FP32 on CPU and optionally FP16 on GPU). Meanwhile, the NPU expects both matrices to be the same type - both FP16 or both INT8. I tried converting both the weights and the input into fixed point (INT8). It seems the network needs more accuracy than 8 bits, or else it goes crazy when too many layers run at this very limited accuracy. It would be perfect if RKNN could support weights in INT8/INT4 fixed point but keep inputs in FP16, but I doubt that, since the NPU is more like a fixed-pipeline GPU from the old days.
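For anyone unfamiliar with what "decompress on the fly" means here, a simplified sketch of the idea (this is not GGML's actual block format, just the general shape of it): weights are stored as small integer blocks plus a per-block scale and expanded back to float right before the multiply, so the math itself stays floating point.

```c
#include <stdint.h>

#define QBLOCK 32

/* Simplified block quantization: 32 int8 weights share one float scale.
 * Storage is ~1 byte per weight, but the values fed to the mat-vec kernel
 * are floats again, which is why the NPU's "both operands the same type"
 * requirement is awkward to satisfy. */
typedef struct {
    float  scale;
    int8_t q[QBLOCK];
} qblock_t;

void dequantize_block(const qblock_t *b, float *out) {
    for (int i = 0; i < QBLOCK; ++i) {
        out[i] = b->scale * (float)b->q[i];
    }
}
```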
Is there any comparison in speed (tokens/second) between RKLLM and this version of llama.cpp with the NPU enabled on the RK3588, on the same quantized model, like Phi-3 Mini?
@vincenzodentamaro Never tested, but I assume RKLLM is much faster. My backend was an experiment and never well optimized. Plus, Rockchip has low-level access, while I can only use their MatMul API, etc.
Thank you for the answer @marty1885. I might try to integrate the open-source reverse-engineered NPU driver from https://blog.tomeuvizoso.net/search/label/rk3588
@vincenzodentamaro The OSS driver is yet to be documented (documentation is critical, as the user-space control ties very deeply into how the NPU hardware works). I contacted the author 2 weeks ago; he is busy with personal matters and will write the docs afterwards. Currently the Mesa code is the only documentation we have, and I'm not going to read those thousands of lines of magic. Please be patient while things progress. I too want the NPU to be useful.
Slightly unrelated to this very topic, but a new small AI board dropped recently: http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-AIpro(20t).html with a significant performance boost compared to the RK3588 (20 TOPS compared to 6 TOPS), while retaining similarity in the way the NPU acceleration is done, and a similar price as well. The RK3588 seems to be starting to show its age...
I couldn't find any SDK or open-source code to make its NPU work.
I suspect it might still be a bit constrained w.r.t. LLMs too: it's LPDDR4X. The RK3588 has some boards available (Orange Pi 5 Max and CM3588 Pro) that are LPDDR5. Given that generation is mostly I/O bound, I think that RAM bandwidth might have more bearing on performance? It could probably work well for Stable Diffusion though, as I think that's more compute-bound? If we end up with a GGML backend for the Rockchip NPU eventually, I would be very keen to see how it performs with SD. Vulkan on an AMD 5600G APU yielded a > 50% performance improvement over CPU for me with stable-diffusion.cpp ( leejet/stable-diffusion.cpp#291 (comment) )
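As a rough back-of-the-envelope check on the "I/O bound" point (every number below is an assumption for illustration, not a measurement of any specific board): generation has to stream essentially the whole weight file from RAM for every token, so bandwidth divided by model size gives a hard ceiling on tokens per second.

```c
#include <stdio.h>

/* Upper bound on generation speed: tokens/s <= memory bandwidth / bytes read
 * per token. Assumed values: ~4 GB for a quantized 7B model, ~17 GB/s of
 * effective bandwidth. Prints roughly 4.3 tokens/s as the ceiling. */
int main(void) {
    const double model_bytes   = 4.0e9;
    const double bandwidth_bps = 17.0e9;
    printf("upper bound: %.1f tokens/s\n", bandwidth_bps / model_bytes);
    return 0;
}
```

If that reasoning holds, wider or faster RAM should matter more for decode speed than raw NPU TOPS.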
I just want to make a correction - it is LPDDR4X, but based on pictures, it looks like it's tri-channel (three RAM chips). Also, I did some research - it's using a Huawei AI chip (same as "Ascend", as I gather), which means it should be compatible with CANN (which looks to already have a GGML backend?). This model is unavailable outside of China right now, though. I suspect that has something to do with it being a Huawei chip.
You are correct. Also, the images it comes with only have mirrors in China, so downloading anything goes through the GFW and is very slow.
Hello, I want to run my own LLM (linear attention) on the RK3588 with the NPU. I noticed that rknn-llm provides very few interfaces (most of the code has been encapsulated into .so files, so I feel like it might be almost impossible to adapt a model to it). Would it be better to make modifications directly on your fork (llama.cpp) instead? Or do you have a better idea? Thank you!
@guoguo1314 See my above comment
Emmm, first of all, thank you for your reply. I'm new to the RK3588, so I have a lot of basic questions - please don't find it troublesome, haha. I've already run your forked code with llama-7b-4bit, and I've read through the discussions above. However, I still have some questions: I have doubts about whether rknn-llm can adapt to my model, because most of the critical code is encapsulated in .so files, making it almost impossible to adapt the model (I need to confirm this, as I'm afraid it might be adaptable but I just haven't tried). If it's not possible, I'll try modifying your forked code to adapt it to my model.
RKLLM is a compiler-runtime architecture. Rockchip has a track record of being bad at software - their compiler can't compile most models either. It's not about them shipping a closed-source blob; it's that their compiler doesn't work in most cases. The only thing we outsiders can do is wait for Rockchip to fix their code. With that said, we can't progress on my open-source RK3588 backend either. The official RKNPU2 runtime has limitations (matmul only, no low-level access), while the open-source driver Tomeu wrote is not documented (Tomeu is busy at his job right now). Having the source code of the driver is not sufficient in this case; we also need to understand how to issue commands and the format of the commands the NPU uses. There's little that can be done at this stage unless you want to read the code in the Mesa NPU backend that Tomeu wrote and understand how to use the driver that way... To me the ROI is way too low. I'd wait for Tomeu to finish the documentation.
Thank you for your answer; I will continue the discussion if I have further questions.
I think we are already able to make an RK3588 LLM inference program better than RKLLM.
But still, there are problems:
Hello! I have the following questions. As shown in the code below, when using GGML_USE_RKNPU2, the backend selected is GGML_BACKEND_CPU or GGML_BACKEND_GPU, but the NPU is never chosen as the backend - or rather, how is NPU acceleration being utilized?
Then, I used your ggml-rknpu2.c to load part of the matrix multiplication computation onto the NPU in rwkv.cpp. In this part of the code in rwkv.cpp/rwkv.cpp:
What backend should be chosen here?
@guoguo1314 In my fork, CMake adds the flag. TBH, I have considered porting the RKNPU2 code into rwkv.cpp, but rwkv.cpp has been stuck on the pre-GGUF version of GGML, so there's no proper backend framework in place.
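For the "how is the NPU actually used" question: in pre-GGUF GGML there is no NPU value in the backend enum, so acceleration of this kind is typically wired in as a hook inside the CPU mul_mat path that asks the accelerator whether it can take the node and falls back to the CPU kernel otherwise. The snippet below only illustrates that dispatch pattern - all names in it are hypothetical, not the actual symbols in my fork.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical tensor and hook names, for illustration only. */
typedef struct { int rows, cols; const float *data; } tensor_t;

static bool npu_can_mul_mat(const tensor_t *w, const tensor_t *x) {
    (void)x;
    /* e.g. only offload weights that were pre-converted to the NPU layout */
    return w->rows % 32 == 0 && w->cols % 32 == 0;
}

static void npu_mul_mat(const tensor_t *w, const tensor_t *x, float *dst) {
    (void)w; (void)x; (void)dst;
    printf("offloaded to the NPU\n"); /* stand-in for the real matmul call */
}

static void cpu_mul_mat(const tensor_t *w, const tensor_t *x, float *dst) {
    (void)w; (void)x; (void)dst;
    printf("computed on the CPU\n");
}

void compute_mul_mat(const tensor_t *w, const tensor_t *x, float *dst) {
    if (npu_can_mul_mat(w, x)) {
        npu_mul_mat(w, x, dst);
        return;
    }
    cpu_mul_mat(w, x, dst);
}
```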
This is why I'm waiting for the documentation of the FOSS driver. That solves these 2 problems right away.
No worries. I've been writing a backend for Tenstorrent's Metalium framework against up-to-date GGML. Porting to the RK3588 will be a complete rebuild, but I know what to do.
This is my major concern. Relayout is really slow - so slow that it might not be faster unless we can map most, if not all, operators onto the NPU. However, my experience from building the Metalium backend tells me that GGML really wants the tensors in row-major format. Currently all backends (including the NCNN one!) and the frontend code assume view and reshape are practically free, which is only true under row-major.
Hi all, please forgive me if this is a naive question, but I recently noticed that the RK3588 datasheet lists it as supporting a quad-channel memory configuration: https://www.cnx-software.com/pdf/Rockchip%C2%A0RK3588%C2%A0Datasheet%C2%A0V0.1-20210727.pdf Given that much of the bottleneck with LLMs is memory bandwidth, does this suggest that an RK3588 SBC could "potentially" integrate four DDR5 RAM chips, thereby giving us up to 4x the bandwidth of a single DDR5 chip? Or is there another constraint on the RK3588 somewhere that would prevent this? I have searched around and, though I've found what look like dual-channel DDR4 SBCs (e.g. OPi5), I don't think I've seen any that are quad-channel (or dual-channel DDR5). If this is possible though, the RK3588 (if its NPU were well supported) might make a better "local assistant" than I first thought.
http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-5-Pro.html Searching for the code 'D8CSZ' marked on this board's DRAM chip at https://www.micron.com/sales-support/design-tools/fbga-parts-decoder gives the part number
Even more information about the official RKLLM package: do you know there is a function in Python called help()?

```python
import rkllm.base.common
help(rkllm.base.common)
```
@marty1885 There was an update two weeks ago from these guys: https://github.com/airockchip/rknn-toolkit2/releases/tag/v2.2.0
I came back to this, as I was looking at what the situation is nowadays. Armbian does not seem to compile the required rknpu driver natively into their kernels, which is a bit unfortunate, and there also does not seem to be a DKMS package. However, this is looking pretty neat regardless. I have a Radxa ITX board now, so I'll see what I can get done. Thank you for all the links, genuinely impressive!
@IngwiePhoenix Yeah, the driver situation isn't ideal, and I would like to run the NPU on more recent kernels too. On my Armbian Orange Pi 5 I'm running Linux kernel 6.11.6 with Armbian patches. I gave the DKMS driver a shot, but it doesn't fully compile yet: https://github.com/bmilde/rknpu-driver-dkms Please contact me if you can help. There are a few Rockchip headers (and Rockchip-specific functionality) that the current Armbian kernel doesn't contain. Maybe these can be copied into the DKMS driver so that it is self-contained.
The RKNN toolkit added Arm support. Also, here are the numbers I got in llama.cpp 9 months ago, CPU only, on the Rock 5B running DietPi OS: 3B Q4_K_M = 6.8 t/s
Sounds nice, but the problem with Rockchip is that they refuse to open-source it, and basically the community is tired of waiting months for them to catch up to the latest updates.
Can't wait to see where things will go once mainline support is in. I've been playing with this for a few days and it's pretty fast with the NPU!
Just did a very simple run with llama-7b-4bit. It... took a while. Had it run in a screen. But, it worked!
Model was loaded from external microSD via internal bus.
I'm quite amazed this worked at all, honestly.
CPU info in detail: (`/proc/cpuinfo` doesn't give any more useful details here, sadly.) Hardware is a FriendlyElec NanoPi R6S.