
Rockchip RK3588 perf #722

Closed · IngwiePhoenix opened this issue Apr 2, 2023 · 103 comments

Comments

@IngwiePhoenix

Just did a very simple run with llama-7b-4bit. It... took a while. Had it run in a screen. But, it worked!

root@FriendlyWrt /s/o/llama.cpp (master)# time ./main --color -m models/ggml-model-q4_0.bin -p "Hello there!"
main: seed = 1680443840
llama_model_load: loading model from 'models/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'models/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0


 Hello there! I am a freelance illustrator based in New Zealand. I grew up with an appreciation for the natural world, which has inspired me to create my work through observation and playful experimentation.
My focus is on watercolour painting (in particular), as well as digital art & animation. My style is bright & bold, vibrant, dynamic & colourful - I love animals!
I am always keen to collaborate with other artists/creatives, so if you are interested in working together please feel free to drop me a line. [end of text]

llama_print_timings:        load time = 93487.23 ms
llama_print_timings:      sample time =   704.72 ms /   115 runs   (    6.13 ms per run)
llama_print_timings: prompt eval time = 92466.10 ms /     4 tokens (23116.52 ms per token)
llama_print_timings:        eval time = 11195694.23 ms /   114 runs   (98207.84 ms per run)
llama_print_timings:       total time = 11289895.19 ms

________________________________________________________
Executed in  188.18 mins    fish           external
   usr time  324.60 mins    0.00 millis  324.60 mins
   sys time   11.70 mins    1.70 millis   11.70 mins

Model was loaded from external microSD via internal bus.

I'm quite amazed this worked at all, honestly.
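A back-of-envelope conversion of the timings above into tokens per second (a minimal sketch; the constants are just the numbers from llama_print_timings):

// Rough token-rate math from the llama_print_timings output above.
#include <cstdio>

int main() {
    const double eval_ms   = 11195694.23; // eval time reported above
    const double eval_runs = 114.0;       // tokens generated
    const double ms_per_token = eval_ms / eval_runs;   // ~98208 ms/token
    const double tokens_per_s = 1000.0 / ms_per_token; // ~0.0102 tokens/s
    std::printf("%.0f ms/token -> %.4f tokens/s\n", ms_per_token, tokens_per_s);
    return 0;
}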

CPU Info in detail:

# lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              ARM
  Model name:           Cortex-A55
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r2p0
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1800.0000
    CPU min MHz:        408.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
  Model name:           Cortex-A76
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          2
    Stepping:           r4p0
    CPU(s) scaling MHz: 68%
    CPU max MHz:        2352.0000
    CPU min MHz:        408.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d:                  384 KiB (8 instances)
  L1i:                  384 KiB (8 instances)
  L2:                   2.5 MiB (8 instances)
  L3:                   3 MiB (1 instance)
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Vulnerable: Unprivileged eBPF enabled
  Srbds:                Not affected
  Tsx async abort:      Not affected

(/proc/cpuinfo doesn't give any more useful details here, sadly.)

Hardware is a FriendlyElec NanoPi R6s

@prusnak
Collaborator

prusnak commented Apr 2, 2023

Thanks for the info

@prusnak closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 2, 2023
@spv420

spv420 commented Apr 3, 2023

i wonder if it could be made faster by making sure the model is in RAM

maybe see if subsequent runs are faster once the model is cached?
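One low-effort way to check that (a sketch, assuming the same model path as the run above): pre-fault the file into the page cache before timing, or simply run twice and compare.

// Sketch: pre-fault a model file into the page cache so a subsequent
// llama.cpp run doesn't pay microSD read latency on every page.
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <cstdio>

int main() {
    const char *path = "models/ggml-model-q4_0.bin"; // assumed path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st{};
    fstat(fd, &st);
    // Hint the kernel to start reading the whole file into RAM.
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    // Force it by actually touching every byte once.
    char buf[1 << 20];
    while (read(fd, buf, sizeof buf) > 0) {}
    close(fd);
    std::printf("warmed %lld bytes\n", (long long)st.st_size);
    return 0;
}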

@jimtendo

Thanks for posting this.

Just as a heads up, the RK3588 does have NPU units on it but these are not leveraged with the llama.cpp codebase (at time of writing). If other devs are interested, the NPU API for this can be found in this file: https://github.com/rockchip-linux/rknpu2/blob/master/runtime/RK3588/Linux/librknn_api/include/rknn_api.h

Note: I'm sure I've read somewhere that INT4 tensors should be supported, but I cannot see them in that API. Also, I believe the model might have to be converted to a specific RK3588 format (toolkit link in the root README.md)?

I did actually expect far better performance with a 7B model, even on the CPUs only. I notice this is an 8GB RK3588, so maybe there was a lot of memory swapping happening that slowed it down.

I don't have any chips with RK3588 yet, but if I manage to get one, I'll try to do some testing on my side. Might make great little units for running a dedicated assistant on if it can be optimized well.

@IngwiePhoenix
Author

If there is a specific test you want me to run, let me know!

I don't have any swap configured, regrettably. But what could easily have happened is that, because this was running literally alongside my homeserver stuff, memory management on the kernel side got quite hectic. :)

Also, llama.cpp has improved a lot since last time - so I might just rerun the test, to see what happens. Also, Vicuna and StableLM are a thing now. Might as well give it a shot... that said, I'd have to think of a good way to gather the output into a nice table structure - because I don't want to flood this ticket, or anyone else, with a crapton of redundant output. xD

That all said, there is one more thing:

# dmesg -l err | grep -i npu
[    3.702909] RKNPU fdab0000.npu: can't request region for resource [mem 0xfdab0000-0xfdabffff]
[    3.702953] RKNPU fdab0000.npu: can't request region for resource [mem 0xfdac0000-0xfdacffff]
[    3.702978] RKNPU fdab0000.npu: can't request region for resource [mem 0xfdad0000-0xfdadffff]
[    3.707178] debugfs: Directory 'fdab0000.npu-rknpu' with parent 'vdd_npu_s0' already present!
[    3.729270] RKNPU fdab0000.npu: failed to find power_model node
[    3.729289] RKNPU fdab0000.npu: RKNPU: failed to initialize power model
[    3.729297] RKNPU fdab0000.npu: RKNPU: failed to get dynamic-coefficient

Thanks to RockChip's - at least in my experience - rather spotty documentation, I couldn't figure out if these messages were relevant or not. Though it'd actually be interesting to see INT4 on this.

@jimtendo

I did a quick test with this on an Orange Pi 5 16GB using a 7B Q5_1 model. My setup is a bit clunky, so I don't have a proper benchmark (will re-run and edit next week when I'm set up better), but I'd estimate performance at almost 1 token/sec. This was using 7 threads. The heatsink became pretty hot to the touch - I suspect the slower performance above might've been due to either a) memory constraints or b) thermal throttling.

Would love to see how well this could run if leveraging the NPU, but I don't think the RK SDK supports INT4 quant yet. Basically, the RK process is that the models have to be converted into an RK-compatible format using their SDKs, so the quantization probably won't be great using that approach.

I haven't looked into whether the RK API is low-level enough that it might be able to support running GGML models yet, but that'd probably work better than using whatever quantization process the RK SDK may eventually support.

@jimtendo

jimtendo commented May 24, 2023

I tinkered around a bit more with this last night.

I was able to get around 500ms/token using 4 threads on a 7B Q5_1.

I also played around with the new OpenCL implementation (using CLBlast), but this was significantly slower if I transferred all layers to the GPU (>1s/token). I don't have time to thoroughly investigate, but looking at the GGML OpenCL implementation, I suspect a lot of the slowdown might be how memory is handled.

In the OpenCL implementation, it looks like the tensors might be copied to the GPU as opposed to using a pointer to host memory (I noticed some loops in there that do this). This makes sense for non-iGPUs (as they have their own VRAM), but probably results in unnecessary copy ops for devices with shared RAM/VRAM like the RK3588 (and AMD APUs for that matter). I believe there are flags that can be used to simply point OpenCL to host memory, but I'm unsure whether that would be compatible with the GGML tensor format. Might be a worthy optimization to consider though if it would speed up inference on AMD APUs also.
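For reference, the flags alluded to are likely CL_MEM_USE_HOST_PTR / CL_MEM_COPY_HOST_PTR; a minimal sketch of the two buffer-creation strategies (not llama.cpp's actual code, and whether the driver truly avoids the copy is implementation-dependent):

// Sketch: copied vs. zero-copy buffer creation in OpenCL. On iGPUs with
// shared RAM (Mali, AMD APUs) the second form can avoid a host->device copy,
// subject to the driver's alignment rules.
#include <CL/cl.h>

cl_mem make_copied_buffer(cl_context ctx, void *host_data, size_t bytes) {
    // Driver allocates its own memory and copies host_data into it.
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          bytes, host_data, nullptr);
}

cl_mem make_host_ptr_buffer(cl_context ctx, void *host_data, size_t bytes) {
    // Driver is asked to use host_data in place instead of copying it.
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          bytes, host_data, nullptr);
}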

Side-note: I have tiny heatsinks on my Orange Pi 5. These get quite hot, and I notice inference time slows down quite a bit as they heat up, so I assume the device gets underclocked to maintain safe temperatures.

@spv420

spv420 commented May 26, 2023

if i had a 3588 i'd totally be down to fuck around with this, can anyone point me to a relatively-cheap 3588 dev board?

edit: 8GB if possible

@jimtendo

I probably can't recommend a specific board, sorry. I haven't priced them out.

Just want to add to this though - the guy that's been doing a lot of the work on the llama.cpp GPU implementations isn't sure if optimizations to the OpenCL code will yield that much benefit for boards like this. He posted the following graph yesterday indicating that the big bottleneck appears to be memory.

[image: graph showing memory bandwidth as the main bottleneck]

@marty1885

Not sure if this helps the discussion. I made a fork that supports the RK3588 NPU via the matrix multiplication API. Unfortunately it is not faster than just using the CPU, and it generates questionable output due to running in int8 mode (FP16 is too slow).

Feel free to contribute, and see if anyone can work around the accuracy issue. I have a prototype that gets up to 10% faster by chunking operations. But it's complicated, and I feel it's not worth the work if all I'm able to get is hallucinating outputs.

I'd love to upstream the code. Please contribute if you are also interested in the subject.

https://github.com/marty1885/llama.cpp/tree/rknpu2-backend

@jimtendo

Not sure if this helps the discussion. I made a fork that supports the RK3588 NPU via the matrix multiplication API. […]

Thanks for this! I looked into it at one point too, but I think the bottleneck will be the RAM speed on the Pi 5? This approach might still be able to speed up prompt ingestion substantially though.

Do you know if using the NPU reduces power consumption? I'm an idiot and installed a tiny heatsink on my Pi 5, so it throttles very quickly.

Will try and give your fork a go next week when I get some time.

@marty1885

marty1885 commented Oct 26, 2023

but I think the bottleneck will be the RAM speed on the Pi 5?

No, the NPU on the RK3588 is really, really bad at matrix multiplication. It's designed for vision models and thus focused on convolution. It has pretty low FLOPS when doing matrix multiplication.

This approach might still be able to speed up prompt ingestion substantially though.

Maybe, but the inaccuracy is quite significant. I am not sure what'll happen.

Do you know if using the NPU reduces power consumption? I'm an idiot and installed a tiny heatsink on my Pi 5, so it throttles very quickly.

I think it can, but not with my backend in its current state. My backend only uses 1 thread out of all the threads given by GGML, and GGML will spin the non-working threads. It's a design flaw in GGML itself and needs a major refactor. I can't just use 1 thread either: some matrices are too large to fit on the NPU. It's possible to split the work and distribute it across the NPU cores, but it's too much work for little gain (as the model is hallucinating constantly).

To compile and run my fork: I don't recommend running more than 13 layers of a 7B model on the NPU; it starts going crazy afterwards. I develop with 10.

cmake .. -DLLAMA_RKNPU2=ON
make -j
./bin/main .... -ngl 10

Also, you need a Q8_0 model. It's kinda moot for lower bits since the minimum supported by the NPU is 8 bits.

@psyhtest

psyhtest commented Jan 2, 2024

Great thread! I have a Firefly RK3588S board, so it would be great to try this out. Don't have much hope for the NPU, but am wondering if offloading matrix multiplications to the Arm Mali GPU via Arm Computer Library might be worthwhile? Any thoughts?

@marty1885

marty1885 commented Jan 2, 2024

@prusnak I tried something similar with GGML's OpenCL backend a while back. I modified it enough to get RWKV (not llama) running on the Mali GPU. It has many problems, mainly:

  1. ARM's OpenCL implementation is buggy and doesn't play nice with GGML
  2. For some reason, the OpenCL latency is very high on my OrangePi 5
  3. Decompressing k-quants requires a lot of integer operations. But the Mali GPU has 1/4 integer capacity compared to floating point.

ACL can work, but I question whether it'll be helpful. GGML pre-transposes matrix B in matmul(A, B), so the access pattern is already as efficient as it can be. IMO most OpenCL compilers can easily optimize for that (need to confirm by decompiling though). After getting GGML working on Mali, you'll have to choose: either don't support k-quants and run into the same accuracy vs bandwidth tradeoff as I do with the NPU, or support k-quants and make k-quant decompression fast on the Mali.
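To make the integer-heavy decompression point concrete, here is roughly what unpacking GGML's simplest 4-bit format looks like (Q4_0, not even a k-quant; k-quants add more levels of integer unpacking on top). This is a simplified sketch: the real block_q4_0 stores its scale as fp16, not float.

// Sketch of Q4_0-style dequantization: 32 weights per block, one scale,
// 16 bytes of packed 4-bit values. The unpacking is all shifts and masks,
// i.e. integer work, before any floating-point math happens.
#include <cstdint>

#define QK4_0 32
struct block_q4_0_sketch {
    float   d;               // scale (fp16 in the real format)
    uint8_t qs[QK4_0 / 2];   // 32 x 4-bit quants, two per byte
};

void dequantize_q4_0_sketch(const block_q4_0_sketch *b, float *out) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        const int x0 = (b->qs[i] & 0x0F) - 8; // low nibble, centered at 0
        const int x1 = (b->qs[i] >> 4) - 8;   // high nibble
        out[i]             = x0 * b->d;
        out[i + QK4_0 / 2] = x1 * b->d;
    }
}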

Good luck. I'd love to see more LLMs on the edge.

====

For anyone interested, a progress update on my side: with RKNPU2 1.6.0 it almost makes sense to use the NPU. I'm less than 10% away from being faster than the CPU in INT8 mode with just 1 NPU core. Next step is to debug non-square matrix multiplication. Something somewhere is wrong.

I won't update every step here. Please either follow my fork or check my blog from time to time.
Latest progress: https://clehaxze.tw/gemlog/2023/12-17-update-on-ggml-rknpu2-backend-and-rknpu2-1_6_0.gmi

@ggerganov
Member

@marty1885 Your work is very interesting. Have you considered running Whisper models on the NPU? Could be better suited as the models are much smaller compared to 7B LLMs and would immediately have various real-world applications.

@marty1885

marty1885 commented Jan 2, 2024

@ggerganov Thanks, that's already been done by other people. https://github.com/usefulsensors/useful-transformers runs Whisper on the NPU. They are able to do much more extensive optimizations compared to GGML though. The NPU demands a custom matrix layout for maximal performance, and they are able to eliminate a majority of the layout conversions by abstracting them away.

Actually, good idea. I can try targeting my work against whisper.cpp. Do you know any use cases for it? And what would be the process to upstream an entire new backend?

@ggerganov
Member

From a quick look at this repo, it looks like they use the NPU just for the matrix multiplications. All other operations, such as convolutions, softmax, layernorm, etc. are on the CPU. Does the NPU API allow implementing all the other ops, or is it limited just to matrix multiplications?

The reason I'm wondering is that ggml currently does not provide an efficient way to run both CPU and NPU ops in a single compute graph, because the CPU threads must remain spinning while the NPU is doing stuff. So it would be much better if we could offload the entire compute on the NPU and leave the CPU idle. Starting and stopping threads can become quite expensive, especially for smaller models, so that's why it should be avoided.

Still, if it is not possible for the NPU to do general computations, then we can perform just the heavy matrix operations in the Whisper Encoder, in a similar way to how we currently use BLAS. I think you've already prototyped this to a good extent in your fork. Some of the smaller matrix multiplications should probably remain on the CPU - needs experimentation.

I don't see a way around reshuffling the tensor data to fit the NPU layout. This will be some overhead that the NPU backend implementation would have to perform on the input and output data.

As long as the changes are contained as much as possible in ggml-npu.h/ggml-npu.c, it should be easy to upstream. We would need to make some basic CI and I don't see a problem with having the backend merged, given that we see performance / energy gains.

@marty1885

Does the NPU API allow implementing all the other ops, or is it limited just to matrix multiplications?

For now it is limited to matrix multiplications only. Softmax, convolution, etc. are locked behind their ONNX compiler and are not open source.

Yeah, reordering is a major performance bottleneck right now. I hope the vendor can solve this or at least largely mitigate it. I hope future chip designers can make data layout easy and expose a more low-level API.

I'll submit a PR if I make it useful or a new SDK solves the current problems.

@mtx512

mtx512 commented Jan 4, 2024

@marty1885 I'm in the midst of trying to reverse engineer parts of the RK3588 NPU, as I am keen to understand how the matrix multiplication is handled by the NPU to see if it could be optimised/open sourced. From your testing with fp16, do you have any insight into how large the matrices get for llama 7B? I'm assuming they can't be larger than [512x512] x [512x512], as that would already require 0.5 MB of memory for the output of a single operation.

@marty1885

@mtx512
There are 2 kinds of matrix multiplications in llama: one for the dot-product attention, another for token processing. I never saw the [N x N x N] matrix multiplication hit my backend. I assume either llama.cpp has a special code path to handle it, or it failed the NPU compatibility check since I only implemented matrix relayout during initialization. More likely it is the latter. I never tried to debug this, since relayout is very slow and simply not worth doing on the fly.

The regular matrix multiplications on encoder/decoder weights are more like GEMV than GEMM. They basically have the following shapes (note that in GGML's source code src0 is matrix B for the RKNN API, and src1 is A):

  • A: [batch x 4096], B: [4096, 4096]
  • A: [batch x 4096], B: [4096, 10240]

batch here is the number of tokens being processed. During prompt processing it is the number of tokens, up to some parameter that can be controlled from the CLI (IIRC the default max is 512); during text generation it is 1. Instead of optimizing for matrix multiplication, I think it'll be much more beneficial to optimize for matrix-vector multiplication if possible, since that's where the vast majority of time is spent during generation. It would also be nice if we could offload softmax from the CPU.
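In other words, with batch = 1 during generation each of those multiplications degenerates into a matrix-vector product, so every weight element is streamed from RAM once per token. A naive sketch (fp32 and row-major assumed purely for illustration):

// Sketch: the per-token "matmul" during generation is really a GEMV.
// Each weight is read exactly once per token, so throughput is bounded by
// how fast the weights can be streamed from memory, not by FLOPS.
#include <cstddef>

// y[rows] = W[rows x cols] * x[cols], with W stored row-major.
void gemv_f32(const float *W, const float *x, float *y,
              size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        const float *w_row = W + r * cols;
        for (size_t c = 0; c < cols; ++c) {
            acc += w_row[c] * x[c];
        }
        y[r] = acc;
    }
}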

Good luck! Hope you find success.

@happyme531

@marty1885 I'm in the midst of trying to reverse engineer parts of the RK3588 NPU, as I am keen to understand how the matrix multiplication is handled by the NPU to see if it could be optimised/open sourced. From your testing with fp16, do you have any insight into how large the matrices get for llama 7B? I'm assuming they can't be larger than [512x512] x [512x512], as that would already require 0.5 MB of memory for the output of a single operation.

I doubt the NPU can actually run MatMul "natively" with matrix sizes >= 256x256. (For ONNX models, a MatMul with size equal to or larger than 256x256 cannot run on the NPU!)

@happyme531

I tried something similar with GGML's OpenCL backend a while back. I modified it enough to get RWKV (not llama) running on the Mali GPU. […]

TVM has better support for the Mali GPU with OpenCL; see the MLC-LLM project. Also, I have tried to run some other small models that cannot run effectively on the NPU on the GPU instead, and it performs pretty well.

@happyme531

happyme531 commented Jan 8, 2024

The RKNPU2 memory allocation size limit issue has been resolved in my fork by happyme531@eaf7a15.
But after testing there are still output quality issues, even in fp16 precision. Don't know why.

@marty1885

marty1885 commented Jan 8, 2024

@happyme531 Looks like you are right. The 1.6.0 SDK does state that the product between channels cannot be >= 65532. Maybe this is the reason? They forgot to document this limitation for the matmul API?

(For the people in this thread who can't read Chinese, trust me.)
On Page 57 of 05_RKNN_Compiler_Support_Operator_List_v1.6.0.pdf
[image: screenshot of the constraint table on that page]

I've merged your fix into my fork.

@mtx512

mtx512 commented Jan 11, 2024

The RKNPU2 memory allocation size limit issue has been resolved in my fork by happyme531@eaf7a15. But after testing there are still output quality issues, even in fp16 precision. Don't know why.

RK3588 NPU data pointers are limited to bits 31:0 (based on the TRM), hence the 4GB limit. Curious why you think it can be larger?

@happyme531

happyme531 commented Jan 12, 2024

RK3588 NPU data pointers are limited to bits 31:0 (based on the TRM), hence the 4GB limit. Curious why you think it can be larger?

Honestly, I did not know about this limit when writing the fix. No document ever mentioned it. And the resulting code runs smoothly without a single error (except for the output quality issue, which has many potential causes).
(Probably it isn't actually an issue; maybe some sort of workaround for using >4GB memory is present inside the rknn library?)

@mtx512

mtx512 commented Jan 12, 2024

Honestly, I did not know about this limit when writing the fix. No document ever mentioned it. And the resulting code runs smoothly without a single error (except for the output quality issue, which has many potential causes). (Probably it isn't actually an issue; maybe some sort of workaround for using >4GB memory is present inside the rknn library?)

The RKNN docs mention zero-copy APIs; for these the memory has to be compatible with the NPU, so for the RK3588 this would be a 32-bit address in physical memory. If you're providing a physical address over 4GB, I'd suspect it just truncates it to 32 bits, so it ends up using a random location. If you provide a virtual address, then it has to copy the data to a physical location in the 32-bit range, hence the performance drop.
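A tiny illustration of the truncation concern (the address below is made up):

// Sketch: what a DMA engine with a 32-bit address register would see if
// handed a physical address above 4 GiB - the top bits are silently lost.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t phys = 0x123456000ULL;  // example address > 4 GiB
    const uint32_t seen = (uint32_t)phys;  // what 32 bits can hold
    std::printf("cpu: %#llx  npu would see: %#x\n",
                (unsigned long long)phys, (unsigned)seen);
    return 0;
}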

@jimtendo

jimtendo commented Jan 12, 2024

The RKNN docs mention zero-copy APIs; for these the memory has to be compatible with the NPU, so for the RK3588 this would be a 32-bit address in physical memory. […]

Are we certain there is a constraint on a 32-bit PHYSICAL memory address? Looking at the RK NPU API here:

https://github.com/rockchip-linux/rknpu2/blob/master/runtime/RK3588/Linux/librknn_api/include/rknn_api.h#L348

... the physical address is defined as a uint64_t.

Also, regarding the FP16 constraint, is this a hardware limitation? In theory, it looks like it should be able to support 8bit.

https://github.com/rockchip-linux/rknpu2/blob/master/runtime/RK3588/Linux/librknn_api/include/rknn_api.h#L144

I've yet to play with any of this though, so take the above with a grain of salt.

EDIT: Looking at that structure a bit deeper, it looks like there is a 32-bit constraint on the tensors themselves. But if these do not have to sit in (or be copied to) the first 4GB of physical memory, might it be possible, given that memory is shared, to take an approach where we process with the NPU a layer at a time?

@marty1885

marty1885 commented Jan 13, 2024

Also, regarding the FP16 constraint, is this a hardware limitation? In theory, it looks like it should be able to support 8bit.

It's both. GGML doesn't natively do quantized inference. "Quantization" to GGML means compressing the weights, decompressing them on the fly and keeping them in cache. The decompressed result is still floating point, and GGML does all its math in floating point (FP32 on CPU and optionally FP16 on GPU).

Meanwhile the NPU expects both matrices to be the same type: both FP16 or both INT8. I tried converting both the weights and the input into fixed point (INT8). It seems the network needs more accuracy than 8 bits, or else it goes crazy if too many layers are run at this very limited accuracy.

It would be perfect if RKNN could support weights in INT8/INT4 fixed point but keep the inputs in FP16. But I doubt that, since the NPU is more like a fixed-pipeline GPU from the old days.
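For readers wondering what converting activations to fixed point looks like, a minimal per-tensor symmetric INT8 round trip (not the fork's actual code); the rounding error introduced here is what accumulates when many layers run on the NPU:

// Sketch: symmetric per-tensor INT8 quantization of an fp32 tensor.
// scale = max|x| / 127, q = round(x / scale), reconstructed x' = q * scale.
// The difference x - x' is the per-layer error that piles up in INT8 mode.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize_int8(const std::vector<float> &x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    QuantizedTensor out{std::vector<int8_t>(x.size()), scale};
    for (size_t i = 0; i < x.size(); ++i) {
        out.q[i] = (int8_t)std::lround(x[i] / scale);
    }
    return out;
}

float dequantize_at(const QuantizedTensor &t, size_t i) {
    return t.q[i] * t.scale;
}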

@vincenzodentamaro

vincenzodentamaro commented Aug 14, 2024

Is there any comparison in speed (tokens/second) between rkllm and this version of llama.cpp with the NPU enabled on the RK3588, using the same quantized model, like Phi-3 mini?

@marty1885

@vincenzodentamaro Never tested, but I assume RKLLM is much faster. My backend was an experiment and never well optimized. Plus Rockchip has low-level access while I can only use their MatMul API, etc.

@vincenzodentamaro

Thank you for the answer @marty1885. I might try to integrate the open-source reverse-engineered NPU driver from https://blog.tomeuvizoso.net/search/label/rk3588
What do you think?

@marty1885

marty1885 commented Aug 14, 2024

@vincenzodentamaro The OSS driver is yet to be documented (documentation is critical, as the user-space control ties very deeply into how the NPU hardware works). I contacted the author 2 weeks ago; he is busy with personal matters and will write the docs afterwards.

Currently the Mesa code is the only documentation we have, and I'm not going to read those thousands of lines of magic.

Please be patient while things progress. I too want to have the NPU be useful.

@haixuanTao

Slightly unrelated to this very topic, but a new small AI board dropped recently: http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-AIpro(20t).html with a significant performance boost compared to the RK3588 (20 TOPS compared to 6 TOPS), while retaining similarity in the way the NPU acceleration is done, at a similar price as well.

The RK3588 seems to be starting to show its age.

@happyme531

happyme531 commented Aug 16, 2024

Tested their group quant matmul API on RK3588:

  • only supports RKNN_INT8_MM_INT8_TO_FLOAT32 (so you have to find a way to quant the activations to int8 on the fly); otherwise the quantization params will not work at all.

  • only supports the weight matrix. (so the activations can't even be quantized by group!) (edit: seems this is not true since the weight can be mat A in matmul AxB->C)

  • only supports RKNN_QUANT_TYPE_PER_GROUP_SYM (zeropoint=0); otherwise the program will segfault on calling rknn_matmul_set_quant_params (terrible error handling)

  • only supports native layout

  • speed test results of [256, k] x [k, n] where k, n in {64, 128, 256, 512, 1024, 2048, 4096} with various group sizes:
    [results posted as an image]

  • Maybe this issue should be turned into a discussion, or better, a discord server?

@vincenzodentamaro

Slightly unrelated to this very topic, but a new small AI board dropped recently: http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-AIpro(20t).html with a significant performance boost compared to the RK3588 (20 TOPS compared to 6 TOPS), while retaining similarity in the way the NPU acceleration is done, at a similar price as well.

The RK3588 seems to be starting to show its age.

I couldn't find any SDK or open-source code to make its NPU work.

@jimtendo

jimtendo commented Aug 17, 2024

I couldn't find any SDK or open-source code to make its NPU work.

I suspect it might still be a bit constrained with respect to LLMs too: it's LPDDR4X. The RK3588 has some boards available (Orange Pi 5 Max and CM3588 Pro) that are LPDDR5. Given that generation is mostly I/O bound, I think RAM bandwidth might have more bearing on performance?

It could probably work well for Stable Diffusion though, as I think that's more compute bound? If we end up with a GGML backend for the Rockchip NPU eventually, I would be very keen to see how it performs with SD. Vulkan on an AMD 5600G APU yielded a >50% performance improvement over CPU for me with stable-diffusion.cpp ( leejet/stable-diffusion.cpp#291 (comment) ).

@jimtendo

It's LPDDR4X

I just want to make a correction - it is LPDDR4X, but based on pictures, it looks like it's tri-channel (three RAM chips). Also, I did some research - it's using a Huawei AI chip (the same as "Ascend", as I gather), which means it should be compatible with CANN (which looks to already have a GGML backend?).

This model is unavailable outside of China right now though. I suspect that has something to do with it being a Huawei chip.

@marty1885

This model is unavailable outside of China right now though.

You are correct. Also, the images it comes with only have mirrors in China, so downloading anything goes through the GFW and is very slow.

@guoguo1314

guoguo1314 commented Aug 26, 2024

Hello, I want to run my own LLM (linear attention) on the RK3588 with the NPU. I noticed that rknn-llm provides very few interfaces (most of the code has been encapsulated into .so files, so I feel like it might be almost impossible to adapt a model to it). Would it be better to make modifications directly on your fork (llama.cpp) instead? Or do you have a better idea? Thank you!

@marty1885

@guoguo1314 See my above comment
#722

@guoguo1314

guoguo1314 commented Aug 26, 2024

Emmm, first of all, thank you for your reply. I'm new to the RK3588, so I have a lot of basic questions; please don't find it troublesome, haha. I've already run your forked code on llama-7b-4bit, and I've read through the discussions above. However, I still have some questions: I have doubts about whether rknn-llm can adapt to my model, because most of the critical code is encapsulated in .so files, making it almost impossible to adapt the model (I need to confirm this, as I'm afraid it might be adaptable, but I haven't tried adapting it). If it's not possible, I'll try modifying your forked code to adapt it to my model.

@marty1885

marty1885 commented Aug 26, 2024

@guoguo1314

RKLLM is a compiler-runtime architecture. Rockchip has a track record of being bad at software - their compiler can't compile most models either. It's not about them shipping a closed-source blob; it's that their compiler doesn't work in most cases. The only thing us outsiders can do is wait for Rockchip to fix their code.

With that said, we can't progress on my open source RK3588 backend either. The official RKNPU2 runtime has limitations (matmul only, doesn't provide low-level access), while the open source driver Tomeu wrote is not documented (Tomeu is busy at his job right now). Having the source code of the driver is not sufficient in this case; we also need to understand how to issue commands and the format of the commands the NPU uses. There's little that can be done at this stage unless you want to read the code in the Mesa NPU backend that Tomeu wrote and figure out how to use the driver that way... To me the ROI is way too low. I'd wait for Tomeu to finish the documentation.

@guoguo1314

Thank you for your answer, and I will continue to discuss if there are any questions.

@happyme531

happyme531 commented Aug 26, 2024

RKLLM is a compiler-runtime architecture. […] I'd wait for Tomeu to finish the documentation.

I think we are already able to make an RK3588 LLM inference program better than rkllm.

  • The iommu switching function works and enables us to load >4GB of weights into the NPU.
  • There are already some w8a8 quant methods around, like SmoothQuant, or decomposing the weight activations like mllm-NPU (https://arxiv.org/pdf/2407.05858v1).
  • For other ops, we can just use the CPU, or use ONNX to define op kernels and call them using the standard RKNPU2 model runtime API.

But still, there are problems:

  • We need to handle the special buffer layout used by the NPU, either by converting back and forth before matmul, or by making other ops accept such a layout.
  • Since the RKNPU2 API is all blocking, multithreading is needed to do async work.
  • The ONNX to RKNN model compiler can only be run on an x86-64 machine, and can't handle arbitrary shapes.
  • Finally, I have no idea how to implement a ggml backend after the API changed.

@guoguo1314

Hello! I have the following questions. As shown in the code below, when using GGML_USE_RKNPU2, the backends selected are GGML_BACKEND_CPU and GGML_BACKEND_GPU, but it does not choose the NPU as the backend - or rather, how is NPU acceleration being utilized?

enum ggml_backend_type llama_backend_offload = GGML_BACKEND_CPU;
enum ggml_backend_type llama_backend_offload_split = GGML_BACKEND_CPU;
#ifdef GGML_USE_CUBLAS
if (ggml_cublas_loaded()) {
    LLAMA_LOG_INFO("%s: using " GGML_CUDA_NAME " for GPU acceleration\n", func);
    ggml_cuda_set_main_device(main_gpu);
    llama_backend_offload = GGML_BACKEND_GPU;
    llama_backend_offload_split = GGML_BACKEND_GPU_SPLIT;
}
#elif defined(GGML_USE_CLBLAST)
    LLAMA_LOG_INFO("%s: using OpenCL for GPU acceleration\n", func);
    llama_backend_offload = GGML_BACKEND_GPU;
    llama_backend_offload_split = GGML_BACKEND_GPU;
#elif defined(GGML_USE_RKNPU2)
    LLAMA_LOG_INFO("%s: using RKNPU2 for NPU acceleration\n", func);
    llama_backend_offload = GGML_BACKEND_CPU;
    llama_backend_offload_split = GGML_BACKEND_GPU;
#endif

Then, I used your ggml-rknpu2.c to load part of the matrix multiplication computation onto the npu in rwkv.cpp. In this part of the code in rwkv.cpp/rwkv.cpp:

#ifdef GGML_USE_CUDA
    backend = ggml_backend_cuda_init(0);
    RWKV_ENSURE_OR_NULL(backend);
#endif
#ifdef GGML_USE_METAL
    backend = ggml_backend_metal_init();
    RWKV_ENSURE_OR_NULL(backend);
    ggml_backend_metal_set_n_cb(backend, ctx->n_threads);
#endif
#ifdef GGML_USE_BLAS
    backend = ggml_backend_blas_init();
    RWKV_ENSURE_OR_NULL(backend);
    ggml_backend_blas_set_n_threads(backend, ctx->n_threads);
#endif
// add
#ifdef GGML_USE_RKNPU2
    backend = ggml_backend_cpu_init();
    RWKV_ENSURE_OR_NULL(backend);
#endif

What backend should be chosen here?
So I have the following two questions:
Without selecting npu as the backend, how is npu acceleration utilized?
What kind of backend should be chosen in my situation?
Thank you!

@marty1885

marty1885 commented Sep 3, 2024

@guoguo1314 In my fork, CMake adds the flag GGML_USE_RKNPU2 to the compile definitions, which is what actually causes GGML to use the NPU. llama_backend_offload_split = GGML_BACKEND_GPU is runtime and tells GGML that we are not using the CPU (since at that point GGML has no concept of an NPU), and I don't expect the OpenCL backend to be enabled in the user's build. I called it a hack for a good reason.

https://github.com/marty1885/llama.cpp/blob/cc8cb95c4d8ab7a344fd80480124b0fa5092c926/CMakeLists.txt#L436

TBH, I have considered porting the RKNPU2 code into rwkv.cpp. But rwkv.cpp has been stuck on the pre-GGUF version of GGML, so there's no proper backend framework in place.

@marty1885

@happyme531

  • Since the RKNPU2 API is all blocking, multithreading is needed to do async work.
  • The ONNX to RKNN model compiler can only be run on an x86-64 machine, and can't handle arbitrary shapes.

This is why I'm waiting for the documents for the FOSS driver. That solves these 2 problems right away.

  • Finally, I have no idea how to implement a ggml backend after the API changed.

No worries. I've been writing a backend for Tenstorrent's Metalium framework against up-to-date GGML. Porting the RK3588 backend will be a complete rebuild, but I know what to do.

  • We need to handle the special buffer layout used by the NPU, either by converting back and forth before matmul, or by making other ops accept such a layout.

This is my major concern. Relayout is really slow - so slow that it might not be faster unless we can map most, if not all, operators onto the NPU. However, my experience from building the Metalium backend tells me that GGML really wants the tensors in row-major format. Currently all backends (including the NCNN one!) and the frontend code assume view and reshape are practically free, which is only true under row-major.
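A small illustration of why the row-major assumption matters: under contiguous row-major storage a view or reshape is just new shape/stride metadata over the same buffer, whereas a non-row-major NPU layout forces an actual relayout copy (a sketch, not GGML's real tensor struct):

// Sketch: a row-major "view" is free - only the metadata changes, the
// underlying data pointer stays put. Any layout that is not contiguous
// row-major breaks this assumption and requires shuffling the data.
#include <cstddef>

struct View2D {
    float  *data;        // shared, never copied
    size_t  rows, cols;
    size_t  row_stride;  // elements between consecutive rows
};

// Reshape [rows x cols] -> [new_rows x new_cols] without touching data.
// Valid only when the buffer is contiguous row-major and the element
// counts match (rows*cols == new_rows*new_cols).
View2D reshape(const View2D &v, size_t new_rows, size_t new_cols) {
    return View2D{v.data, new_rows, new_cols, new_cols};
}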

@jimtendo

jimtendo commented Sep 9, 2024

Hi all, please forgive me if this is a naive question, but I recently noticed that the RK3588 datasheet lists it as supporting a Quad-Channel Memory configuration.

[image: memory interface table from the RK3588 datasheet]

https://www.cnx-software.com/pdf/Rockchip%C2%A0RK3588%C2%A0Datasheet%C2%A0V0.1-20210727.pdf

Given that much of the bottleneck on LLMs is memory bandwidth, does this suggest that an RK3588 SBC could "potentially" integrate four DDR5 RAM chips, thereby giving us up to 4x the bandwidth of a single DDR5 chip? Or is there another constraint on the RK3588 somewhere that would prevent this?

I have searched around and, though I've found what look like dual-channel DDR4 SBCs (e.g. OPi5), I don't think I've seen any that are quad-channel (or dual-channel DDR5).

If this is possible though, the RK3588 (if its NPU were well supported) might make a better "local assistant" than I first thought.
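A rough way to frame the bandwidth question (all numbers below are assumptions, not measurements): each generated token streams essentially the whole weight file from RAM, so memory bandwidth divided by model size gives a hard ceiling on tokens per second.

// Back-of-envelope token-rate ceiling from memory bandwidth alone.
// Assumed numbers: ~4 GB of weights for a 7B Q4 model, ~30 GB/s effective
// bandwidth for a dual-channel LPDDR4X/LPDDR5 setup - both are guesses.
#include <cstdio>

int main() {
    const double model_gb     = 4.0;   // weights read once per token
    const double bandwidth_gb = 30.0;  // effective GB/s (assumed)
    std::printf("upper bound: ~%.1f tokens/s\n", bandwidth_gb / model_gb);
    // Doubling the channel count would roughly double this ceiling, which
    // is why the quad-channel question above matters.
    return 0;
}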

@happyme531

Hi all, please forgive me if this is a naive question, but I recently noticed that the RK3588 datasheet lists it as supporting a quad-channel memory configuration. […]

http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-5-Pro.html

Searching for the code 'D8CSZ' marked on this board's DRAM chip in https://www.micron.com/sales-support/design-tools/fbga-parts-decoder gives the part number MT62F2G64D8CL-023 WT:B; searching that part number, you can find that it is already 64 bits wide.

@happyme531

happyme531 commented Sep 12, 2024

Even more information about the official "rkllm-toolkit":
(rkllm-1.0.2b6 obtained from https://console.zbox.filez.com/l/RJJDmB (password rkllm))

Do you know there is a function in Python called help() that can show the contents of a Python module?

import rkllm.base.common
help(rkllm.base.common)

help(rkllm.base.common)

This is basically from gguf-py/gguf/constants.py. They replaced "GGUF" with "LLM" (or RKLLM), for example:

    LLM_DEFAULT_ALIGNMENT = 32
    LLM_MAGIC = 1179993927
    LLM_VERSION = 3

https://github.com/ggerganov/llama.cpp/blob/316c7faf7740fa98ea68f1445f4505810f706b9e/gguf-py/gguf/constants.py#L7-L14

there are some new qtypes:

   class RKLLMQuantizationType(enum.IntEnum)
     |  RKLLMQuantizationType(value, names=None, *, module=None, qualname=None, type=None, start=1)
     |  
     |  An enumeration.
     |  
     |  Method resolution order:
     |      RKLLMQuantizationType
     |      enum.IntEnum
     |      builtins.int
     |      enum.Enum
     |      builtins.object
     |  
     |  Data and other attributes defined here:
     |  
     |  F16 = <RKLLMQuantizationType.F16: 1>
....
     |  Q8_1 = <RKLLMQuantizationType.Q8_1: 9>
     |  
     |  Q8_K = <RKLLMQuantizationType.Q8_K: 15>
     |  
     |  W4A16_C_0_0 = <RKLLMQuantizationType.W4A16_C_0_0: 105>
     |  
     |  W4A16_C_1_0 = <RKLLMQuantizationType.W4A16_C_1_0: 106>
     |  
     |  W4A16_G128_0_0 = <RKLLMQuantizationType.W4A16_G128_0_0: 112>
     |  
     |  W4A16_G64_0_0 = <RKLLMQuantizationType.W4A16_G64_0_0: 111>
     |  
     |  W4A8_C_0_0 = <RKLLMQuantizationType.W4A8_C_0_0: 107>
     |  
     |  W4A8_C_0_1 = <RKLLMQuantizationType.W4A8_C_0_1: 108>
     |  
     |  W4A8_C_1_0 = <RKLLMQuantizationType.W4A8_C_1_0: 109>
     |  
     |  W4A8_C_1_1 = <RKLLMQuantizationType.W4A8_C_1_1: 110>
     |  
     |  W8A8_C_0_0 = <RKLLMQuantizationType.W8A8_C_0_0: 101>
     |  
     |  W8A8_C_0_1 = <RKLLMQuantizationType.W8A8_C_0_1: 102>
     |  
     |  W8A8_C_1_0 = <RKLLMQuantizationType.W8A8_C_1_0: 103>
     |  
     |  W8A8_C_1_1 = <RKLLMQuantizationType.W8A8_C_1_1: 104>
     |  
     |  W8A8_G128_0_0 = <RKLLMQuantizationType.W8A8_G128_0_0: 113>
     |  
     |  W8A8_G256_0_0 = <RKLLMQuantizationType.W8A8_G256_0_0: 114>
     |  
     |  W8A8_G512_0_0 = <RKLLMQuantizationType.W8A8_G512_0_0: 115>

help(rkllm.base.converter)

Original: 'convert-hf-to-gguf.py', 'gguf_reader.py', 'gguf_writer.py'. Example:

    class LLMWriter(builtins.object)
     |  LLMWriter(fout, arch: 'str', use_temp_file: 'bool' = True, endianess: 'LLMEndian' = <LLMEndian.LITTLE: 0>)
     |  
     |  Methods defined here:
     |  
     |  __init__(self, fout, arch: 'str', use_temp_file: 'bool' = True, endianess: 'LLMEndian' = <LLMEndian.LITTLE: 
0>)
     |  
     |  add_add_bos_token(self, value: 'bool') -> 'None'
     |  
     |  add_add_eos_token(self, value: 'bool') -> 'None'
     |  
     |  add_add_space_prefix(self, value: 'bool') -> 'None'
     |  
     |  add_architecture(self) -> 'None'
     |  
     |  add_array(self, key: 'str', val: 'Sequence[Any]') -> 'None'
     |  
     |  add_author(self, author: 'str') -> 'None'

https://github.com/ggerganov/llama.cpp/blob/316c7faf7740fa98ea68f1445f4505810f706b9e/gguf-py/gguf/gguf_writer.py#L33-L60

help(rkllm.base.eval)

Evaluator based on common existing codes, like https://github.com/QwenLM/Qwen/blob/main/eval/evaluate_cmmlu.py

help(rkllm.base.interpreter)

Seems to be a rknpu2 matmul api tester for internal development usage. To import this, you must copy a librknnc.so (from rknn-toolkit2) to /home/api/lib/linux-x86_64/cp38/.

help(rkllm.base.loader)

PyTorch model loader??

help(rkllm.base.log)

Their logger. Also used in rknn-toolkit2(-lite2)

help(rkllm.base.quantizer)

Original: awq-py/awq/apply_awq.py.

help(rkllm.base.trainer)

Original: https://github.com/bytedance/decoupleQ/blob/252939bc528eb9fd3c1b704acbb2a8c1ed06aa3c/decoupleQ/moq_quant.py

Edit:

Also worth noting: at first glance their quant method doesn't need a dataset, but this is probably an illusion, because a small fraction of the wikitext dataset is bundled into the pip package.

@danielmerja

@marty1885 there was an update two weeks ago from these guys.

https://github.com/airockchip/rknn-toolkit2/releases/tag/v2.2.0

@IngwiePhoenix
Author

I came back to this as I was looking at what the situation is nowadays.

Armbian does not seem to compile the required rknpu driver into their kernels natively, which is a bit unfortunate, and there also does not seem to be a DKMS package. However, this is looking pretty neat regardless. I have a Radxa ITX board now, so I'll see what I can get done. Thank you for all the links, genuinely impressive!

@bmilde

bmilde commented Nov 9, 2024

@IngwiePhoenix Yeah, the driver situation isn't ideal and I would like to run the NPU on more recent kernels too. On my Armbian Orange Pi 5 I'm running Linux kernel 6.11.6 with Armbian patches. I gave the DKMS driver a shot, but it doesn't fully compile yet: https://github.com/bmilde/rknpu-driver-dkms

Please contact me if you can help. There are a few Rockchip headers (and Rockchip-specific functionality) that the current Armbian kernel doesn't contain. Maybe these can be copied into the DKMS driver so that it is self-contained.

@Fuckingnameless

Fuckingnameless commented Nov 19, 2024

The RKNN toolkit added Arm support:
https://www.reddit.com/r/RockchipNPU/comments/1goleu2/rkkntoolkit2_now_supports_arm/

Also, here are the numbers I got in llama.cpp 9 months ago, CPU only, on the Rock 5B running DietPi OS.
Using 4 threads to avoid the 4 slower cores works best:

3b Q4KM = 6.8t/s
7b Q4KM = 4.8t/s
2x7b Q3KM = 3.8t/s
10.7b Q4KM = 2.5t/s
13b Q3KM = 2t/s
4x7b IQ2XS = 1-2t/s (laserxtral, IQ quants run badly)
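On the "4 threads to avoid the slower cores" point: -t 4 alone doesn't guarantee the threads land on the A76 cores, so pinning can help (e.g. taskset -c 4-7 before the llama.cpp command). A sketch using sched_setaffinity, assuming the A76 cores are CPUs 4-7 as on typical RK3588 boards - check lscpu first:

// Sketch: pin the calling process (and threads it spawns afterwards) to
// CPUs 4-7, which on most RK3588 boards are the Cortex-A76 cores. The core
// numbering is an assumption - verify with lscpu before relying on it.
#include <sched.h>
#include <cstdio>

int pin_to_big_cores() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; ++cpu) CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return -1;
    }
    return 0;
}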

@danielmerja

The RKNN toolkit added Arm support: https://www.reddit.com/r/RockchipNPU/comments/1goleu2/rkkntoolkit2_now_supports_arm/ […]

Sounds nice, but the problem with Rockchip is that they refuse to open-source it, and basically the community is tired of waiting for months for them to catch up to the latest updates.

@atiltman

The RKNN toolkit added Arm support: https://www.reddit.com/r/RockchipNPU/comments/1goleu2/rkkntoolkit2_now_supports_arm/ […]

Can't wait to see where things will go once mainline support is in. I've been playing with this for a few days and it's pretty fast with the NPU!
