Raspberry Pi 4 4GB #58
Comments
It looks like it's possible to pack it into an AWS Lambda on ARM Graviton with S3 weights offloading.
Is it swapping?
@neuhaus The kswapd0 process is pretty active.
@dalnk Do you have swap enabled on the system?
What did you change to be able to run it on the Pi? I have a PC four times more powerful and it crashes every time I try. @miolini
@MarkSchmidty Thank you for sharing your results. I believe my system swapped a lot due to the limited amount of RAM (4GB RAM, 4GB model).
Ah, yes. A 3-bit implementation of 7B would fit fully in 4GB of RAM and lead to much greater speeds. This is the same issue as in #97. 3-bit support is a proposed enhancement in GPTQ Quantization (3-bit and 4-bit) #9. GPTQ 3-bit has been shown to have negligible output quality loss vs uncompressed 16-bit, and may even provide better output quality than the current naive 4-bit implementation in llama.cpp while requiring 25% less RAM.
@MarkSchmidty Fingers crossed!
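(Rough arithmetic behind that claim, ignoring per-block scale overhead: 7B is roughly 6.7e9 parameters, so 4-bit weights come to about 6.7e9 × 0.5 bytes ≈ 3.4 GB, while 3-bit comes to about 6.7e9 × 0.375 bytes ≈ 2.5 GB. That is where the ~25% saving and the extra headroom on a 4GB board come from.)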
I'm currently unable to build for aarch64 on an RPi 4 due to missing SIMD dot product intrinsics (vdotq_s32). Replacing them with
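For context on why the build breaks: vdotq_s32 requires the ARMv8.2 dotprod extension, and the Pi 4's Cortex-A72 is a plain ARMv8-A core without it, so the intrinsic isn't available. Below is a minimal sketch of the usual workaround, widening to 16-bit products and pairwise-accumulating into 32-bit lanes. It is an illustration only, not the actual llama.cpp patch; dot_i8_fallback is a made-up name and n is assumed to be a multiple of 16.

```c
#include <arm_neon.h>
#include <stdint.h>

// Dot product of two int8 vectors without vdotq_s32 (no dotprod extension).
static inline int32_t dot_i8_fallback(const int8_t * a, const int8_t * b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        const int8x16_t va = vld1q_s8(a + i);
        const int8x16_t vb = vld1q_s8(b + i);
        // widening multiply: 8-bit x 8-bit -> 16-bit products
        const int16x8_t lo = vmull_s8(vget_low_s8(va),  vget_low_s8(vb));
        const int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        // pairwise add-and-accumulate the 16-bit products into 32-bit lanes
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    return vaddvq_s32(acc); // horizontal sum of the four lanes
}
```

With dotprod available, the loop body collapses to acc = vdotq_s32(acc, va, vb), which is what the newer code path assumes.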
@Ronsor Could you please share your build log?
@Ronsor Something is wrong with your environment. My build log on the RPi starts with:
Which distro are you using? I'm just on vanilla Raspberry Pi OS. It seems the vdotq change is the issue.
Re-adding the old dot product code fixed my issue.
Now that I fixed that (I'll submit a PR soon), running on an 8GB Pi results in not-terrible performance:
~1 token/sec
Hey @Ronsor, I'm having the same issue. Could you say exactly what you did to fix it?
@davidrutland Basically undo commit 84d9015 and it should build fine.
Not able to build on my RPi 4 4GB running Ubuntu 22.10.
However, it built once I removed the changes added in #67 related to
I tried to run
Turns out I was using the fp16 model, which is why it core dumped. It was resolved after I ran the correct command. I would suggest we note in the README that this step is platform agnostic, and that users should consider running it on a desktop machine and copying the result over if they are running the model on a lower-spec device like a Raspberry Pi. WDYT?
@Mestrace I am also getting a segfault core dump, but when quantizing on my desktop with plenty of RAM available. What were the wrong and correct commands you ran in relation to the fp16 model?
@octoshrimpy I believe Mestrace is saying you should convert and quantize the model on a desktop computer with a lot of RAM first, then move the ~4GB 4-bit quantized model to your Pi.
@MarkSchmidty That is what I am attempting, haha. Is 16GB of free RAM not enough for quantizing 7B? This is what I'm running into; unsure where to go from here.
Run
@octoshrimpy What I did:
@Mestrace What command did you use for quantizing?
@gjmulder I have 350G of space available and plenty of RAM. quantize immediately crashes with a segfault, so there is no RAM/disk utilization to view. Are there logs I can check, or an INFO-level logging option I can enable?
I did everything on the RPi 4. Just enable swap (8GB+) on your system.
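(A general pointer for anyone following along, not necessarily what was done here: on Raspberry Pi OS, swap is usually enlarged by raising CONF_SWAPSIZE in /etc/dphys-swapfile and restarting the dphys-swapfile service; on other distros a swap file created with fallocate, mkswap, and swapon works as well. Heavy swapping does wear SD cards, as noted further down in the thread.)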
@miolini I'm not using a Pi for quantization. I'm doing it on Arch Linux with 16 GB of RAM, 10 GB of swap, and barely anything else running (about 2 GB used before I start).
... 10 years per token. Do not ask me.
Yes, if you don't have enough RAM it's going to be too slow to be useful. There are other models useful for interesting projects on devices with less RAM. whisper.cpp only needs 64MB of RAM for the smallest model and 1GB for the largest, for example.
Could you please share your build log and the output of the command
Never mind, it worked for me; I had to use the old-school
Seems like lines 1938 and 1939 of ggml.c should be changed from
to
Doesn't this fry SD cards? Maybe I'm not understanding this correctly.
It does, sadly. In my case it was for a 2-hour PoC, not lengthy usage, so it's fine.
This is awesome, and on my 8GB desktop it runs fine. But with 4GB I still can't get it running. I changed my swap file to 4GB and played with the parameters. My question is: does anybody know how the parameters llama_model_load: memory_size = 1024.00 MB, n_mem = 65536 can be changed? In all the screenshots I saw, n_mem is about 16000 and memory_size about 512. And why is the ctx size such an odd number and not 4500MB?
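For what it's worth, my understanding (an assumption about how llama.cpp of this vintage sizes things, not a confirmed reading of the source) is that those values are derived rather than set directly: n_mem = n_layer * n_ctx, and memory_size covers the K and V caches, i.e. 2 * n_mem * n_embd elements. A quick back-of-the-envelope check in C:

```c
#include <stdio.h>

// Hypothetical sizing of the printed "memory_size" / "n_mem" values, assuming
// a K cache and a V cache of n_embd values per layer per context slot.
int main(void) {
    const long n_layer = 32;    // LLaMA 7B
    const long n_embd  = 4096;  // LLaMA 7B
    const long n_ctx   = 512;   // context size, set on the command line
    const long bytes   = 4;     // f32 cache entries; 2 if the cache is f16

    const long n_mem       = n_layer * n_ctx;
    const long memory_size = 2 * n_mem * n_embd * bytes;   // K + V

    printf("n_mem = %ld, memory_size = %.2f MB\n",
           n_mem, memory_size / (1024.0 * 1024.0));
    // prints: n_mem = 16384, memory_size = 512.00 MB (matches the log below)
    return 0;
}
```

Under that assumption, memory_size = 1024.00 MB with n_mem = 65536 corresponds to n_ctx = 2048 with an f16 cache, so the knob to turn is the context-size parameter (-c / --ctx_size, if your build has it) rather than a separate setting. The 4529.34 MB ggml ctx size in the run logged below is roughly the 4017 MB of quantized weights plus this cache plus a little working space, which is why it isn't a round number.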
I failed to build on Pi 4
I also managed to run LLaMA 7B on a Raspberry Pi 4! I recorded a video of the process if anyone is interested (8:10 for the inference demo): Video
I managed to get it running on a Rock Pi 4SE, but as you mentioned it's super slow. I also managed to get it working with OpenCL, but it really didn't make any difference.
Did you succeed in running it on an AWS Lambda instance? If so, with what memory size?
KV cache is now cyclic, split into a permuted V variant. The ggml_tensor_print function has been completely reworked to output proper 1-4 dim tensors with data. Example:
```
+======================+======================+======================+======================+
| :0 | V [f32 type]
+----------------------+----------------------+----------------------+----------------------+
| Dimensions | Strides | Layer id | Backend |
| 3 | 4x16x1024 | 0 | CPU |
+----------------------+----------------------+----------------------+----------------------+
| Elements | Src0 | Src1 | Operation |
| 4 x 64 x 2 | 4 x 64 x 2 | N/A | CONT |
+----------------------+----------------------+----------------------+----------------------+
| Transposed: No | Permuted: No | Contiguous: Yes | Size: 0.00 MB |
| Src0 name: | cache_v (view) (permuted) |
+----------------------+----------------------+----------------------+----------------------+
+-------------------------------------------------------------------------------------------+
| Content of src0 "cache_v (view) (permuted)" (3 dim)
+-------------------------------------------------------------------------------------------+
| Content of src0 "cache_v (view) (permuted)" (3 dim) | Total Elements : [ Row:4 Col:64 Layer:2 ]
+-------------------------------------------------------------------------------------------+
| Row 1: [0.302 , 0.010 ] [-0.238 , 0.680 ] [0.305 , 0.206 ] [-0.013 , 0.436 ] [-0.074 , -0.698 ] [-0.153 , -0.067 ]
| Row 2: [0.091 , 0.199 ] [0.253 , 0.151 ] [-0.557 , 0.089 ] [0.298 , -0.272 ] [-0.149 , 0.232 ] [-0.217 , 0.193 ]
| Row 3: [-0.085 , -0.014 ] [0.225 , 0.089 ] [-0.338 , 0.072 ] [0.416 , -0.186 ] [-0.071 , 0.110 ] [0.467 , 0.497 ]
| Row 4: [-0.336 , 0.471 ] [-0.144 , 0.070 ] [-0.062 , 0.520 ] [0.093 , 0.217 ] [-0.332 , -0.205 ] [0.012 , 0.335 ]
+-------------------------------------------------------------------------------------------+
+-------------------------------------------------------------------------------------------+
| Content of dst "V" (3 dim)
+-------------------------------------------------------------------------------------------+
| Content of dst "V" (3 dim) | Total Elements : [ Row:4 Col:64 Layer:2 ]
+-------------------------------------------------------------------------------------------+
| Row 1: [0.302 , 0.010 ] [-0.238 , 0.680 ] [0.305 , 0.206 ] [-0.013 , 0.436 ] [-0.074 , -0.698 ] [-0.153 , -0.067 ]
| Row 2: [0.091 , 0.199 ] [0.253 , 0.151 ] [-0.557 , 0.089 ] [0.298 , -0.272 ] [-0.149 , 0.232 ] [-0.217 , 0.193 ]
| Row 3: [-0.085 , -0.014 ] [0.225 , 0.089 ] [-0.338 , 0.072 ] [0.416 , -0.186 ] [-0.071 , 0.110 ] [0.467 , 0.497 ]
| Row 4: [-0.336 , 0.471 ] [-0.144 , 0.070 ] [-0.062 , 0.520 ] [0.093 , 0.217 ] [-0.332 , -0.205 ] [0.012 , 0.335 ]
+-------------------------------------------------------------------------------------------+
+======================+======================+======================+======================+
```
Hi!
Just a report. I've successfully run the LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It's super slow, at about 10 sec/token, but it looks like we can run powerful cognitive pipelines on cheap hardware. It's awesome. Thank you!
Hardware : BCM2835
Revision : c03111
Serial : 10000000d62b612e
Model : Raspberry Pi 4 Model B Rev 1.1
%Cpu0 : 71.8 us, 14.6 sy, 0.0 ni, 0.0 id, 2.9 wa, 0.0 hi, 10.7 si, 0.0 st
%Cpu1 : 77.4 us, 12.3 sy, 0.0 ni, 0.0 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 81.0 us, 8.6 sy, 0.0 ni, 0.0 id, 10.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 77.1 us, 12.4 sy, 0.0 ni, 1.0 id, 9.5 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3792.3 total, 76.2 free, 3622.9 used, 93.2 buff/cache
MiB Swap: 65536.0 total, 60286.5 free, 5249.5 used. 42.1 avail Mem
2705518 ubuntu 20 0 5231516 3.3g 1904 R 339.6 88.3 84:16.70 main
102 root 20 0 0 0 0 S 14.2 0.0 29:54.42 kswapd0
main: seed = 1678644466
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
main: prompt: 'The first man on the moon was '
main: number of tokens in prompt = 9
1 -> ''
1576 -> 'The'
937 -> ' first'
767 -> ' man'
373 -> ' on'
278 -> ' the'
18786 -> ' moon'
471 -> ' was'
29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
The first man on the moon was 20 years old and looked a lot like me. In fact, when I read about Neil Armstrong during school lessons my fa