llama : grouped-query attention + LLaMAv2 70B support #2276
Conversation
So it seems another model parameter is needed: the group size, which is 1 for the older LLaMA models and 8 for the new ones? |
Maybe try to deduce it for now, to avoid a file format mess. We'll add it in GGUF, or temporarily pass it from the cmd line. |
Do I understand correctly that this mechanism will reduce the KV cache size as well? |
In terms of CUDA support: for me the release has come at a rather inopportune time. I'll maybe have some time to look into it on the weekend but I don't want to make any promises. Other people are welcome to work on it in the meantime (but if you do please notify me to avoid duplicate implementations). |
Yes, the mechanism (GQA) is solely for reducing KV cache size by compromising some amount of accuracy. |
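To put rough numbers on that, here is a back-of-the-envelope sketch (assuming the published Llama-2-70B shape of 80 layers, 8192 hidden size, 64 query heads and 8 KV heads, an f16 cache and a 2048-token context; the figures are illustrative, not measured from llama.cpp):
```python
# Rough KV cache size for an f16 cache (2 bytes per element); illustrative only.
n_layer, n_embd, n_head, n_head_kv, n_ctx = 80, 8192, 64, 8, 2048
head_dim = n_embd // n_head  # 128

def kv_cache_bytes(kv_heads: int) -> int:
    # K and V each hold n_ctx * kv_heads * head_dim f16 values per layer.
    return 2 * n_layer * n_ctx * kv_heads * head_dim * 2

mha = kv_cache_bytes(n_head)     # no GQA: every query head has its own K/V head
gqa = kv_cache_bytes(n_head_kv)  # GQA: 8 query heads share one K/V head
print(f"MHA: {mha / 1e9:.2f} GB, GQA: {gqa / 1e9:.2f} GB ({mha // gqa}x smaller)")
# MHA: 5.37 GB, GQA: 0.67 GB (8x smaller)
```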
How do I run ./quantize? |
|
Thank you for clearing up the steps - however, I'm seeing this error:
Looking at the error above, is there anything I can try to overcome this? |
@nrbontha, .pth files have been supported since the start of llama.cpp, so I think the problem is that your files are corrupted. |
Is there an MD5 or checksum, SlyEcho? |
Meta's script also downloads MD5 checksums, but just check that the file is not 0 bytes; I had that problem when downloading. |
Thanks for trying to help @SlyEcho, but I'm afraid that isn't the issue (at least for me):
|
Alright, I have to check now. |
I got this error as an output from main on apple silicon |
@l0d0v1c Are you compiling with the changes applied that this PR proposes? |
Thanks, I had only applied it partially; now it works fine. |
It looks like changes are needed in convert.py to work for 70B HF files. A simple temporary solution would be to change the following line (convert.py, line 187 in 294f424): |
|
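As a sketch of how the GQA factor could be deduced on the HF side (assuming the checkpoint's config.json exposes num_attention_heads and num_key_value_heads, as the Llama-2-70B HF configs do; the helper below is hypothetical, not the actual convert.py code):
```python
import json
from pathlib import Path

def guess_gqa_factor(model_dir: str) -> int:
    """Derive the grouped-query attention factor from an HF config.json."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    n_head = cfg["num_attention_heads"]                 # 64 for Llama-2-70B
    n_head_kv = cfg.get("num_key_value_heads", n_head)  # 8 for Llama-2-70B; absent on older models
    return n_head // n_head_kv                          # 64 // 8 == 8
```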
Remember to run |
Getting garbage output from q4_0 converted from safetensor model https://huggingface.co/TheBloke/Llama-2-70B-fp16, but works fine with q4_0 converted from (original?) .pth model https://huggingface.co/anonymous4chan/llama-2-70b-original . The safetensor model files were downloaded ok (checked sha256sums) so guessing they are corrupt or somehow incompatible with the convert.py script. The model was converted from pth using the latest llama 2 PR according to the model card. |
Tried to convert another 70b safetensor model https://huggingface.co/NousResearch/Llama-2-70b-hf and quantized it to q4_0, and it also outputs garbage. A llama2-7b safetensor https://huggingface.co/NousResearch/Llama-2-7b-hf works fine. Is the current convert.py incompatible with the llama2-70b safetensor files? |
@schappim What hardware are you using? I tried the 13b-chat model on my M1 Macbook Air, but it is incredibly slow. I cannot run on the GPU (even the 7b model) because I run out of memory (only 8GB total). |
I too am getting garbage output, both from fp16 and q4_0. Steps taken:
q4_0 result:
fp16 inference result is also not usable:
Maybe the n_mult = 256 line is wrong? @schappim what changes did you make to convert.py to make your fp16? |
The value does not matter as the PR does not use it anyway. Lines 1029 to 1030 in 2d2bb6b
|
Ah OK, thanks. Then all I can think is that maybe it worked for schappim because he was converting from PTH, and we're converting from HF? I don't know what practical difference that would make, though. And that's not much help for the model I'm trying to convert now, which is a fine-tune, so I only have it in HF/pytorch format. |
It looks like Meta did change the structure of the 70b HF models somehow. |
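One plausible explanation (an assumption, not confirmed in this thread): the HF export stores q_proj/k_proj with a rotary-embedding permutation, and for 70B the k_proj only has 8 KV heads, so permuting it with the full n_head = 64 (as pre-GQA conversion logic would) scrambles the rows. A minimal sketch of the idea, using a convert.py-style permute:
```python
import numpy as np

def permute(w: np.ndarray, n_head: int) -> np.ndarray:
    # convert.py-style undo of the interleaved rotary layout in HF q_proj/k_proj.
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

n_embd, n_head, n_head_kv, head_dim = 8192, 64, 8, 128
wk = np.random.randn(n_head_kv * head_dim, n_embd)  # 70B k_proj: 8 KV heads -> 1024 rows

wk_ok = permute(wk, n_head_kv)  # groups rows per actual KV head
# permute(wk, n_head) also "works" shape-wise (1024 rows reshaped as 64 fake heads),
# but it mixes rows from different KV heads, which would produce garbage output.
```
Permuting k_proj with the KV head count instead of n_head is presumably the kind of change the 70B HF checkpoints need in convert.py.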
ggerganov, thank you for the code fix for running the 70B model! Steps to reproduce: Hardware & software: |
@Kangmo This error would happen if you ran the convert.py script from before this PR's changes. |
Re-running convert.py fixed the problem. thank you! |
Apologies if someone already answered it, but why does 70B use smaller scratch buffers than 65B? |
The K and V tensors have 8 times fewer heads, so the KV cache uses 8 times less memory, and some of the intermediate tensors are also 8 times smaller. |
Currently on commit master-41c6741, still having the same issue when trying to convert the llama2-70b model. Same error for the quantized versions q4_0 and q5_0. I have built using CMake and have also verified the checksums of the downloaded models. COMMAND: ERROR: |
Add -gqa 8 to your command. |
Perfect, thank you. I thought when this was merged, it would automatically account for this. |
When I try to run 70b with master, I get this error log: |
Uncommenting line 1062 which hard-codes |
Should this be working with Metal? I'm currently on master-41c6741 and trying to run this with
Full command: |
No GQA support in Metal yet - I don't have a machine with enough RAM to test and fix this. Waiting for my M2 Ultra to arrive - 6 weeks shipment already and counting .. :( |
Thank you very much for the clarification. |
No, I converted the model from the official Meta weights. The problem is perhaps that I used the WIP PR for the conversion? |
Yes, very likely. Re-run the convert.py script. |
I re-ran convert.py. |
@ggerganov - this may be of no help to you as I'd imagine you want a debugger at hand, but if I can be an interim bridge for you and run some tests on a 192GB M2 Ultra, let me know. Also just wanted to thank you for the level of effort you've put into guiding the architecture and principles of this project — a truly awesome accomplishment and appreciated by a large community! |
Not very relevant probably, but does the base 70B model come with 2K context only? When I pass it a 4K context parameter, it says that only 2K is supported? |
the 2k warning has not been updated since. gguf will come with |
@ggerganov - Thanks for getting this working: I'm able to run Llama2 70B models (i.e. state-of-the-art open-source models) in q6_K on an M2 Max MacBook Pro with 64GB. However, it is somewhat slow, rather slower than reading speed, so it would be lovely to get the TODOs mentioned above fixed to enable Metal GPU acceleration. |
@RDearnaley There were PRs merged in the last few days that should improve performance. When did you last pull? |
ref #2262

- For now, the GQA factor is passed on the command line with -gqa 8. With GGUF it will be read from the model hparams.
- ffn_dim_multiplier, needed to determine the correct value for n_ff, is hardcoded to 1.3 when an 80-layer model with GQA == 8 is loaded (i.e. this corresponds to 70Bv2). Otherwise, it defaults to 1.0. It also needs to be read from the model hparams in the future (see the sketch below).
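For reference, a sketch of how n_ff falls out of ffn_dim_multiplier, mirroring Meta's reference feed-forward sizing (the 70B values multiple_of = 4096 and ffn_dim_multiplier = 1.3 are taken from the published params as far as I can tell; treat this as an approximation, not the llama.cpp code path):
```python
def llama_n_ff(n_embd: int, multiple_of: int, ffn_dim_multiplier: float = 1.0) -> int:
    # Feed-forward width as in Meta's reference implementation (approximation).
    hidden = int(2 * (4 * n_embd) / 3)
    hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(llama_n_ff(4096, 256))        # 11008 -> 7B
print(llama_n_ff(8192, 4096, 1.3))  # 28672 -> 70B v2 (the 1.3 multiplier matters here)
```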
Some notes while working on this:

I haven't done perplexity calcs yet, but I ran a few text generations using the small 7Bv2 model. My impression is that with this new model, the Q4_0 and Q5_0 quantizations do not work well. What I notice is that after the first sentence ends, the new sentence starts off in wild ways, often switching to German or some other language. This is something I have not observed with the original LLaMA 7B. It would also often start the second sentence without a capital letter, which again has never been the case before.

The QX_1, QX_K and Q8_0 quantizations do not seem to exhibit this behaviour (or if they do, it is to a much smaller extent).

Note, the examples below are not related to the changes in this PR. The same behaviour is observed on master using the LLaMAv2 7B model.

Here are a few examples:
Old description below
This works for 70B LLaMA-v2:
```sh
python3 convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8
```
Currently, only CPU.
This is very quick and dirty - just to see what changes are necessary.
Good news is the convert.py script does not require changes.

To add support for GPU and BLAS, fix the following TODOs:

llama.cpp/ggml.c, lines 10729 to 10748 in 2d2bb6b
I.e., implement the mul_mat broadcast logic from here: llama.cpp/ggml.c, lines 10832 to 10850 in 2d2bb6b

Looking for contributions to make this cleaner. If not, I will implement it sometime in the next few days.
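For anyone looking at the TODOs above, here is a rough illustration (NumPy, purely conceptual; names and shapes are assumptions, not ggml's actual layout) of what the mul_mat broadcast amounts to for GQA: each K/V head is shared by n_head / n_head_kv query heads, so the K (and V) tensor is reused across the query-head groups during the per-head matrix multiplications.
```python
import numpy as np

# Illustrative GQA attention scores with broadcast K (not ggml's implementation).
n_head, n_head_kv, head_dim, n_tok = 64, 8, 128, 16
group = n_head // n_head_kv  # 8 query heads share each K/V head

q = np.random.randn(n_head,    n_tok, head_dim)
k = np.random.randn(n_head_kv, n_tok, head_dim)  # only 8 K heads stored in the KV cache

scores = np.empty((n_head, n_tok, n_tok))
for i in range(n_head):
    scores[i] = q[i] @ k[i // group].T  # broadcast: reuse one K head for a whole group
```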