llama3 family support #6747
Comments
As far as I can see, it seems to be the same as before from the architecture point of view. There might be some extra things to optimize further. https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF |
@maziyarpanahi I am getting a "tokenizer.model" not found error. How did you resolve this? |
Calling convert-hf-to-gguf.py ends up with |
I have the latest pulled and built from a few hours ago. I am getting worried now with all these failed converts! I tested the quants, they work though. |
@maziyarpanahi Have you confirmed that instruct mode works? That's where I'm seeing issues (possibly user error). Edit: Nevermind, figured out the chat template. |
I tried using the quantized instruct model from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf, but when I try using it (specifying llama2 chat template) I get odd results, which seem like an issue with the chat template:
|
@thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053. |
Ooooh thank you, I was looking but didn't see any info about that |
@MoonRide303 sorry to bug you, but do you know how to specify the new template format? I assume that eventually it will be added to llama.cpp as one of the predefined templates along with llama2 and chatml, but I'm not sure how to specify it in the meantime. |
In their code, the chat format is here: https://github.com/meta-llama/llama3/blob/299bfd8212fec65698c2f8c7b5970cbbb74c2a4f/llama/tokenizer.py#L202 |
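For reference, here is a minimal sketch (not from the linked code, just an illustration of the format documented by Meta) of how a Llama 3 chat prompt is assembled; the helper name, system message, and example messages are placeholders:

```python
# Sketch of the Llama 3 chat format: each message is wrapped in header tokens,
# the header is followed by a blank line, and every message ends with <|eot_id|>.
def render_llama3_prompt(messages):
    """messages: list of {"role": ..., "content": ...} dicts."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Leave an open assistant header so the model continues from here.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(render_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]))
```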
Need to add |
Just pulled latest from master. When trying to convert from HF/safetensors to GGUF using
When trying to convert from HF/safetensors to GGUF using
Hopefully this can be useful as a reference. Thanks! |
@thecivilizedgamer |
Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode? The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml? |
No, but you can use the |
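(Aside, not from the thread: until the new template is built in, one workaround in server mode is to render the prompt yourself and call the server's /completion endpoint directly, passing <|eot_id|> as a stop string. A rough sketch; the address, port, and system message are assumptions:)

```python
import requests

# Assumed local server address; adjust to however ./server was started.
SERVER = "http://127.0.0.1:8080"

# Hand-rendered Llama 3 prompt (same format as discussed above).
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

resp = requests.post(f"{SERVER}/completion", json={
    "prompt": prompt,
    "n_predict": 256,
    "stop": ["<|eot_id|>"],  # keep the model from running past its turn
})
resp.raise_for_status()
print(resp.json()["content"])
```

This sidesteps llama_chat_apply_template entirely, so it keeps working even before the Llama 3 template is added as a predefined one.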
I feel you, man. I'm also using server mode and want to integrate llama3 with my application, so I want to get the new template up and running. From what it looks like, you can implement your own chat template and then just rebuild llama.cpp. Going to try that now |
Is there any reason not to add a chat template command line argument to ./server? |
So did a zig build with cuda support (first time ever using zig and wow it's amazing) |
Template for llama 3 is:
--in-prefix " <|start_header_id|>user<|end_header_id|> "
--in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> "
-p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "
also you have to add |
@mirek190
Look at my version above, it seems more correct to me. |
According to https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ my template is correct. I made a test ... llamacpp:
main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> " --reverse-prompt "assistant"
Seriously! wtf
It even answered this question almost properly! ... forgot about 1 gold coin... |
Look at this! Insane! Only GPT4 and OPUS can answer it! |
or this ...
insane ...
THAT llama3 8b is INSANE. |
Your template does not include the double newlines between the header tokens and the message, which are required according to the page you linked. That's the main difference between your template and Jipok's. |
This can't be right, because now the model is not allowed to say the word "assistant"..... |
So, how should it look for llamacpp? PS: you are right, this one is better.
main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"
that 8b llama3 is insane ! |
@DifferentialityDevelopment According to Meta specifications:
You are missing the |
@DifferentialityDevelopment Before that, you should add a test of this new template in the test-chat-template.cpp file. To submit the changes: fork |
I've added the missing newlines after end_header_id, thanks for spotting this. I also added the two required lines in test-chat-template.cpp. Pull request is here: #6751 |
@DifferentialityDevelopment But it would be nice if the Llama 3 chat template were supported natively in both main and server - especially taking into account the name of this project ;). |
Pretty sure the changes I made just affect the server, which is what I mainly use. I integrated llama.cpp into my C# applications using the server. |
I pulled the latest changes and recompiled, but I get an error. Is anybody else having the same problem? I'm not sure if it's related to llama 3. Full output
|
Apparently the weights from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF load just fine! |
Sounds like a corrupted download; that model works perfectly for me |
8b works for me but same loading error with 70B |
I can confirm that https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf works fine with llama_cpp_python 0.2.62. I am not entirely sure which version of llama.cpp is being pulled in by the python wrapper, but I guess it is recent enough, as it works just fine. Answers are almost too verbose for my taste; I guess this can be tuned via parameters, but the quality of the answers is really great so far! I run the model split over 3 MI25 GPUs and it's super fast too! EDIT: it seems longer answers get cut off in the UI. I am not quite sure if it's related to a setting that I may be missing, to llama.cpp, or to the text-generation-webui... What's the best way to figure out why this is happening? |
Check this pull request :) |
I've heard lots of reports that IQ and imatrix quants are broken. |
On the llamafile project, we're using Mozilla-Ocho/llamafile@da4d780 as a workaround to the stop token issue. It fixes the issue with llama3 rambling on for 70B, but doesn't appear to work for 8B. |
Did anyone already try https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main ? It says it was reuploaded with the new end token. I am also a bit confused about the files, there are three files for Q8 - are they supposed to be concatenated after the download or will llama.cpp handle a model that is split over multiple files? I am a bit hesitant to be the first one to try, because it'd take me a few days to download on my connection and will eat a huge portion of my monthly traffic... |
python convert.py ./models/Meta-Llama-3-8B
Anyone have any ideas why I'm getting this? |
Did you check out the proposed conversion support? 🙂 #6745 |
Yes, QuantFactory's re-upload works like a charm! (I tried 70B Q8.) You just need to point llama.cpp (e.g. ./server) to the first file (e.g. Meta-Llama-3-70B-Instruct.Q8_0-00001-of-00003.gguf). It will load the others. |
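(For what it's worth, the same approach should work through llama-cpp-python, since it defers loading to llama.cpp; a sketch with assumed paths, not tested here:)

```python
from llama_cpp import Llama

# Assumed path to the first shard of the split download; the -00002- and
# -00003- parts are expected to sit next to it and be picked up automatically.
MODEL = "models/Meta-Llama-3-70B-Instruct.Q8_0-00001-of-00003.gguf"

llm = Llama(model_path=MODEL, n_ctx=8192, n_gpu_layers=-1)
out = llm(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    max_tokens=128,
    stop=["<|eot_id|>"],
)
print(out["choices"][0]["text"])
```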
I think we are good here, please reopen if any case remains |
Has anyone been able to successfully convert the 70B model using
and getting:
I have checked the file integrity with |
It works for me: $ ▶ python3 convert.py ~/Data/huggingface/Meta-Llama-3-70B-Instruct/ --outfile ./x.gguf --outtype f16 --vocab-type bpe
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00002-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00003-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00004-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00005-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00006-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00007-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00008-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00009-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00010-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00011-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00012-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00013-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00014-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00015-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00016-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00017-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00018-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00019-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00020-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00021-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00022-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00023-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00024-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00025-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00026-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00027-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00028-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00029-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00030-of-00030.safetensors
params = Params(n_vocab=128256, n_embd=8192, n_layer=80, n_ctx=8192, n_ff=28672, n_head=64, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct'))
Loaded vocab file PosixPath('/Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/tokenizer.json'), type 'bpe'
Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
Permuting layer 0
Permuting layer 1
Permuting layer 2
... |
I figured it out, somehow my
And then the conversion worked. Now I just have to find the RAM to run it locally 😅 |
Does anyone else get this problem? I tried to follow the steps in https://huggingface.co/IlyaGusev/saiga_llama3_8b_gguf |
Did you find a solution to this error? |
Compared to Mistral's simplicity, Meta's prompt format is not just over-engineered to the point of illegibility, but also eats up their rather small context window... (8 times smaller than new Mixtral's). But I suppose that's what's required when you are dealing with 15 trillion tokens... :) |
If you download the weights from Meta's website/download.sh, use this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py |
In my case, after updating from a previous version to the latest llama.cpp (tag: b2797), the conversion became valid. |
Anyone know how to set up batched API calls correctly for Llama-3? I use the same preprocessing code with llama-cpp-python (which doesn't support batched inference) on my dataset and the accuracy is 40%. However, with llama.cpp it is only 25%. I use the same model.
Cmd to run the server:
Code to process and call the API:
My code to set up llama-cpp-python: |
|
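(The commands and code above are omitted, so here is only a general illustration, not the poster's setup: driving the server's OpenAI-compatible chat endpoint with several concurrent requests, assuming a local server started with parallel slots, e.g. -np 4, and a build that already includes the Llama 3 chat template so it gets applied server-side:)

```python
import concurrent.futures
import requests

# Assumed local server address and port.
SERVER = "http://127.0.0.1:8080"

def ask(question):
    resp = requests.post(f"{SERVER}/v1/chat/completions", json={
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # deterministic answers for accuracy comparisons
        "max_tokens": 64,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = ["What is 2+2?", "Name the capital of France.", "Is water wet?"]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for q, a in zip(questions, pool.map(ask, questions)):
        print(q, "->", a.strip())
```

If the accuracy gap persists, comparing the exact prompt each path sends (the rendered chat template vs. the llama-cpp-python preprocessing) is usually the first thing to check.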
llama3 released
would be happy to use with llama.cpp
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3