llama3 family support #6747

Closed
gulldan opened this issue Apr 18, 2024 · 79 comments
Labels
enhancement New feature or request model Model specific

Comments

@gulldan

gulldan commented Apr 18, 2024

llama3 released
would be happy to use with llama.cpp
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6

https://github.com/meta-llama/llama3

@gulldan gulldan added the enhancement New feature or request label Apr 18, 2024
@maziyarpanahi

As far as I can see, the architecture is the same as before. There might be some extra stuff to optimize further. https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF

@johnrachwan123

@maziyarpanahi I am getting a "tokenizer.model" not found error. How did you resolve this?

@MoonRide303

Calling convert-hf-to-gguf.py ends up with FileNotFoundError: Cannot find Llama BPE tokenizer. Trying to use tokenizer.model from original folder in https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repo results in RuntimeError: Internal: could not parse ModelProto from tokenizer.model.

@maziyarpanahi

I have the latest pulled and built from a few hours ago. I am getting worried now with all these failed converts! I tested the quants, they work though.

@brandon-e2e

brandon-e2e commented Apr 18, 2024

@maziyarpanahi Have you confirmed that instruct mode works? That's where I'm seeing issues (possibly user error).

Edit: Never mind, figured out the chat template.

@thecivilizedgamer

I tried using the quantized instruct model from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf, but when I try using it (specifying llama2 chat template) I get odd results, which seem like an issue with the chat template:

Hi! How can I assist you today? [/INST] <>

Wait, no! You're the one who is supposed to respond. I made a mistake! [/INST] <> You are the one who is supposed to respond now. [/INST] <>
I think there was a miscommunication! Can you please clarify what you need help with? [/INST] <> [/INST] I need to create a new Google account and set up my computer. Also, I've got some questions about the Google Home app. [/INST]

@MoonRide303

MoonRide303 commented Apr 18, 2024

@thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053.

@thecivilizedgamer

> @thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053.

Ooooh thank you, I was looking but didn't see any info about that

@thecivilizedgamer

@MoonRide303 sorry to bug you, but do you know how to specify the new template format? I assume that eventually it will be added to llama.cpp as one of the predefined templates along with llama2 and chatml, but I'm not sure how to specify it in the meantime.

@jxy
Contributor

jxy commented Apr 18, 2024

In their code, the chat format is here: https://github.com/meta-llama/llama3/blob/299bfd8212fec65698c2f8c7b5970cbbb74c2a4f/llama/tokenizer.py#L202

@arch-btw
Contributor

python convert.py Meta-Llama-3-8B-Instruct --outtype f16 --vocab-type bpe

Need to add --vocab-type bpe as mentioned here: #6745 (comment)
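
For anyone hitting the tokenizer errors reported earlier in the thread, here is a minimal Python sketch of the underlying check. It assumes the Hugging Face Llama 3 folder ships a `tokenizer.json` but no SentencePiece `tokenizer.model` at its top level. `guess_vocab_type` is a hypothetical helper, not part of llama.cpp; it only suggests which `--vocab-type` value to try.

```python
from pathlib import Path


def guess_vocab_type(model_dir):
    """Hypothetical helper: guess which tokenizer files a checkpoint folder ships."""
    model_dir = Path(model_dir)
    if (model_dir / "tokenizer.model").is_file():
        return "spm"  # a SentencePiece model file is present
    if (model_dir / "tokenizer.json").is_file():
        return "bpe"  # only the Hugging Face tokenizer JSON is present
    return None       # no known tokenizer file found


if __name__ == "__main__":
    # The folder name is the one used in the convert.py command above.
    print(guess_vocab_type("Meta-Llama-3-8B-Instruct"))
```

If this reports "bpe", passing `--vocab-type bpe` to convert.py as shown above is the thing to try.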

@ddh0
Contributor

ddh0 commented Apr 18, 2024

Just pulled latest from master. When trying to convert from HF/safetensors to GGUF using convert-hf-to-gguf.py I get:

Loading model: Meta-Llama-3-8B-Instruct
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 8192
gguf: embedding length = 4096
gguf: feed forward length = 14336
gguf: head count = 32
gguf: key-value head count = 8
gguf: rope theta = 500000.0
gguf: rms norm epsilon = 1e-05
gguf: file type = 1
Set model tokenizer
Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1302, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 330, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Meta-Llama-3-8B-Instruct/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2728, in <module>
    main()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2715, in main
    model_instance.set_vocab()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1304, in set_vocab
    self._set_vocab_llama_hf()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/llama.cpp/convert.py", line 532, in __init__
    raise FileNotFoundError('Cannot find Llama BPE tokenizer')
FileNotFoundError: Cannot find Llama BPE tokenizer

When trying to convert from HF/safetensors to GGUF using convert.py I get:

Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

Hopefully this can be useful as a reference. Thanks!

@Jipok

Jipok commented Apr 18, 2024

@thecivilizedgamer
./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

@thecivilizedgamer

Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode?

The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml?

@phymbert
Collaborator

> Do you know if there is a way to manually specify the chat format in server mode?

No, but you can use the infill endpoint as shown above in main.
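
A rough Python sketch of that workaround: send the server a prompt that is already formatted by hand. This is only a sketch under assumptions. It uses the plain `/completion` endpoint rather than the infill endpoint mentioned above, and it assumes the server listens on 127.0.0.1:8080, accepts the JSON fields `prompt`, `n_predict` and `stop`, and returns the generated text in a `content` field. `build_payload` and `complete` are hypothetical names.

```python
import json
import urllib.request

# Prompt from the ./main example earlier in the thread, with real newlines.
PROMPT = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Hi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_payload(prompt, n_predict=256):
    """Build the request body; generation should stop at the end-of-turn token."""
    return {"prompt": prompt, "n_predict": n_predict, "stop": ["<|eot_id|>"]}


def complete(prompt, url="http://127.0.0.1:8080/completion"):
    """POST the hand-formatted prompt to a running server (URL is an assumption)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Calling `complete(PROMPT)` against a running server should return the assistant's reply as plain text, if the assumptions above hold for your build.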

@DifferentialityDevelopment
Contributor

> Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode?
>
> The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml?

I feel you, man. I'm also using server mode and want to integrate llama3 with my application, so I want to get the new template up and running. From what it looks like, you can implement your own chat template and then just rebuild llama.cpp. Going to try that now.

@DifferentialityDevelopment
Contributor

This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it

image

@avioligo

Is there any reason not to add a chat template command line argument to ./server?
It can be a string like here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/c9231f629c54de150fe4cca99a98034f32fb589e/tokenizer_config.json#L2053

@DifferentialityDevelopment
Contributor

So I did a zig build with CUDA support (first time ever using zig, and wow, it's amazing).
My code changes seem to work really well and the model is responding coherently. Can anyone give me a brief primer on how I'd actually submit the changes as a pull request?

@mirek190

mirek190 commented Apr 18, 2024

template for llama 3 is

--in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "

also you have to add
--reverse-prompt "assistant"
as well, because otherwise it spams badly ;)

@Jipok

Jipok commented Apr 18, 2024

@mirek190
This is different from what is said in the meta-llama/llama3:

> The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Look at my version above, it seems more correct to me.
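
The format described in that quote can be sketched in a few lines of Python. This is a minimal illustration, not Meta's or llama.cpp's code: `format_llama3_prompt` is a hypothetical name, stripping the message content is an assumption, and the trailing assistant header is added so the model writes the next message.

```python
def format_llama3_prompt(messages):
    """Build a Llama 3 chat prompt from [{"role": ..., "content": ...}, ...]."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Header with the role, then the double newline required by the format.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # Message body (whitespace-stripped by assumption), closed by <|eot_id|>.
        parts.append(message["content"].strip() + "<|eot_id|>")
    # Open an assistant turn so that generation continues as the assistant.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(format_llama3_prompt([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi!"},
    ]))
```

With the system and user messages shown, the result is the same string as the `-p` argument in the ./main command above once `-e` has expanded the `\n` escapes.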

@mirek190

mirek190 commented Apr 18, 2024

> @mirek190 This is different from what is said in the meta-llama/llama3:
>
> > The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
>
> Look at my version above, it seems more correct to me.

According to

https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

My template is correct.

I ran some tests ...
This 8b model is insane.... almost at the level of wizard 2 8x22b, which has 140b parameters....wtf!

llamacpp

main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> " --reverse-prompt "assistant"

Seriously ! wtf

Today I own 3 cars but last year I sold 2 cars. How many cars do I own today?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break it down!

You currently own 3 cars.
Last year, you sold 2 cars.

To find the number of cars you owned before selling any, add the current number to the number of cars sold:
3 (current) + 2 (sold) = 5 cars

Since you've already sold those 2 cars, subtract them from the total:
5 - 2 = 3 cars

You still own the same 3 cars that you currently own.

It even answered this question almost correctly! ... it only forgot about 1 gold coin...

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break down the changes you've experienced step by step!

1. You start with 10 apples.
2. You find 3 gold coins in the river, which doesn't affect your apple count.
3. You lose 4 apples, leaving you with 6 apples (10 - 4 = 6).
4. The birds drop 6 apples each, so that's a total of 18 new apples (6 x 3 = 18). Adding these to your remaining apples gives you a total of 24 apples (6 + 18 = 24).
5. You win 6 gold coins in the game but have to share them equally with your teammates. That means you get 2 gold coins (6 ÷ 3 = 2).
6. You spend all the gold coins you have (including the original 3 and the new 2) on apples, since each apple costs 0.5 coins.

Let's calculate how many apples you can buy with your total gold coins:

You have a total of 5 gold coins (3 + 2). Since each apple costs 0.5 coins, you can buy:

5 coins ÷ 0.5 coins/apple = 10 apples

Adding these to the apples you already had, you now have a total of:

24 apples + 10 apples = 34 apples

As for where the river is located, it runs near a big city that allows you to spend the gold coins on something (apples). Since you found the gold coins in the river and then went to play an online game with your teammates, it's likely that the city is nearby or easily accessible.

You now have 34 apples!
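
For reference, the puzzle's arithmetic can be checked with a short Python sketch. It assumes, as the model did, that sharing 6 coins "equally with my 2 teammates" means a three-way split; `count_apples` is just an illustrative name.

```python
def count_apples():
    """Replay the puzzle step by step and return the final apple count."""
    apples = 10
    coins = 3                # found in the river
    apples -= 4              # lost 4 apples
    coins += 1               # gained a gold coin (the step the model skipped)
    apples += 3 * 6          # three birds drop 6 apples each
    coins += 6 // 3          # 6 coins split between me and 2 teammates
    apples += coins * 2      # an apple costs 0.5 coins, so 2 apples per coin
    return apples


if __name__ == "__main__":
    print(count_apples())
```

Counting the extra coin gives 6 coins and 12 purchased apples, so the total is 36 rather than 34.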

Look at this! Insane! Only GPT4 and OPUS can answer it!

Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Here are 10 sentences that end with the word "apple":

1. The farmer picked a juicy red apple.
2. She ate a crunchy snack of peanut butter and apple.
3. The teacher asked her student to bring a small green apple.
4. I love biting into a fresh, crisp autumn apple.
5. Grandma always kept a sweet, ripe Granny Smith apple.
6. The health-conscious person chose a healthy Gala apple.
7. Dad planted a row of seedless Red Delicious apple.
8. In the still life painting, there was a beautiful Golden apple.
9. Mom packed a lunchbox with a juicy Fuji apple and cheese.
10. At the farmer's market, I bought a big, ripe Braeburn apple.

Each sentence ends with the word "apple", as requested!

or this ...

What are the top 10 most famous non-human animal statues?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> What a unique question!

Here's a list of the top 10 most famous non-human animal statues, in no particular order:

1. **The Lion at the Louvre** (Paris, France): A majestic bronze lion statue guarding the entrance to the world-famous Louvre Museum.
2. **The Charging Bull** (New York City, USA): A iconic bronze sculpture of a bull, symbolizing strength and resilience, located in Lower Manhattan's Financial District.
3. **The Horses of Saint-Mark's Basilica** (Venice, Italy): Four magnificent bronze horses adorning the facade of Saint-Mark's Basilica, a masterpiece by Italian sculptor Barletta.
4. **The Lion of Belfort** (Belfort, France): A massive 15-meter-tall stone lion statue guarding the citadel in eastern France.
5. **The Elephant Gates** (Paris, France): Two colossal bronze elephants flanking the entrance to the Musée d'Orsay, a famous art museum.
6. **The Bronze Horseman** (St. Petersburg, Russia): A stunning equestrian statue of Peter the Great, located in Senate Square, symbolizing Russian history and culture.
7. **The Fountain of the Four Rivers** (Rome, Italy): A majestic bronze statue depicting four rivers (Nile, Ganges, Danube, and Rio Plata) being personified as women, located in the Piazza Navona.
8. **The Lion of Glasgow** (Glasgow, Scotland): A grand stone lion statue standing guard outside the City Chambers building in Glasgow's George Square.
9. **The Horse Guards** (London, UK): Two imposing bronze horse statues situated at the entrance to Buckingham Palace, symbolizing British military prowess.
10. **The Sea Lions of the Golden Gate Bridge** (San Francisco, USA): Two iconic bronze sea lion statues guarding the Golden Gate Bridge, a popular tourist attraction.

insane ...

If I lay 2 wet shirts out in the sun flat on the ground to dry and it takes 5 hours until they are dry, how long would it take to dry 4 shirts of the same type that way simultanously?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> A great question about drying shirts!

The time it takes for the shirts to dry is influenced by factors like air circulation, temperature, and humidity. Since you're laying them flat on the ground in the sun, we can assume that these conditions are similar for all four shirts.

Since it took 5 hours to dry 2 shirts simultaneously, it's likely that the drying process is not affected by the number of shirts being dried. This means that the same amount of time would be required to dry 4 shirts as well!

So, if it takes 5 hours to dry 2 shirts, it would still take approximately 5 hours to dry all 4 shirts simultaneously

There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> The classic lateral thinking puzzle!

Let's break it down: "two ducks in front of a duck" means that one duck is standing in front of another.

"Two ducks behind a duck" implies that this same duck is being followed by two more ducks.

And, of course, there's the "duck in the middle".

So, if we combine all these clues, we can conclude that there are... 3 ducks!

THAT llama3 8b is INSANE.
Better than anything I have seen, even better than wizard 2 8x22b! (the apple sentence question, etc.)
That is a huge leap into another dimension for small-model performance ... I just can't believe such a small model (8b!) can be so smart and have such high reasoning capability.
I am afraid of testing the 70b version ;)

@EliEron

EliEron commented Apr 18, 2024

> > @mirek190 This is different from what is said in the meta-llama/llama3:
> >
> > > The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
> >
> > Look at my version above, it seems more correct to me.
>
> According to
>
> https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
>
> My template is correct.

Your template does not include the double newlines between the header tokens and the message, which are required according to the page you linked. That's the main difference between your template and Jipok's.

@ddh0
Contributor

ddh0 commented Apr 18, 2024

> template for llama 3 is
>
> --in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "
>
> also you have to add --reverse-prompt "assistant" as well as is spamming badly ;)

This can't be right, because now the model is not allowed to say the word "assistant".....
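
To illustrate the problem, here is a simplified Python model of a stop string: generation is cut at the first place the string appears. This is only a sketch of the idea, not how llama.cpp implements reverse prompts, and `truncate_at_stop` is a hypothetical helper.

```python
def truncate_at_stop(text, stop_strings):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


reply = "I am your assistant for today. How can I help?<|eot_id|>"

# Stopping on the plain word cuts the reply in the middle of a sentence.
print(repr(truncate_at_stop(reply, ["assistant"])))
# Stopping on the end-of-turn token keeps the whole message.
print(repr(truncate_at_stop(reply, ["<|eot_id|>"])))
```

Using the special token `<|eot_id|>` as the stop string avoids colliding with ordinary words in the reply.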

@mirek190

mirek190 commented Apr 18, 2024

so, how should it look for llamacpp?
Because for me it works fine.

#6747 (comment)

ps you are right, this one is better.
I changed it a bit and it works perfectly now!

main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere are ten sentences that end with the word "apple":

1. I love to eat a juicy red apple.
2. The farmer carefully picked the ripe green apple.
3. She carefully polished her shiny silver apple.
4. He tossed the football high into the crisp autumn air apple.
5. My friend from France brought me a delicious French apple.
6. You can't go wrong with a classic Granny Smith apple.
7. I'm craving a crunchy sweet tart juicy red apple.
8. The little girl held her favorite stuffed cuddly soft teddy bear apple.
9. We had a picnic under the shade of a tall old oak tree apple.
10. The new employee brought in fresh healthy snacks like crunchy granola bars and a crisp Granny Smith apple.

### Output:

I hope this meets your request!

that 8b llama3 is insane!
Everything before llama3 is trash now...

@SamuelTallet
Contributor

> This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it
>
> image

@DifferentialityDevelopment According to Meta specifications:

> After a double newline \n\n the contents of the message follow.

You are missing the \n\n between "<|end_header_id|>" and << trim(message->content)

@SamuelTallet
Contributor

SamuelTallet commented Apr 18, 2024

> anyone that can give me a brief primer on how I'd actually submit the changes as a pull request?

@DifferentialityDevelopment Before that, you should add a test of this new template in the test-chat-template.cpp file.
There are details and a Python helper script in this Wiki page. By the way, you should add this new template in the Wiki.

To submit the changes: fork llama.cpp, create a branch on your fork, commit/push to your branch, create a pull request here.

@DifferentialityDevelopment
Contributor

DifferentialityDevelopment commented Apr 18, 2024

> > This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it
> > image
>
> @DifferentialityDevelopment According to Meta specifications:
>
> > After a double newline \n\n the contents of the message follow.
>
> You are missing the \n\n between "<|end_header_id|>" and << trim(message->content)

I've added the missing newlines after end_header_id, thanks for spotting this. I also added the two lines required in test-chat-template.cpp.

Pull request is here: #6751

@MoonRide303

@DifferentialityDevelopment But it would be nice if the Llama 3 chat template were supported natively in both main and server, especially taking into account the name of this project ;).

@DifferentialityDevelopment
Contributor

> @DifferentialityDevelopment But it would be nice if Llama 3 chat template would be supported natively in both main and server - especially taking into account name of this project ;).

Pretty sure the changes I made only affect the server, which is what I mainly use; I integrated llama.cpp into my C# applications using the server.
As for supporting it in main, I agree it is a bit of a pain that the prefix, suffix, etc. all have to be specified.
Possibly you could add a --llama3 argument that would work similarly to the --chatml argument, but I do wonder why chat templates aren't part of main. Main could be massively simplified with chat templates, as they do away with the prefix, suffix, reverse prompts, etc.

@egeres

egeres commented Apr 19, 2024

I pulled the latest changes and recompiled, but I get "main: error: unable to load model" (caused by "error loading model: unexpectedly reached end of file") when I run:

/mnt/c/Github/llama.cpp/build/bin/main --n_predict 256 --threads 4 --no-mmap --color --seed 1 --ignore-eos --prompt "a" --n-gpu-layers 99 --model /mnt/d/AI/Meta-Llama-3-8B-Instruct-Q8_0.gguf

I got the weights from @DifferentialityDevelopment in https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

Is anybody else having the same problem? I'm not sure if it's related to llama 3

Full output
Log start
main: build = 2699 (0e4802b2)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /mnt/d/AI/Meta-Llama-3-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct-imatrix
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct-imatrix
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.44 MiB
ggml_cuda_host_malloc: warning: failed to allocate 532.31 MiB of pinned memory: out of memory
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  5746.81 MiB
llm_load_tensors:      CUDA1 buffer size =  1858.52 MiB
llama_model_load: error loading model: unexpectedly reached end of file
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/d/AI/Meta-Llama-3-8B-Instruct-Q8_0.gguf'
main: error: unable to load model

@egeres

egeres commented Apr 19, 2024

Apparently the weights from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF load just fine!

@DifferentialityDevelopment
Contributor

I pulled the last changes and recompiled, but I get a main: error: unable to load model because of error loading model: unexpectedly reached end of file when I run /mnt/c/Github/llama.cpp/build/bin/main --n_predict 256 --threads 4 --no-mmap --color --seed 1 --ignore-eos --prompt "a" --n-gpu-layers 99 --model /mnt/d/AI/Meta-Llama-3-8B-Instruct-Q8_0.gguf, I got the weights from @DifferentialityDevelopment in https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

Is anybody else having the same problem? I'm not sure if it's related to llama 3

Full output

Sounds like corrupted download, that model works perfectly for me
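A truncated or corrupted download is one plausible cause of "unexpectedly reached end of file". A minimal sketch for checking it, assuming you compare the result against the SHA-256 shown on the file's Hugging Face page; the function names are hypothetical, not part of llama.cpp:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If the digests differ, re-downloading the file is probably the quickest next step.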

@l0d0v1c

l0d0v1c commented Apr 19, 2024

8b works for me but same loading error with 70B

@jin-eld

jin-eld commented Apr 20, 2024

I can confirm that https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf works fine with llama_cpp_python 0.2.62 and main/HEAD (a4b732c30b9ec798509430a1478e8c9700b46bc3) of text-generation-webui out of the box. I did not have to set the prompt format manually; it seems to have been picked up from the model.

I am not entirely sure which version of llama.cpp is being pulled in by the python wrapper though, but I guess it is recent enough as it works just fine.

Answers are almost too verbose for my taste, I guess this can be tuned via parameters, but the quality of the answers is really great so far! I run the model split over 3 MI25 GPUs and it's super fast too!

EDIT: it seems longer answers get cut off in the UI, I am not quite sure if it's related to a setting that I may be missing, to llama.cpp or to the text-generation-webui... What's the best way to figure out why this is happening?

@abasu0713

Hey there..
Really wanted to take this out for a spin on the new llama3 models..

I tried quantizing it.. But I get the following error. I am relatively new to GenAI. So any help would be appreciated:

Screenshot 2024-04-20 at 1 59 34 AM

@DifferentialityDevelopment
Contributor

Hey there.. Really wanted to take this out for a spin on the new llama3 models..

I tried quantizing it.. But I get the following error. I am relatively new to GenAI. So any help would be appreciated:

Screenshot 2024-04-20 at 1 59 34 AM

Check this pull request :)
#6745

@Dampfinchen

Dampfinchen commented Apr 20, 2024

Can llamacpp now effectively support Llama3?

Yes, it can. This issue should be closed.

I've heard lots of reports that IQ and imatrix quants are broken.

@jart
Contributor

jart commented Apr 20, 2024

On the llamafile project, we're using Mozilla-Ocho/llamafile@da4d780 as a workaround to the stop token issue. It fixes the issue with llama3 rambling on for 70B, but doesn't appear to work for 8B.

@jin-eld

jin-eld commented Apr 20, 2024

Did anyone already try https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main ? It says it was reuploaded with the new end token.

I am also a bit confused about the files, there are three files for Q8 - are they supposed to be concatenated after the download or will llama.cpp handle a model that is split over multiple files?

I am a bit hesitant to be the first one to try, because it'd take me a few days to download on my connection and will eat a huge portion of my monthly traffic...

@murp-2075

python convert.py ./models/Meta-Llama-3-8B
Loading model file models/Meta-Llama-3-8B/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('models/Meta-Llama-3-8B'))
Traceback (most recent call last):
  File "/Users/me/projects/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/Users/me/projects/llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/llama.cpp/convert.py", line 1402, in _create_vocab_by_path
    vocab = cls(self.path)
            ^^^^^^^^^^^^^^
  File "/Users/me/projects/llama.cpp/convert.py", line 462, in __init__
    self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/llama.cpp/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/Users/me/projects/llama.cpp/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/llama.cpp/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Anyone have any ideas why I'm getting this?

@reneleonhardt
Contributor

Anyone have any ideas why I'm getting this?

Did you checkout the proposed conversion support? 🙂 #6745
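One possible reading of the SentencePiece parse error above is that the Llama 3 tokenizer.model is not a SentencePiece protobuf at all. The sketch below assumes it is instead a tiktoken-style rank file, where each line holds a base64-encoded token and an integer rank. That assumption and the function name are illustrative only; this is not how convert.py detects the vocab type:

```python
import base64
import binascii


def looks_like_tiktoken_ranks(path, max_lines=5):
    """Heuristically check whether a file looks like '<base64 token> <rank>' lines."""
    try:
        with open(path, "rb") as fh:
            lines = [fh.readline() for _ in range(max_lines)]
    except OSError:
        return False
    # Drop blank lines and the empty reads returned past end of file.
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return False
    for ln in lines:
        parts = ln.split()
        if len(parts) != 2:
            return False
        token_b64, rank = parts
        if not rank.isdigit():
            return False
        try:
            base64.b64decode(token_b64, validate=True)
        except (binascii.Error, ValueError):
            return False
    return True
```

If this returns True for your tokenizer.model, the SentencePiece loader cannot parse it, which would be consistent with the advice in this thread to convert with --vocab-type bpe.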

@phymbert phymbert added the model Model specific label Apr 20, 2024
@stefanvarunix

Did anyone already try https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main ? It says it was reuploaded with the new end token.

I am also a bit confused about the files, there are three files for Q8 - are they supposed to be concatenated after the download or will llama.cpp handle a model that is split over multiple files?

I am a bit hesitant to be the first one to try, because it'd take me a few days to download on my connection and will eat a huge portion of my monthly traffic...

Yes, QuantFactory's re-upload works like a charm! (I tried 70B Q8.)

You just need to point llama.cpp (e.g. ./server) to the first file (e.g. Meta-Llama-3-70B-Instruct.Q8_0-00001-of-00003.gguf). It will load the others.
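Before starting a multi-day download or launching the server, it may help to confirm that every shard is present next to the first one. A minimal sketch, assuming the `-00001-of-00003.gguf` naming pattern from the example above; the helper names are hypothetical:

```python
import re
from pathlib import Path

# Matches names such as "Model.Q8_0-00001-of-00003.gguf".
SHARD_RE = re.compile(r"^(?P<stem>.*)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.gguf$")


def expected_shards(first_shard):
    """List the shard paths implied by the name of the first shard."""
    path = Path(first_shard)
    match = SHARD_RE.match(path.name)
    if match is None:
        # Not a split model: the file stands on its own.
        return [path]
    stem = match.group("stem")
    total = int(match.group("total"))
    return [
        path.with_name(f"{stem}-{i:05d}-of-{total:05d}.gguf")
        for i in range(1, total + 1)
    ]


def missing_shards(first_shard):
    """Return the expected shard paths that do not exist on disk."""
    return [p for p in expected_shards(first_shard) if not p.is_file()]
```

An empty list from `missing_shards(...)` means all shards are in place; it says nothing about whether each file downloaded completely.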

@phymbert
Collaborator

I think we are good here. Please reopen if needed.

@freddyouellette

Has anyone been able to successfully convert the 70B model using convert.py? I'm running:

python convert.py ~/dev/Meta-Llama-3-70B-Instruct/ --outtype f16 --vocab-type bpe

and getting:

Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00002-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00003-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00004-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00005-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00006-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00007-of-00030.safetensors
Loading model file /home/freddy/dev/Meta-Llama-3-70B-Instruct/model-00008-of-00030.safetensors
Traceback (most recent call last):
  File "/home/freddy/dev/llama.cpp/convert.py", line 1555, in <module>
    main()
  File "/home/freddy/dev/llama.cpp/convert.py", line 1487, in main
    model_plus = load_some_model(args.model)
  File "/home/freddy/dev/llama.cpp/convert.py", line 1376, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/home/freddy/dev/llama.cpp/convert.py", line 980, in lazy_load_file
    return lazy_load_safetensors_file(fp, path)
  File "/home/freddy/dev/llama.cpp/convert.py", line 959, in lazy_load_safetensors_file
    model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
  File "/home/freddy/dev/llama.cpp/convert.py", line 959, in <dictcomp>
    model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
  File "/home/freddy/dev/llama.cpp/convert.py", line 951, in convert
    assert 0 <= begin <= end <= len(byte_buf)
AssertionError

I have checked the file integrity with git lfs fsck, and it came back OK.

@ggerganov
Owner

It works for me:

$ ▶ python3 convert.py ~/Data/huggingface/Meta-Llama-3-70B-Instruct/ --outfile ./x.gguf --outtype f16 --vocab-type bpe
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00001-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00002-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00003-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00004-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00005-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00006-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00007-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00008-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00009-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00010-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00011-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00012-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00013-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00014-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00015-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00016-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00017-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00018-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00019-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00020-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00021-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00022-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00023-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00024-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00025-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00026-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00027-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00028-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00029-of-00030.safetensors
Loading model file /Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/model-00030-of-00030.safetensors
params = Params(n_vocab=128256, n_embd=8192, n_layer=80, n_ctx=8192, n_ff=28672, n_head=64, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct'))
Loaded vocab file PosixPath('/Users/ggerganov/Data/huggingface/Meta-Llama-3-70B-Instruct/tokenizer.json'), type 'bpe'
Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
Permuting layer 0
Permuting layer 1
Permuting layer 2
...

@freddyouellette

I figured it out, somehow my .safetensors files were modified. I fixed it with:

cd Meta-Llama-3-70B-Instruct/
git reset --hard
git lfs checkout

And then the conversion worked. Now I just have to find the RAM to run it locally 😅
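The failing assertion in convert.py compares each tensor's byte range against the data actually present in the file. The same check can be run on its own to find a damaged shard. This is a minimal sketch based on the safetensors layout (an 8-byte little-endian header length, a JSON header, then tensor data, with `data_offsets` relative to the start of the data section); `check_safetensors` is a hypothetical helper, not part of llama.cpp:

```python
import json
import os
import struct


def check_safetensors(path):
    """Return a list of problems found in a .safetensors file's header."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as fh:
        prefix = fh.read(8)
        if len(prefix) < 8:
            return ["file is shorter than the 8-byte length prefix"]
        (header_len,) = struct.unpack("<Q", prefix)
        if 8 + header_len > file_size:
            return [f"header length {header_len} exceeds file size {file_size}"]
        try:
            header = json.loads(fh.read(header_len))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            return [f"header is not valid JSON: {exc}"]

    problems = []
    data_size = file_size - 8 - header_len
    for name, info in header.items():
        if name == "__metadata__":
            continue
        begin, end = info["data_offsets"]
        # Same condition that convert.py asserts on.
        if not 0 <= begin <= end <= data_size:
            problems.append(f"{name}: offsets [{begin}, {end}] outside {data_size} data bytes")
    return problems
```

Running it over every `model-*.safetensors` file in the checkout should point at the shard that needs to be restored.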

@vincent0318

vincent0318 commented Apr 23, 2024

Does anyone also get this problem?
error: byte not found in vocab: ' ' segmentation fault

I tried to follow the step in https://huggingface.co/IlyaGusev/saiga_llama3_8b_gguf
wget https://raw.githubusercontent.com/IlyaGusev/rulm/master/self_instruct/src/interact_llama3_llamacpp.py
pip install llama-cpp-python fire
python3 interact_llama3_llamacpp.py model-q4_K.gguf
The code ran successfully the first time, but now I get this error on every run.

@tushartiwariofficial

Calling convert-hf-to-gguf.py ends up with FileNotFoundError: Cannot find Llama BPE tokenizer. Trying to use tokenizer.model from original folder in https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repo results in RuntimeError: Internal: could not parse ModelProto from tokenizer.model.

Did you find a solution to this error?

@mirekphd

the missing newlines after end_header_id,

Compared to Mistral's simplicity, Meta's prompt format is not just over-engineered to the point of illegibility, but it also eats into their rather small context window (8 times smaller than the new Mixtral's). But I suppose that's what's required when you are dealing with 15 trillion tokens... :)

@felrock
Contributor

felrock commented Apr 25, 2024

If you download the weights from Meta's website/download.sh, use this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

@tonandr

tonandr commented May 7, 2024

python convert.py Meta-Llama-3-8B-Instruct --outtype f16 --vocab-type bpe

Need to add --vocab-type bpe as mentioned here: #6745 (comment)

In my case, the conversion started working after I updated llama.cpp from an older version to the latest one (tag: b2797).

@binhmed2lab

binhmed2lab commented Jul 2, 2024

Does anyone know how to set up batched API calls correctly for Llama-3? When I run my preprocessing code through llama-cpp-python (which doesn't support batched inference) on my dataset, the accuracy is 40%. With the llama.cpp server, using the same preprocessing code and the same model, it is only 25%.

Command used to run the server: !./build/bin/llama-server -m Meta-Llama-3-8B-Instruct.Q4_0.gguf -ngl 9999 -c 0 --port 12345 -np 5 -cb --ctx-size 8092

Code to process and call API:

import json

import requests  # third-party HTTP client, used for the POST request below

def encode_header(message):
    tokens = f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
    return tokens

def encode_message(message):
    tokens = encode_header(message)
    tokens += message["content"].strip() + "<|eot_id|>"
    return tokens

def encode_dialog_prompt(dialog):
    tokens = "<|begin_of_text|>"
    for message in dialog:
        tokens += encode_message(message)
        
    tokens += encode_header({"role": "assistant", "content": ""})
    return tokens

dialog = [
    # Raw strings keep the LaTeX backslashes literal without invalid-escape warnings.
    {"role": "system", "content": r"You are going to write Python code to solve the math problem Evaluate $\log_{\sqrt8}(64\sqrt{8})$."},
    {"role": "user", "content": r"Evaluate $\log_{\sqrt8}(64\sqrt{8})$."}
]

batched_prompts = [encode_dialog_prompt(dialog) for _ in range(4)]

# Specify the URL
url = 'http://localhost:12345/completion'

# Define the headers
headers = {
    'Content-Type': 'application/json',
}
data = {
    "prompt": batched_prompts,
    "n_predict": 512,
}

json_data = json.dumps(data)
    
# Send the POST request
response = requests.post(url, headers=headers, data=json_data)

My code to set up llama-cpp-python:

from llama_cpp import Llama

llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_threads=2,
    n_gpu_layers=-1,
    verbose = False,
    chat_format="llama-3"
)

dialog = [
    # Raw strings keep the LaTeX backslashes literal without invalid-escape warnings.
    {"role": "system", "content": r"You are going to write Python code to solve the math problem Evaluate $\log_{\sqrt8}(64\sqrt{8})$."},
    {"role": "user", "content": r"Evaluate $\log_{\sqrt8}(64\sqrt{8})$."}
]

chat = llm.create_chat_completion(
      messages = dialog,
      temperature = 0.7,
      max_tokens = 512,
      top_p = 0.9,
      stop=["</s>"],
)
