Add llama 2 model #2262

Closed
tikikun opened this issue Jul 18, 2023 · 95 comments
Labels
model Model specific 🦙. llama

Comments

@tikikun
Contributor

tikikun commented Jul 18, 2023

Meta just released the Llama 2 model, which allows commercial use.

https://ai.meta.com/resources/models-and-libraries/llama/

I have checked the model implementation and it seems different from llama_v1, so it may need a re-implementation.

@Green-Sky
Collaborator

link to paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Interesting to note that the model evaluation section in their paper lists a 34b model even though the site doesn't talk about it. I wonder if it'll be available.

Does anyone have access to the models yet? I signed up but haven't received an e-mail. It's not super clear to me if it's meant to be instant or not.

@Green-Sky
Collaborator

Green-Sky commented Jul 18, 2023

Interestingly, the paper talks about a 34B model, which is missing from the model card.
edit: @Azeirah was faster lol

@slaren
Collaborator

slaren commented Jul 18, 2023

The paper implies that they are planning to release the 34B model later.
image

@Green-Sky
Collaborator

@Azeirah no, i did not hear back yet either.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.

also, they are available on hf if your email is the same https://huggingface.co/meta-llama

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

I was really hopeful for an alternative to gpt-4 for coding assistance, but the evaluation states their 70B model is about equivalent in performance to gpt-3.5.

Not bad, but the jump in quality from 3.5 to 4 has been what made it really useful in day-to-day coding tasks. ;(

Screenshot 2023-07-18 at 19 05 26

At the very least, it does look like the 7B and 13B variants will be amazing local chatbots for low perf devices.

@dmadisetti

I just got access, but the download is flaky: checksums are not matching and the auth is hit or miss.
Notable are the chat-specific models:

https://github.com/facebookresearch/llama/blob/main/download.sh#L24C1-L43C7

Will update if I am actually able to download these weights

@goranmoomin

The updated model code for Llama 2 is at the same facebookresearch/llama repo, diff here: meta-llama/llama@6d4c0c2

Code-wise, the only difference seems to be the addition of GQA (grouped-query attention) on the large models, i.e. the repeat_kv part that reuses the same k/v attention heads across several query heads so that the k/v cache needs less memory.

According to the paper, the smaller models (i.e. the 7b/13b ones) don't have GQA, so in theory they should run unmodified.
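To make the GQA point concrete, here is a minimal sketch in plain Python (not the actual Meta or llama.cpp code) of what a repeat_kv step does and why it shrinks the k/v cache. The function name mirrors the one mentioned above; the list-based data layout and the kv_cache_bytes helper are assumptions for illustration. The figures used (80 layers, 64 query heads, 8 k/v heads, head size 128) are the Llama 2 70B hyperparameters.

```python
def repeat_kv(kv_heads, n_rep):
    """Repeat each k/v head n_rep times so it lines up with the query heads.

    kv_heads is a list with one entry per k/v head; any object can stand in
    for that head's tensor in this sketch.
    """
    if n_rep == 1:
        return list(kv_heads)
    return [head for head in kv_heads for _ in range(n_rep)]


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_value=2):
    """Cache size: one K and one V vector per layer, position and k/v head."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


# 8 k/v heads shared by 64 query heads: each k/v head is reused 8 times.
n_heads, n_kv_heads = 64, 8
expanded = repeat_kv([f"kv{i}" for i in range(n_kv_heads)], n_heads // n_kv_heads)
print(len(expanded))   # number of heads after expansion
print(expanded[:9])    # the first head repeated 8 times, then the second one

# Cache size at a 4096-token context with 16-bit values, with and without GQA.
with_gqa = kv_cache_bytes(80, 4096, n_kv_heads, 128)
without_gqa = kv_cache_bytes(80, 4096, n_heads, 128)
print(with_gqa / 2**30, without_gqa / 2**30)  # sizes in GiB
```

Only the 8 real k/v heads are stored in the cache; the repetition happens at attention time, which is where the memory saving comes from.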

@dmadisetti

Email below with tracking links stripped. Same as llama-1 for the most part. Now if it would actually download.....


You’re all set to start building with Llama 2.

The models listed below are now available to you as a commercial license holder. By downloading a model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta’s privacy policy.

Model weights available:

Llama-2-7b
Llama-2-7b-chat
Llama-2-13b
Llama-2-13b-chat
Llama-2-70b
Llama-2-70b-chat

With each model download, you’ll receive a copy of the Llama 2 Community License and Acceptable Use Policy, and can find all other information on the model and code on GitHub.

How to download the models:

Visit GitHub and clone [the Llama repository](https://github.com/facebookresearch/llama) from there in order to download the model code
Run the download.sh script and follow the prompts for downloading the models.
When asked for your unique custom URL, please insert the following:
<redacted for legal reasons>
Select which model weights to download

The unique custom URL provided will remain valid for model downloads for 24 hours, and requests can be submitted multiple times.
Now you’re ready to start building with Llama 2.

Helpful tips:
Please read the instructions in the GitHub repo and use the provided code examples to understand how to best interact with the models. In particular, for the fine-tuned chat models you must use appropriate formatting and correct system/instruction tokens to get the best results from the model.

You can find additional information about how to responsibly deploy Llama models in our Responsible Use Guide.

If you need to report issues:
If you or any Llama 2 user becomes aware of any violation of our license or acceptable use policies - or any bug or issues with Llama 2 that could lead to any such violations - please report it through one of the following means:

Reporting issues with the model: Llama GitHub
Giving feedback about potentially problematic output generated by the model: [Llama output feedback](https://developers.facebook.com/llama_output_feedback)
Reporting bugs and security concerns: [Bug Bounty Program](https://facebook.com/whitehat/info)
Reporting violations of the Acceptable Use Policy: [LlamaUseReport@meta.com](mailto:LlamaUseReport@meta.com)

Subscribe to get the latest updates on Llama and Meta AI.

Meta’s GenAI Team

@swyxio

swyxio commented Jul 18, 2023

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

@BetaDoggo

I tried the 7B and it seems to be working fine, with cuda acceleration as well.

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

I genuinely just think their servers are a bit overloaded given what I see posted here. It's a big release

@trrahul

trrahul commented Jul 18, 2023

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

Thebloke is a wizard O_O

@Johnhersh

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

These worked as-is for me

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

@Johnhersh

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

image

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

@Johnhersh

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

@LoganDark
Contributor

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

q4_0 should be even faster for only slightly less accuracy

@Green-Sky
Collaborator

iirc q4_1 has an outdated perf/size tradeoff, use one of the kquants instead. (or q4_0)
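For context on the q4_0 versus q4_1 trade-off discussed here, the following is a simplified sketch, assuming the ggmlv3-era layout: blocks of 32 weights stored at 4 bits each, with one 16-bit scale per block for q4_0 and a 16-bit scale plus a 16-bit minimum for q4_1. The function names are made up for illustration, and the rounding is deliberately simplified; it does not reproduce ggml's exact quantizer.

```python
BLOCK = 32  # weights per quantization block (assumed ggmlv3-era block size)


def bits_per_weight(extra_fp16_fields):
    """4-bit values plus the per-block fp16 metadata, averaged over the block."""
    block_bytes = BLOCK // 2 + 2 * extra_fp16_fields
    return block_bytes * 8 / BLOCK


def quantize_scale_only(block):
    """q4_0-style: w is approximated by d * q with q in [-8, 7]."""
    amax = max(abs(w) for w in block)
    d = amax / 7 if amax else 1.0
    return d, [max(-8, min(7, round(w / d))) for w in block]


def quantize_scale_and_min(block):
    """q4_1-style: w is approximated by d * q + m with q in [0, 15]."""
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15 if hi > lo else 1.0
    return d, lo, [max(0, min(15, round((w - lo) / d))) for w in block]


print(bits_per_weight(1))  # q4_0-style: scale only
print(bits_per_weight(2))  # q4_1-style: scale plus minimum
```

The extra per-block minimum is what makes the q4_1-style format larger and a little slower to decode, in exchange for a tighter fit when a block's values are not centered on zero.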

@nullhook

nullhook commented Jul 18, 2023

image

inferencing with q4_1 on M1 Max (64GB)

2.99 ms per token is slow

@LoganDark
Contributor

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

huh nevermind

image

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

@Johnhersh

huh nevermind

image

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

How do you offload the layers?

@SlyEcho
Collaborator

SlyEcho commented Jul 19, 2023

I was using @TheBloke's quantized 7B model.

I just passed -c 4096 with no scaling, plus a big file (>3000 tokens) via -f, and it generated coherent text.

@ggerganov
Owner

I think I have a 70B prototype here: #2276

Needs some more work and not 100% sure it is correct, but text generation looks coherent.

@wizzard0
Contributor

Note #2276 breaks non-GQA models:

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  4096 x   512, got  4096 x  4096
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-7b.ggmlv3.q2_K.bin'
main: error: unable to load model
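The shapes in that error are consistent with the loader assuming grouped-query attention for a model that does not use it. A rough sketch of the arithmetic, assuming the K projection maps n_embd inputs to n_head_kv * head_dim outputs (the function and argument names are illustrative, not llama.cpp's API):

```python
def wk_shape(n_embd, n_head, n_gqa):
    """Shape of the K projection when n_gqa query heads share one k/v head."""
    head_dim = n_embd // n_head
    n_head_kv = n_head // n_gqa
    return (n_embd, n_head_kv * head_dim)


# Llama 2 7B: n_embd = 4096 and 32 heads, with no GQA (grouping factor 1).
print(wk_shape(4096, 32, 1))  # shape stored in the 7B file
print(wk_shape(4096, 32, 8))  # shape expected if a grouping factor of 8 is assumed
```

With a grouping factor of 8 the expected width drops from 4096 to 512, which matches the "expected 4096 x 512, got 4096 x 4096" message.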

@TikkunCreation

TikkunCreation commented Jul 19, 2023

So the chat model uses something like

{BOS}[INST] <<SYS>>
{system}
<</SYS>>

{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]

The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.

For clarity, it uses <s> and </s> for BOS and EOS respectively (I checked with a python script using tokenizer.model)
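The template above can be sketched as a small string builder. This is only an illustration of the layout: the function name and argument names are made up, and BOS/EOS appear here as the literal text <s> and </s>, whereas in practice they have to be inserted as special token ids rather than tokenized from text.

```python
BOS, EOS = "<s>", "</s>"  # textual stand-ins for the special tokens


def build_llama2_chat_prompt(system, turns, next_instruction):
    """Lay out a chat in the Llama 2 chat template.

    turns is a list of (instruction, response) pairs already exchanged;
    next_instruction is the message the model should answer next.
    """
    instructions = [instruction for instruction, _ in turns] + [next_instruction]
    # The system prompt is folded into the first instruction only.
    instructions[0] = f"<<SYS>>\n{system}\n<</SYS>>\n\n{instructions[0]}"
    parts = []
    for (_, response), instruction in zip(turns, instructions):
        parts.append(f"{BOS}[INST] {instruction} [/INST] {response} {EOS}")
    # The final instruction is left open for the model to complete.
    parts.append(f"{BOS}[INST] {instructions[-1]} [/INST]")
    return "".join(parts)


print(build_llama2_chat_prompt("You are helpful.", [("Hi", "Hello!")], "How are you?"))
```

Each completed exchange is wrapped in its own BOS ... EOS pair, which is why being able to insert BOS after the first turn matters.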

@jxy
Contributor

jxy commented Jul 19, 2023

I made a simple change to main to add BOS.

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index bcbcf12..5906cde 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -605,6 +605,8 @@ int main(int argc, char ** argv) {
             // replace end of text token with newline token when in interactive mode
             if (id == llama_token_eos() && params.interactive && !params.instruct) {
                 id = llama_token_newline.front();
+                embd_inp.push_back(llama_token_bos());
+                is_interacting = true;
                 if (params.antiprompt.size() != 0) {
                     // tokenize and inject first reverse prompt
                     const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);

and run it like

./main -m "$MODEL" -c 4096 -n -1 --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_MESSAGE [/INST]"

I don't know if we want an argument like --insert-bos-after-eos to main.

Regarding <s> and </s>, main or server cannot encode those to BOS or EOS.

@SlyEcho
Collaborator

SlyEcho commented Jul 19, 2023

I think inp_pfx and inp_sfx should also be changed?

@XiongjieDai

Hi! I'm sorry, I'm new to GitHub. I tried to download Llama 2 but it's not working: the cmd window closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?

If you have Git Bash installed, you can run the .sh file from the Git Bash command line with: bash path/to/script.sh
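If md5sum is not available, the checksum step can also be done with Python's standard library. This is a minimal sketch, assuming the download left a checklist file in the usual md5sum format (one "<md5>  <filename>" pair per line) next to the weights; the function names and the path in the usage line are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte weights need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(checklist_path):
    """Check every '<md5>  <filename>' line; returns {filename: matches}."""
    base = Path(checklist_path).parent
    results = {}
    for line in Path(checklist_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        results[name] = md5_of_file(base / name) == expected.lower()
    return results
```

Usage would look like verify_checklist("llama-2-7b/checklist.chk"), with any False entry pointing at a file worth re-downloading.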

@jxy
Contributor

jxy commented Jul 20, 2023

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

@ziwang-com

ziwang-com commented Jul 20, 2023

[23-7-20] Global first release: architecture diagram of the llama2-map module library
https://github.com/ziwang-com/AGI-MAP

llama2_generation

@Green-Sky
Collaborator

Green-Sky commented Jul 20, 2023

@ziwang-com those are just callgraphs for the python code. I'm sorry, but the python code already is simple to read as is, we don't really need those images. (also imho they feel harder to read than the python code)

@sowa705

sowa705 commented Jul 20, 2023

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

Would it be possible to move them into the model file? That would solve the issue of different models having different prompt formats

@viniciusarruda

Is Meta's tokenizer identical to the llama_cpp tokenizer? I think it should be, but I'm having an issue while decoding/encoding.
This is also related to the chat completion format already mentioned above by @kharvd @jxy @TikkunCreation
You can see the issue in detail and also replicate it here. I'm comparing Meta's original tokenizer with a model from @TheBloke.

@jxy
Contributor

jxy commented Jul 21, 2023

for llama-2-chat, #2304

@jxy
Contributor

jxy commented Jul 21, 2023

and server, #2306

@ggerganov
Owner

70B support should be ready to merge in #2276

Btw, I did some tests with 7Bv2 and the generated texts from short prompts using Q4_0 and Q5_0 definitely feel weird. I wrote more about it in the PR description. Would be nice if other people confirm the observations.

@kurnevsky
Contributor

It doesn't work with the following input:

llama-cpp -c 4096 -gqa 8 -t 16 -m llama-2-70b.ggmlv3.q4_K_M.bin -p "### HUMAN:\na\n\n### RESPONSE:\nb\n\n### HUMAN:\nb\n\n### RESPONSE:"

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

@WiSaGaN

WiSaGaN commented Aug 18, 2023

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

It worked in the vanilla case for me, but I got a similar error when I ran the binary built with "make LLAMA_CLBLAST=1". "-gqa 8" was added in both cases.

@kurnevsky
Contributor

I actually do use LLAMA_CLBLAST, but tested without gpu offloading - didn't know it affects the execution somehow :)
And I got this error on the model from https://huggingface.co/TheBloke/Llama-2-70B-GGML

@Nyceane

Nyceane commented Sep 14, 2023

@kurnevsky I am having the same problem; were you able to fix it?

@cebtenzzre
Collaborator

cebtenzzre commented Sep 15, 2023

I am having the same problem; were you able to fix it?

See #3002. Known workarounds are to not use the OpenCL backend with LLaMA 2, or to not use k-quants (Q*_K).

@kleenkanteen

@tikikun What do you mean by adding the Llama 2 model when this repo is about the LLaMA model? Also, on the main page, why does it say "Supported models:" and then list a bunch of other LLMs when this repo is just about llama?

@ggerganov
Owner

LLaMA v2 and many other models are currently supported by llama.cpp.
See the status page for more info

@kleenkanteen

kleenkanteen commented Oct 18, 2023 via email

@ggerganov
Owner

No, llama.cpp can run inference for all model architectures listed in the status page. It started just with LLaMA v1, but since then there has been a lot of progress and it now supports a variety of models.
