
Initial support for llama.cpp #447

Merged: 13 commits merged into oobabooga:main from thomasantony:feature/llamacpp on Mar 31, 2023
Conversation

@thomasantony (Contributor) commented Mar 20, 2023

This is my proof of concept for adding support for llama.cpp. It requires my experimental Python bindings (v0.1.9 and up), which have no dependencies of their own (with some caveats mentioned below).

This is what is in models/llamacpp-7B right now

> ls -al  models/llamacpp-7B/
total 70746800
drwxr-xr-x@ 9 thomas  staff          288 Mar 19 17:59 .
drwxr-xr-x@ 9 thomas  staff          288 Mar  10 22:04 ..
-rw-r--r--@ 1 thomas  staff          100 Mar  10 22:04 checklist.chk
-rw-r--r--  1 thomas  staff          118 Mar 19 18:00 config.json
-rw-r--r--@ 1 thomas  staff  13476939516 Mar  10 22:35 consolidated.00.pth
-rw-r--r--  1 thomas  staff  13477682665 Mar 11 13:40 ggml-model-f16.bin
-rw-r--r--  1 thomas  staff   4212727273 Mar 12 18:39 ggml-model-q4_0.bin
-rw-r--r--  1 thomas  staff   5054995945 Mar 12 19:22 ggml-model-q4_1.bin
-rw-r--r--@ 1 thomas  staff          101 Mar  10 22:03 params.json
  • Only the ggml-model-q4_0.bin file is required right now (and its filename is hardcoded). The bigger models should also work as long as the folder names start with llamacpp- or alpaca-cpp-. A rough sketch of this detection/loading flow is shown after this list.

  • The model files can be created from the PyTorch model using the llamacpp-convert and llamacpp-quantize commands that are installed along with the llamacpp package. Using these commands requires that torch and sentencepiece also be installed.

  • There is currently no option to change parameters such as top_p and top_k other than hardcoding them in llamacpp_model.py. This is on my todo list of things to fix.
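
For illustration, here is a minimal sketch of the detection and loading flow described in the list above. The helper names are hypothetical rather than the PR's actual code; the only call taken from this thread is llamacpp.LlamaInference(params), and how the model path and sampling settings are carried in params depends on the bindings' real API.

```python
from pathlib import Path

import llamacpp  # experimental Python bindings, v0.1.9 and up


def is_llamacpp_model(model_name: str) -> bool:
    # Folders whose names start with these prefixes are treated as llama.cpp models.
    return model_name.startswith(("llamacpp-", "alpaca-cpp-"))


def load_llamacpp_model(model_name: str, params):
    # Only the 4-bit weights are looked for; the filename is hardcoded for now.
    weights = Path("models") / model_name / "ggml-model-q4_0.bin"
    if not weights.exists():
        raise FileNotFoundError(f"expected quantized weights at {weights}")
    # Sampling settings (top_p, top_k, ...) are currently hardcoded in
    # llamacpp_model.py; `params` is assumed to already reference `weights`.
    return llamacpp.LlamaInference(params)
```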

@oobabooga (Owner)

This is really promising.

@madmads11

I see that llama.cpp has added a C-style API, exciting stuff!

@thomasantony (Contributor, Author)

> I see that llama.cpp has added a C-style API, exciting stuff!

Yeah. My bindings were based on my own C++ API (#77, which is now closed). Georgi decided that it was too much C++ and wanted a C-style API. I might migrate my Python bindings to the new API once it is merged in.

@madmads11

> Yeah. My bindings were based on my own C++ API (#77, which is now closed). Georgi decided that it was too much C++ and wanted a C-style API. I might migrate my Python bindings to the new API once it is merged in.

I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!

@TheTerrasque (Contributor)

There's also this project that might be useful: https://github.com/PotatoSpudowski/fastLLaMa

@thomasantony (Contributor, Author)

> I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!

His new API is quite a bit cleaner than my previous work, which was put together rather quickly. However, the new one is a bit more minimal, so I need some additional work around it to get it back to where it was before. I am currently blocked on a segfault that I can hopefully get to later today or tomorrow.

@BadisG (Contributor) commented Mar 27, 2023

@thomasantony Will llama.cpp be placed in the "repositories" folder, similar to "GPTQ-for-LLaMa"? If so, that's great, since updating the web UI would then also update the llama.cpp repository.

thomasantony force-pushed the feature/llamacpp branch 2 times, most recently from 1c91078 to 6a0fff1, on March 29, 2023 20:28
thomasantony marked this pull request as ready for review on March 29, 2023 20:28
thomasantony changed the title from "Draft: Add support for llama.cpp" to "Initial support for llama.cpp" on Mar 29, 2023
@oobabooga (Owner) commented Mar 30, 2023

I like the code so far and appreciate that it adheres to the style/structure of the project.

@oobabooga (Owner) commented Mar 31, 2023

@thomasantony I have made some changes that made this functional for me. The main parameters are all used: temperature, top_k, top_p, and repetition_penalty.

These were the steps to get it working:

  1. Install version 0.1.10 of llamacpp: pip install llamacpp==0.1.10
  2. Create the folder models/llamacpp-7b
  3. Put this file in it: ggml-model-q4_0.bin
  4. Start the web UI with python server.py --model llamacpp-7b

After that it worked.

@thomasantony (Contributor, Author) commented Mar 31, 2023

Thanks for the changes. I just released v0.1.11. It includes the new memory-mapped I/O feature and requires updating the weight files, but it makes loading the models a whole lot faster (and may allow running models bigger than your RAM, though I have not tried that yet and may be wrong about it).

The API should be consistent and work with the web UI without any changes.

@oobabooga (Owner)

Is it possible to find the new weights on Hugging Face somewhere?

@thomasantony (Contributor, Author) commented Mar 31, 2023

You can use the updated "llamacpp-convert" script with the original LLaMA weights (PyTorch format) to generate the new ggml weights. Another option is to use the "migrate" script from https://github.com/ggerganov/llama.cpp, which can convert existing "ggml" weights into the new format.

@madmads11

Would this support using and interacting with alpaca and llama models of all sizes?

@oobabooga (Owner)

@thomasantony I did the conversion from the base LLaMA files and that worked.

This was the performance of llama-7b int4 on my i5-12400F:

Output generated in 44.10 seconds (4.53 tokens/s, 200 tokens)

@thomasantony (Contributor, Author) commented Mar 31, 2023

Well, feel free to merge it! I am glad that I was able to contribute. :)

@oobabooga (Owner)

Thank you so much for this brilliant PR, @thomasantony!

The new documentation is here: https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models

oobabooga merged commit 6fd70d0 into oobabooga:main on Mar 31, 2023
thomasantony deleted the feature/llamacpp branch on March 31, 2023 18:30
@oobabooga (Owner) commented Mar 31, 2023

@thomasantony I have just noticed that the parameters are not really being used. Assigning to the params variable here doesn't change the parameters inside the model: https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_model.py#L43

Is the only way to change the model parameters to reload it from scratch like this?

_model = llamacpp.LlamaInference(params)

@thomasantony (Contributor, Author)

@oobabooga That is a side effect of how the underlying Python bindings work right now. Adding support for changing those parameters when sampling from the logits is on my to-do list. Right now, it is only possible with the lower-level LlamaContext class; the higher-level LlamaInference does not allow changing the parameters after initialization. This is probably the next thing I will update in the library. I will post back here once that is done, or open a separate PR with the changes.
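
Until then, here is a minimal sketch of the reload-from-scratch workaround discussed above. The wrapper class and the equality check on the params object are illustrative assumptions; only llamacpp.LlamaInference(params) is taken from this thread, and how params is built and compared depends on the bindings' real API.

```python
import llamacpp  # experimental Python bindings


class ReloadingLlama:
    """Rebuild the LlamaInference object whenever sampling parameters change.

    LlamaInference does not currently expose a way to change parameters after
    construction, so the model is reconstructed from scratch instead.
    """

    def __init__(self, params):
        self._params = params
        self._model = llamacpp.LlamaInference(params)

    def with_params(self, new_params):
        # Reload only when something actually changed; reloading is costly,
        # although v0.1.11's memory-mapped I/O makes it much faster.
        if new_params != self._params:
            self._params = new_params
            self._model = llamacpp.LlamaInference(new_params)
        return self._model
```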

@niizam commented Apr 1, 2023

Is it possible to use VRAM and RAM like this? https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model

@thomasantony (Contributor, Author)

@niizam 4-bit quantized models are already supported. You just need to use the appropriate weight files.
