Initial support for llama.cpp #447
Conversation
(Branch force-pushed from 392d9ca to 0e8dd1d.)
This is really promising.
I see that llama.cpp has added a C-style API, exciting stuff!
Yea. My bindings were based on my own C++ API (#77, which is now closed). Georgi decided that it was too much C++ and wanted a C-style API. I might migrate my Python bindings to the new API once it is merged in.
I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!
There's also this project that might be useful: https://github.com/PotatoSpudowski/fastLLaMa
Will it support https://github.com/AlpinDale/pygmalion.cpp?
His new API is quite a bit cleaner than my previous work, which was put together somewhat quickly. However, the new one is a bit more minimal, so I need some additional work around it to get it back to where it was before. I am currently blocked on a segfault that I can hopefully get to later today or tomorrow.
@thomasantony Will llama.cpp be placed in the "repositories" folder, similar to "GPTQ-for-LLaMa"? If so, that's great, since updating the web UI will then also update the llama.cpp repository.
(Branch force-pushed from 1c91078 to 6a0fff1.)
I like the code so far and appreciate that it adheres to the style/structure of the project.
(Branch force-pushed from 6a0fff1 to 7fa5d96.)
@thomasantony I have made some changes that made this functional for me. The main parameters are all used: temperature, top_k, top_p, and repetition_penalty. These were the steps to get it working:
After that it worked.
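As a rough illustration of where those settings end up, here is a minimal sketch of loading a model through the experimental llamacpp bindings with the sampling parameters applied up front. The `InferenceParams` class and its attribute names are assumptions and may not match the actual bindings; only `LlamaInference(params)` is quoted from later in this thread.

```python
# Hedged sketch (not from the PR): loading a ggml model through the llamacpp
# bindings with the sampling settings applied at construction time.
# InferenceParams and its attribute names are assumptions; LlamaInference(params)
# is the constructor mentioned later in this thread.
import llamacpp

params = llamacpp.InferenceParams()
params.path_model = "models/llamacpp-7B/ggml-model-q4_0.bin"  # hardcoded path, as in the PR
params.temp = 0.7             # temperature (assumed attribute name)
params.top_k = 40             # top-k sampling (assumed attribute name)
params.top_p = 0.95           # nucleus sampling (assumed attribute name)
params.repeat_penalty = 1.18  # repetition penalty (assumed attribute name)

model = llamacpp.LlamaInference(params)
```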
Thanks for the changes. I just released v0.1.11, which includes the new memory-mapped I/O feature and requires updating the weight files. It makes loading the models a whole lot faster (and may allow running models bigger than your RAM, but I have not tried that yet and may be wrong about it). The API should be consistent and work with the text UI without any changes.
Is it possible to find the new weights on Hugging Face somewhere?
You can use the updated "llamacpp-convert" script with the original LLaMA weights (PyTorch format) to generate the new ggml weights. Another option is to use the "migrate" script from https://github.com/ggerganov/llama.cpp, which can convert existing ggml weights into the new format.
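For illustration only, here is a hedged sketch of driving that conversion from Python. The `llamacpp-convert` and `llamacpp-quantize` command names come from this thread; the arguments and paths shown are assumptions and should be checked against the tools' own documentation.

```python
# Hedged sketch: invoking the conversion/quantization commands from Python.
# Command names are taken from this thread; the arguments and paths below are
# assumptions and may not match the real CLI, so verify them before use.
import subprocess

# Convert the original PyTorch LLaMA checkpoint to ggml format (assumed arguments)...
subprocess.run(["llamacpp-convert", "models/LLaMA-7B", "1"], check=True)

# ...then quantize it to 4-bit (q4_0) for use with the web UI (assumed arguments).
subprocess.run(
    [
        "llamacpp-quantize",
        "models/llamacpp-7B/ggml-model-f16.bin",
        "models/llamacpp-7B/ggml-model-q4_0.bin",
        "2",
    ],
    check=True,
)
```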
(Merge commit: merged text-generation-webui into thomasantony-feature/llamacpp.)
Would this support using and interacting with Alpaca and LLaMA models of all sizes?
@thomasantony I did the conversion from the base LLaMA files and that worked. This was the performance of llama-7b int4 on my i5-12400F:
Well, feel free to merge it! I am glad that I was able to contribute. :)
Thank you so much for this brilliant PR, @thomasantony! The new documentation is here: https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models
@thomasantony I have just noticed that the parameters are not really being used. Assigning to the parameter attributes does not seem to have any effect. Is the only way to change the model parameters to reload the model from scratch, like this? `_model = llamacpp.LlamaInference(params)`
@oobabooga That is a side effect of how the underlying Python bindings work right now. Adding support for changing those parameters when sampling from the logits is on my to-do list. Right now, it is only possible if you use the lower-level LlamaContext class instead of the higher-level LlamaInference, which currently does not allow changing the parameters after initialization. This is probably the next thing I will update in the library. I will post back here once that is done, or make a separate PR with the changes.
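Until that lands, the workaround implied above is to rebuild the inference object whenever the sampling settings change. A small sketch, reusing the assumed `InferenceParams` attributes from the earlier example:

```python
# Hedged sketch of the current workaround: LlamaInference does not expose the
# sampling parameters after construction, so rebuild it whenever they change.
# The attribute names passed via **settings (e.g. temp, top_k, top_p,
# repeat_penalty) are assumptions about the bindings, not confirmed API.
import llamacpp

def reload_model(path, **settings):
    params = llamacpp.InferenceParams()
    params.path_model = path
    for name, value in settings.items():
        setattr(params, name, value)
    return llamacpp.LlamaInference(params)

# Changing any sampling setting currently means paying the full model-load cost again.
_model = reload_model("models/llamacpp-7B/ggml-model-q4_0.bin",
                      temp=0.7, top_k=40, top_p=0.95, repeat_penalty=1.18)
```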
Is it possible to use both VRAM and RAM, like this? https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
@niizam 4-bit quantized models are already supported. You just need to use the appropriate weight files.
My proof of concept for adding support for llama.cpp. It requires my experimental Python bindings (v0.1.9 and up). This has no dependencies (with some caveats mentioned below).

Only the `ggml-model-q4_0.bin` file inside `models/llamacpp-7B` is required right now (and its path is hardcoded). The bigger models should also work as long as the folder names start with `llamacpp-` or `alpaca-cpp-`.

The model files can be created from the PyTorch model using the `llamacpp-convert` and `llamacpp-quantize` commands that are installed along with the `llamacpp` package. Using these commands requires that `torch` and `sentencepiece` be installed as well.

There is currently no option to update parameters like top_p, top_k, etc. other than hardcoding them in `llamacpp_model.py`. This is on my todo list of things to fix.
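For readers who want a picture of what a `llamacpp_model.py`-style wrapper might look like, here is a rough, hedged sketch. This is not the PR's actual code: `InferenceParams`, the attribute names, and the generation methods (`update_input`, `ingest_all_pending`, `eval`, `sample`, `token_to_str`) are assumptions about the experimental bindings; only `LlamaInference(params)` is quoted from this thread.

```python
# Hedged sketch of a minimal llamacpp_model.py-style wrapper. NOT the PR's
# actual code: InferenceParams, the attribute names, and the generation methods
# used below are assumptions about the experimental bindings.
import llamacpp

class LlamaCppModel:
    @classmethod
    def from_pretrained(cls, path="models/llamacpp-7B/ggml-model-q4_0.bin"):
        params = llamacpp.InferenceParams()
        params.path_model = str(path)  # assumed attribute name
        result = cls()
        result.model = llamacpp.LlamaInference(params)
        return result

    def generate(self, prompt, max_tokens=200):
        # Feed the prompt, then sample tokens one at a time (assumed method names).
        self.model.update_input(prompt)
        self.model.ingest_all_pending()
        output = []
        for _ in range(max_tokens):
            self.model.eval()
            token = self.model.sample()
            output.append(self.model.token_to_str(token))
        return "".join(output)

# Usage (hypothetical):
# model = LlamaCppModel.from_pretrained()
# print(model.generate("Hello, my name is"))
```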