
4bit version of gpt4all-alpaca-oa-codealpaca-Lora-13b? #1037

Closed
sussyboiiii opened this issue Apr 18, 2023 · 22 comments

Comments

@sussyboiiii

Hello,
to reduce my brain usage even more I thought it'd be nice to run an AI that is specifically trained to code and thus hopefully produces better code than language models trained for, e.g., natural language.

So I found this: https://huggingface.co/jordiclive/gpt4all-alpaca-oa-codealpaca-lora-13b

I of course wanted to try to run it, but there's a problem: there aren't any pytorch_model files, and no 4-bit variants are listed here: https://github.com/underlines/awesome-marketing-datascience/blob/master/awesome-ai.md

Thank you for your support!

@SlyEcho
Collaborator

SlyEcho commented Apr 18, 2023

llama.cpp can now load LoRA adapters. You need to convert the LoRA model to ggml using convert-lora-to-ggml.py, then load the original LLaMA 13B as the base model with your LoRA model on top of it when launching: ./main -m llama-13b.bin --lora lora-model.bin. Something like that.
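
A rough sketch of that workflow (paths and the prompt are placeholders, and script names may vary between llama.cpp revisions):

    # convert the LoRA adapter weights into a ggml adapter file
    python3 convert-lora-to-ggml.py /path/to/gpt4all-alpaca-oa-codealpaca-lora-13b

    # run the original LLaMA 13B model with the adapter applied on top
    ./main -m llama-13b.bin --lora lora-model.bin -p "Write a C function that reverses a string."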

@execveat

--lora partially addresses the question, but https://huggingface.co/jordiclive/gpt4all-alpaca-oa-codealpaca-lora-13b also mentions a few embeddings that are needed to support the custom tokens they use:

<|prompter|>What is a meme, and what's the history behind this word?</s><|assistant|>

Is there a way to work around this with existing llama.cpp options or would it require a PR?

@SlyEcho
Collaborator

SlyEcho commented Apr 18, 2023

You're right, --lora doesn't support extending the tokenizer yet. In that case the model should be saved as a .pth checkpoint and converted from that; llama.cpp itself can then load the tokens from the model file.

@NoNamedCat

Can anyone convert this so the model can be loaded? I'm particularly interested in using these models to write and work with code.

@SlyEcho
Collaborator

SlyEcho commented Apr 18, 2023

I created this script to merge the models: https://gist.github.com/SlyEcho/477554916bfc1a9e338240eee6396fbd

It creates a HF checkpoint that can be converted using convert.py to ggml f16 format and then later to q4_0 with quantize.

However, I'm not sure that the extra tokens are being used for tokenization.

EDIT:

It seems to work even with the text versions of <|prompter|> <|assistant|>...
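
For reference, once the merged model has been converted and quantized, prompting it with the plain-text tokens looks roughly like this (the model file name and prompt are only placeholders):

    ./main -m ggml-merged-13b-q4_0.bin --color -n 256 \
        -p "<|prompter|>Write a Python function that reverses a string.</s><|assistant|>"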

@sussyboiiii
Author

sussyboiiii commented Apr 18, 2023

So by converting the files with the ggml Python script we can use gpt4all-alpaca-oa-codealpaca-Lora-13b, but not as one file. But your script, @SlyEcho, can do that?

Edit:
For LLaMA I have only got consolidated.00.pth and consolidated.01.pth.

@SlyEcho
Collaborator

SlyEcho commented Apr 18, 2023

The script should download 13b from huggingface.co/decapoda-research/llama-13b-hf automatically.

I also tried the --lora adapter and it technically works, but the tokens don't work and it is slower.

@sussyboiiii
Author

Thank you,
your script worked and I now have the .bin shards of the 13B model merged. The thing to do now is to get it to f16 and then to 4-bit, but which convert.py script do you use? There are different ones.

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

convert.py from the master branch of this repo can handle HF format models now. You can specify the output format, but for some reason it didn't let me use q4_0, so I used f16 and then ran ./quantize on it to get it down to q4_0.
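
Roughly, with placeholder paths (the --outtype flag and default output name may differ slightly depending on your convert.py version, and older quantize builds take a number such as 2 instead of the name q4_0):

    # merged HF checkpoint -> ggml f16
    python3 convert.py /path/to/merged-13b --outtype f16

    # ggml f16 -> q4_0
    ./quantize /path/to/merged-13b/ggml-model-f16.bin /path/to/merged-13b/ggml-model-q4_0.bin q4_0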

@NoNamedCat

Can anyone upload the .bin file of this model for use with llama.cpp?

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

I could, but there is no point because it doesn't work well.

@NoNamedCat

Tks anyway :)

@sussyboiiii
Author

Where can I find this? I can only find the conversion scripts for gpt4all etc.
Thanks

> convert.py from the master branch of this repo can handle HF format models now. You can specify the output format, but for some reason it didn't let me use q4_0, so I used f16 and then I ran ./quantize on it to get it down to q4_0

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

This one: convert.py

edit: if you are seeing gpt4all conversion scripts, then you may need to do a git pull
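
In other words, something like:

    git pull               # update the checkout to current master
    python3 convert.py -h  # the unified converter should now be at the repo root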

@sussyboiiii
Author

sussyboiiii commented Apr 19, 2023

Thank you, don't know how I didn't see that.

@sussyboiiii
Author

sussyboiiii commented Apr 19, 2023

> I created this script to merge the models: https://gist.github.com/SlyEcho/477554916bfc1a9e338240eee6396fbd
>
> It creates a HF checkpoint that can be converted using convert.py to ggml f16 format and then later to q4_0 with quantize.
>
> However, I'm not sure that the extra tokens are being used for tokenization.
>
> EDIT:
>
> It seems to work even with the text versions of <|prompter|> <|assistant|>...

I have gotten a vocab size mismatch, how can I fix that?

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

You need to use the vocab files from the jordiclive/gpt4all-alpaca-oa-codealpaca-lora-13b repo. convert.py can also read vocab files from another directory, so you can point it to wherever the HF downloader wrote the files on your disk, or just download them.

But there are some weird things going on: there are embeddings for 16 new tokens in there, but the JSON only specifies 5. My script also cuts it down to 5, but you may want to hack on this because I don't understand how it's supposed to work.
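
Assuming your convert.py has the --vocab-dir option, that would look something like (paths are placeholders):

    python3 convert.py /path/to/merged-13b \
        --vocab-dir /path/to/gpt4all-alpaca-oa-codealpaca-lora-13b \
        --outtype f16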

@sussyboiiii
Author

I had forgotten to put added_tokens.json in the directory.
Thanks, it works now!

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

If you change main.cpp around line 173 to this, it should use the tokens for -ins mode:

    // prefix & suffix for instruct mode
    const auto inp_pfx = std::vector<llama_token> { 32002 }; // <|prompter|>
    const auto inp_sfx = std::vector<llama_token> { 32004 }; // <|assistant|>

Edit: I think the </s> or EOS token is not needed after all; it works better without it.
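
After that change, rebuilding and running in instruct mode would look roughly like this (the model file name is a placeholder):

    make
    ./main -m ggml-merged-13b-q4_0.bin -ins --color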

@sussyboiiii
Author

sussyboiiii commented Apr 19, 2023

The output I get is also a bit weird; it doesn't want to write code. It wanted me to visit a GitHub repo that doesn't exist.

@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

I can recommend other good models that are not LoRA:

These can be converted directly with convert.py and used with the instruct mode since they use the same Alpaca prompts.

@sussyboiiii
Author

I believe this has been answered!
