VERY BIG performance improvement and beautiful features #521
DanielusG commented on May 29, 2023 (edited)
- Fixed an issue that made evaluation of the user's input prompt extremely slow; this brings a huge performance increase, roughly 5-6 times faster.
- Added a script to install CUDA-accelerated requirements
- Added the OpenAI model (it may go outside the scope of this repository, so I can remove it if necessary)
- Added some additional flags in the .env
- Changed the embeddings model to a better-performing one
- Bumped some versions, such as llama-cpp-python and langchain (WARNING: with the new version of llama.cpp, models are more powerful, but old models are now completely incompatible; if necessary, you can downgrade)
- Removed the state-of-the-union example file, because someone who does not notice it might leave it there and it would interfere with their queries
- Added auto translation (perhaps it should be removed, because it uses an Internet connection)
- Added OpenAI LLM
- Added GPU layer offload for llama.cpp
- This prevents the answer from being buried by the document references
- Added a script to install CUDA acceleration
This seems pretty good.
How does adding n_gpu_layers and use_mlock help performance?
# Print the relevant sources used for the answer
for document in docs:
    print("\n> " + document.metadata["source"] + ":")
    print(document.page_content)
Maybe add some nice colors? Also, using deep-translator==1.11.1 with an env flag if someone wants to translate the answer:

import os
from deep_translator import GoogleTranslator  # deep-translator==1.11.1, as suggested above

translate_src = os.environ.get('TRANSLATE_SRC_LANG', "en")
translate_dst = os.environ.get('TRANSLATE_DST_LANG', "fr")
# translate_ans would be an env-driven boolean flag defined elsewhere
for document in docs:
    print(f"\n\033[31m Source: {document.metadata['source']} \033[0m")
    if translate_ans:
        document.page_content = GoogleTranslator(source=translate_src, target=translate_dst).translate(document.page_content)
    print(f"\033[32m\033[2m : {document.page_content} \033[0m")
Yes, I was already thinking about it. I was going to implement it as soon as I had some free time :)
export LLAMA_CUBLAS=1
So in your case:
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
are not needed?
I think those flags are for Windows; this script works with Linux. I already tested it :)
Later I will check for Windows.
MODEL_PATH=/path/for/model
# best english embeddings model
# best italian: efederici/sentence-it5-base
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
I think that this one uses 768 dimensions and might not work with most models, like ggml-vic13b-q5_1.bin or koala-7B.ggmlv3.q8_0.bin.
I was also concerned about this, but it seems to work well in my tests. I am using it to pull formulas from my physics book while studying hahaha.
Sometimes it gets things wrong, but I'm not sure that depends on this.
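(A quick way to check the dimensionality concern, as a hedged sketch: it assumes sentence-transformers is installed, as it already is for the embeddings, and simply prints the output size of the configured embeddings model; any previously built vector store must have been created with the same dimensionality.)

from sentence_transformers import SentenceTransformer

# Print the embedding dimensionality of the model configured in .env;
# an index built with a different dimensionality would need re-ingesting.
model = SentenceTransformer("all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # 768 for all-mpnet-base-v2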
If the user has an Nvidia GPU, part of the model will be offloaded to the GPU, which speeds things up. mlock keeps the model locked in RAM so it is not paged out to disk, meaning fewer disk reads and better performance. I added them as options in the .env precisely to let the user choose whether they want these improvements or not.
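As a hedged illustration of that answer (the .env flag names here are made up for the example, and it assumes a langchain / llama-cpp-python build that exposes GPU offloading), the two options could be wired up roughly like this:

import os
from langchain.llms import LlamaCpp

# Hypothetical .env flag names, shown only to illustrate the idea.
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", 0))  # 0 = stay on CPU
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"

llm = LlamaCpp(
    model_path=os.environ.get("MODEL_PATH"),
    n_gpu_layers=n_gpu_layers,  # offload part of the model to the Nvidia GPU
    use_mlock=use_mlock,        # lock the model in RAM so it is not paged out to disk
    verbose=False,
)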
privateGPT.py (outdated diff)
case "GPT4All": | ||
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False) | ||
case "OpenAI": | ||
llm = OpenAI(model=model_path,openai_api_key=os.environ.get('OPENAI_API_KEY'),streaming=True,callbacks=callbacks, verbose=False) |
Maybe add a check that the key is present in the .env file?

if not os.environ.get('OPENAI_API_KEY'):
    print("Add your OPENAI_API_KEY to the .env file. Get your OpenAI API key from [here](https://platform.openai.com/account/api-keys).\n")
    exit(0)

Also, you'll have to check whether the key is valid by wrapping the Q/A section:

# AuthenticationError is provided by the openai package (e.g. from openai.error import AuthenticationError)
try:
    res = qa(query)
    answer, docs = res['result'], [] if args.hide_source else res['source_documents']
    # .....
    # rest of the code...
except AuthenticationError as e:
    print(f"Warning: An exception occurred. Your OPENAI_API_KEY is invalid: {e.error}")
Thanks for the contribution! Looking forward to merging this! Some comments/questions:
Thanks again!
I had also opened an issue about it: #493. For those who don't know, n_batch indicates how many tokens at a time are processed by the llama context. With it set so low, a prompt of 1000 tokens meant a loop of 1000/8 = 125 iterations, which is extremely slow and heavy. In fact, the default value used in the main llama.cpp repository is 512. After some experimentation, I found that 1024 seems to be a good value.
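For reference, a hedged sketch of the parameter being discussed (the values are the ones mentioned in this thread; the rest of the call is illustrative):

from langchain.llms import LlamaCpp

# With n_batch=8, a 1000-token prompt needs ~125 evaluation passes;
# with n_batch=1024 the whole prompt fits in a single batch.
llm = LlamaCpp(
    model_path="models/ggml-vic13b-q5_1.bin",  # illustrative path
    n_ctx=1000,
    n_batch=1024,  # previously 8; llama.cpp's own default is 512
    verbose=False,
)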
Yes, this is perfectly fine; in fact I even wrote it above. I simply left it in because I was getting poor performance with llama 13b and wanted to test whether GPT 3.5 was better. In the end I solved it by using vicuna 13b 1.1 q5_1.
Yes, I think so too. If you want, I can do it, but my English is not so good hahah
@imartinez since I understand that many people may not have sufficient computing power to run this locally, if you want I can create a new branch in my fork and leave OpenAI active. What do you think?
- The integration of OpenAI does not reflect the purpose of this project, so it will be removed
Added automatic translation to the prompt
Fix a typo
Update README.md
For reference, after running the requirements, I still had to install the following (on a clean environment):
The last resulted in: nvcc fatal : Value 'native' is not defined for option 'gpu-architecture', running an i5 / 32 GB RAM / Nvidia Titan 12 GB VRAM. nvcc --list-gpu-arch: compute_35
Did you use the bash script? If so, before starting the script you must execute:
Haven't tried this one, but I was able to run the original (privateGPT) without problems on my Mac M1 with 8 GB.
It appears that the way llama.cpp loads and processes the model on M1 processors is different from other processors, so privateGPT without this pull request should still work fine for you. Unfortunately I don't have a MacBook available to give you more information, sorry.
Any talk of running CUDA inside Docker? I heard somewhere that it's possible:
Once it works for you, you can
(changed the path to the dir with the models, obviously) https://docs.docker.com/config/containers/resource_constraints/
Hi @DanielusG, I'm interested if you are thinking of keeping an OpenAI branch to try out. It would need the readme updated on that branch to point out MODEL_TYPE=OpenAI and anything else. Thanks.
This is definitely possible. I've used the tech you mention to deploy instant-ngp in a restricted environment that ran an older OS. Performance was great, and forwarding the UI out of the container was also possible if you have a need for a GUI. Interestingly, it's also used by cog to streamline the deployment of ML models via docker containers. It attempts to make the packaging of the dependencies less of a headache. Not sure if that system would suit the needs of this repo, but it could be worth a look as well.
auto_translate = os.environ.get("AUTO_TRANSLATE")

def translate(text):
Personally I'd remove the translation feature from this PR so that the great core improvements can be reviewed separately, since this feature might be a bit controversial (requires internet, uses Google Translate, ...).
If you decide to leave it in, though, it would be awesome if you could mention it in the README and add AUTO_TRANSLATE to the example.env.
To be honest, all the changes I have made in this PR are changes I had to make to get privateGPT working for me, and I thought that, just as they are useful for me, they might be useful for someone else. In any case, the translation function, as useful as it is, totally goes against the purpose of this project, so yes, I will remove it.
I was thinking, however, of having the model translate the prompt locally.
For example, asking Vicuna: "If this text is not English, translate it into English." After Vicuna does the translation, you use Vicuna's response to execute the prompt. This would ensure privacy.
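A rough sketch of that idea (hedged: the prompt wording is illustrative, and llm/qa are assumed to be the objects already created in privateGPT.py):

def translate_to_english(llm, text: str) -> str:
    # Ask the local model to translate the query before retrieval, keeping everything offline.
    prompt = (
        "If the following text is not English, translate it into English. "
        "Reply with the translated text only.\n\n" + text
    )
    return llm(prompt).strip()

query = input("\nEnter a query: ")
res = qa(translate_to_english(llm, query))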
I'm not a maintainer, but I think it would be super helpful if you separated out all your changes and created separate PRs. That would make them easier to test/evaluate in isolation and speed up merging! For example, one PR just for performance improvements, one PR for translation, one PR for removing the example text, etc.
I am not yet good with GitHub; what would you suggest I do? Close this PR and open several with individual features?
@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository; each branch of your fork can have its own PR to the original repository. Creating branches is super easy: after cloning, navigate to the repository's directory with cd REPO-NAME, then create a new branch with git checkout -b BRANCH-NAME. Commit, push, create the PR.
OK clear, I will try to do something as soon as I finish work :)
Yeah, tried this fork yesterday... 5x slower on ingest and query -- looks like a lot of work for nada.
I'm sorry, but nothing has been changed in the ingest step, so if you strangely find that this fork is slower at ingest, you have probably misconfigured something :) It would also be useful to know which platform/drivers/hardware you ran this fork on; then your comment would be more useful... thanks :D
@DanielusG Hi! I've implemented OpenAI and KoboldAPI endpoints in the same way, plus some advanced translation features (Google Translate and OneRingTranslator, which supports 7 different translators, even offline). Feel free to use my code.
@janvarev Hi! Actually, the goal of my modifications was the performance increase; then I found myself adding those extra improvements like translation. I assume you implemented them better than I did, since I just coded them on the fly. Thank you for your work!
@DanielusG On my GPU I notice very fast loading of the LLM, like 1 second, but waiting for the database similarity search took more than 52 seconds (+ Google Translate), plus searching for the answer with k = 4, no MLOCK used, on an ingest of 104 pdf/epub files. The model I used is WizardLM-7B-uncensored.ggmlv3.q8_0.bin, and you should know my setup:
I think with your setup you could run the model without using llama.cpp, loading a 7b model directly onto your GPU. This project doesn't currently provide a way to load a model like that, but I can try to do something in my fork. The slowness you are experiencing is definitely because llama.cpp must first process the context and then write the output; the search should take very little time, that is not the problem. Are you sure you installed torch with CUDA enabled? If you have not installed torch with CUDA, the vector search is done on the CPU and it is VERY slow. With my GPU (RTX 2060), a search over 10 documents takes about 2-3 seconds.
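A quick hedged check for that last point (assuming torch and langchain's HuggingFaceEmbeddings are installed, as in this repo):

import torch
from langchain.embeddings import HuggingFaceEmbeddings

print("CUDA available:", torch.cuda.is_available())  # False means vector search runs on the CPU

device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name="all-mpnet-base-v2",
    model_kwargs={"device": device},  # run the embedding model on the GPU when possible
)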
Can you please show me how to install it? Do I need to add something to the requirements.txt? I used a conda env and installed there, Python 3.10. I saw you added some explanation to the readme for Windows, but I don't understand what WIP is. I guess I should follow https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html or just run this? pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
This is the right command, but first you have to uninstall the old torch installation. Now, I was trying it and it really seems that llama.cpp doesn't work on Windows with GPU, only with CPU; at most you should try with a Linux distro. Remember that installing torch with CUDA will only speed up the vector search, not the text generation by llama.cpp. But with your hardware I don't think llama.cpp should run slowly. You will surely unlock the full capabilities on a Linux system, however.
@DanielusG yeah, because llama.cpp is specifically designed for CPU. I'll try with dual-boot Debian; installing all of this has some differences. After a few seconds here and there, I can see the coolers on my graphics card start yelling :D Will try with other models too.
@sime2408 My philosophy is: you paid for the whole GPU, use the whole GPU, so go... make it yell, because that's what it was born for! (If you want, you can keep me updated by contacting me on Discord: DanielusG26#2745)
Sent you a request on Discord :) Thanks, man, I am still struggling to make it work at the speed you got. I am orbita24#3506 there. Actually, let's have our own room so other people can join too. I'll show you what I have. By the way, this is my repo: https://github.com/sime2408/scrapalot-chat
Hi Jason,
Can't say I can remember now; my memlock limit is now 64M where it used to be 4M, but for the life of me I can't find where I set it. My current limits.conf is the default.

On Fri, Jun 9, 2023 at 3:23 AM Jasen Mackie wrote:
> ok, installing the latest nvidia toolkit (12.1) has allowed llama-cpp-python to build correctly, seems the ubuntu packages are somewhat out of date. also had to edit /etc/security/limits.conf to raise the memlock limit.
> @darrinh I am unable to edit /etc/security/limits.conf to unlimited. Is that the value you used, and did you save it for the root user? Can you share the contents of your limits.conf file? Thank you!
@imartinez Some research I did together with @DanielusG indicates we can merge this, let's say to a separate branch, or we must add a flag to turn GPU acceleration on/off. With the help of @DanielusG I managed to run it on my Windows PC; I didn't test it on Linux and macOS, but I gathered some experiences I can list here. Note this worked with llama-cpp-python==0.1.61.

GPU acceleration
Most importantly: embeddings_kwargs = {'device': 'cuda'} if gpu_is_enabled else {}

GPU (Windows)
Set the appropriate environment flag, then install the Windows requirements: pip3 install -r requirements_windows.txt
Install the Visual Studio 2019 - 2022 C++ compiler on Windows 10/11:
You can use the included installer batch file to install the required dependencies for GPU acceleration, or:
If the above doesn't work for you, you will have to manually build the llama-cpp-python library with CMake:
Next is what @DanielusG already tested on Linux, I guess:

GPU (Linux)
Set the appropriate environment flag. If you have an Nvidia GPU, you can speed things up by installing the Linux requirements (some libraries might be different per OS, that's why I separated the requirements files):
pip3 install -r requirements_linux.txt
First, you have to uninstall the old torch installation and install the CUDA one:
pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Now, set environment variables and source them:
vim ~/.bashrc
export LLAMA_CUBLAS=1
export LLAMA_CLBLAST=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
source ~/.bashrc

LLAMA
llama.cpp doesn't work easily on Windows with GPU, so you should try with a Linux distro. You should install the latest CUDA toolkit:
conda install -c conda-forge cudatoolkit
pip uninstall llama-cpp-python
If you're already in a conda env, you can uninstall llama-cpp-python like this:
pip3 uninstall llama-cpp-python
Install llama:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.61 --no-cache-dir
Modify the LLM code to accept n_gpu_layers:
llm = LlamaCpp(model_path=model_path, ..., n_gpu_layers=20)
Change the environment variables for the model:
MODEL_TYPE=llamacpp
MODEL_ID_OR_PATH=models/ggml-vic13b-q5_1.bin
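Putting the pieces above together, a hedged Python sketch of the GPU-enabled path (the GPU_IS_ENABLED flag name is illustrative, and keyword support depends on the installed langchain / llama-cpp-python versions; 0.1.61 was used here):

import os
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import LlamaCpp

gpu_is_enabled = os.environ.get("GPU_IS_ENABLED", "false").lower() == "true"  # illustrative flag name

# Embeddings on the GPU when enabled, otherwise default (CPU) behaviour.
embeddings_kwargs = {"device": "cuda"} if gpu_is_enabled else {}
embeddings = HuggingFaceEmbeddings(
    model_name=os.environ.get("EMBEDDINGS_MODEL_NAME", "all-mpnet-base-v2"),
    model_kwargs=embeddings_kwargs,
)

# llama.cpp with part of the model offloaded to the GPU when enabled.
llm = LlamaCpp(
    model_path=os.environ.get("MODEL_ID_OR_PATH", "models/ggml-vic13b-q5_1.bin"),
    n_gpu_layers=20 if gpu_is_enabled else 0,
    verbose=False,
)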