VERY BIG performance improvement and beautiful features #521
DanielusG commented on May 29, 2023 (edited)
- Fixed an issue that made evaluation of the user's input prompt extremely slow; this brings a huge performance increase, roughly 5-6 times faster.
- Added a script to install CUDA-accelerated requirements
- Added the OpenAI model (it may go outside the scope of this repository, so I can remove it if necessary)
- Added some additional flags in the .env
- Changed the embeddings model to a better-performing one
- Bumped some versions, such as llama-cpp-python and langchain (WARNING: with the new version of llama.cpp, models are more powerful, but old models are now completely incompatible; if necessary, you can downgrade)
- Removed the state-of-the-union example file, because someone who does not notice it might leave it there and it would interfere with their queries
- Added auto translation (perhaps it should be removed, because it uses an Internet connection)
- Added OpenAI LLM
- Added GPU layer offload for llama.cpp
- This prevents the answer from being buried by the document references
- Added a script to install CUDA acceleration
This seems pretty good.
How does adding n_gpu_layers and use_mlock help performance?
# Print the relevant sources used for the answer
for document in docs:
    print("\n> " + document.metadata["source"] + ":")
    print(document.page_content)
Maybe add some nice colors? Also, using deep-translator==1.11.1 with an env flag if someone wants to translate the answer:

import os
from deep_translator import GoogleTranslator  # deep-translator==1.11.1, as suggested above

translate_src = os.environ.get('TRANSLATE_SRC_LANG', "en")
translate_dst = os.environ.get('TRANSLATE_DST_LANG', "fr")
# translate_ans would be an env-driven boolean flag defined elsewhere
for document in docs:
    print(f"\n\033[31m Source: {document.metadata['source']} \033[0m")
    if translate_ans:
        document.page_content = GoogleTranslator(source=translate_src, target=translate_dst).translate(document.page_content)
    print(f"\033[32m\033[2m : {document.page_content} \033[0m")
Yes, I was already thinking about it. I was going to implement it as soon as I had some free time :)
export LLAMA_CUBLAS=1
So in your case:
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
are not needed?
I think those flags are for Windows; this script works with Linux. I already tested it :)
Later I will check for Windows.
MODEL_PATH=/path/for/model
# best english embeddings model
# best italian: efederici/sentence-it5-base
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
I think that this one uses 768 dimensions and might not work with most models, like ggml-vic13b-q5_1.bin or koala-7B.ggmlv3.q8_0.bin.
I was also concerned about this, but it seems to work well in my tests. I am using it to pull formulas from my physics book while studying hahaha.
Sometimes it gets things wrong, but I'm not sure that depends on this.
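(A quick way to check the dimensionality concern, as a hedged sketch: it assumes sentence-transformers is installed, as it already is for the embeddings, and simply prints the output size of the configured embeddings model; any previously built vector store must have been created with the same dimensionality.)

from sentence_transformers import SentenceTransformer

# Print the embedding dimensionality of the model configured in .env;
# an index built with a different dimensionality would need re-ingesting.
model = SentenceTransformer("all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # 768 for all-mpnet-base-v2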
If the user has an Nvidia GPU, part of the model will be offloaded to the GPU, which speeds things up. mlock keeps the model locked in RAM so it is not paged out to disk, meaning fewer disk reads and better performance. I added them as options in the .env precisely to let the user choose whether they want these improvements or not.
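As a hedged illustration of that answer (the .env flag names here are made up for the example, and it assumes a langchain / llama-cpp-python build that exposes GPU offloading), the two options could be wired up roughly like this:

import os
from langchain.llms import LlamaCpp

# Hypothetical .env flag names, shown only to illustrate the idea.
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", 0))  # 0 = stay on CPU
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"

llm = LlamaCpp(
    model_path=os.environ.get("MODEL_PATH"),
    n_gpu_layers=n_gpu_layers,  # offload part of the model to the Nvidia GPU
    use_mlock=use_mlock,        # lock the model in RAM so it is not paged out to disk
    verbose=False,
)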
privateGPT.py (outdated diff)
case "GPT4All": | ||
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False) | ||
case "OpenAI": | ||
llm = OpenAI(model=model_path,openai_api_key=os.environ.get('OPENAI_API_KEY'),streaming=True,callbacks=callbacks, verbose=False) |
Maybe add a check that the key is present in the .env file?

if not os.environ.get('OPENAI_API_KEY'):
    print("Add your OPENAI_API_KEY to the .env file. Get your OpenAI API key from [here](https://platform.openai.com/account/api-keys).\n")
    exit(0)

Also, you'll have to check whether the key is valid by wrapping the Q/A section:

# AuthenticationError is provided by the openai package (e.g. from openai.error import AuthenticationError)
try:
    res = qa(query)
    answer, docs = res['result'], [] if args.hide_source else res['source_documents']
    # .....
    # rest of the code...
except AuthenticationError as e:
    print(f"Warning: An exception occurred. Your OPENAI_API_KEY is invalid: {e.error}")
Thanks for the contribution! Looking forward to merging this! Some comments/questions:
Thanks again!
I had also opened an issue about it: #493. For those who don't know, n_batch indicates how many tokens at a time are processed by the llama context. With it set so low, a prompt of 1000 tokens meant a loop of 1000/8 = 125 iterations, which is extremely slow and heavy. In fact, the default value used in the main llama.cpp repository is 512. After some experimentation, I found that 1024 seems to be a good value.
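For reference, a hedged sketch of the parameter being discussed (the values are the ones mentioned in this thread; the rest of the call is illustrative):

from langchain.llms import LlamaCpp

# With n_batch=8, a 1000-token prompt needs ~125 evaluation passes;
# with n_batch=1024 the whole prompt fits in a single batch.
llm = LlamaCpp(
    model_path="models/ggml-vic13b-q5_1.bin",  # illustrative path
    n_ctx=1000,
    n_batch=1024,  # previously 8; llama.cpp's own default is 512
    verbose=False,
)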
Yes, this is perfectly fine; in fact I even wrote it above. I simply left it in because I was getting poor performance with llama 13b and wanted to test whether GPT 3.5 was better. In the end I solved it by using vicuna 13b 1.1 q5_1.
Yes, I think so too. If you want, I can do it, but my English is not so good hahah
@imartinez since I understand that many people may not have sufficient computing power to run this locally, if you want I can create a new branch in my fork and leave OpenAI active. What do you think?
- The integration of OpenAI does not reflect the purpose of this project, so it will be removed
Added automatic translation to the prompt
Fix a typo
Update README.md
For reference, after running the requirements, I still had to install the following (on a clean environment):
The last resulted in: nvcc fatal : Value 'native' is not defined for option 'gpu-architecture', running an i5 / 32 GB RAM / Nvidia Titan 12 GB VRAM. nvcc --list-gpu-arch: compute_35
Did you use the bash script? If so, before starting the script you must execute:
Haven't tried this one, but I was able to run the original (privateGPT) without problems on my Mac M1 with 8 GB.
It appears that the way llama.cpp loads and processes the model on M1 processors is different from other processors, so privateGPT without this pull request should still work fine for you. Unfortunately I don't have a MacBook available to give you more information, sorry.
Any talk of running CUDA inside Docker? I heard somewhere that it's possible:
Once it works for you, you can
(changed the path to the dir with the models, obviously) https://docs.docker.com/config/containers/resource_constraints/
Hi @DanielusG, I'm interested if you are thinking of keeping an OpenAI branch to try out. It would need the readme updated on that branch to point out MODEL_TYPE=OpenAI and anything else. Thanks.
This is definitely possible. I've used the tech you mention to deploy instant-ngp in a restricted environment that ran an older OS. Performance was great, and forwarding the UI out of the container was also possible if you have a need for a GUI. Interestingly, it's also used by cog to streamline the deployment of ML models via docker containers. It attempts to make the packaging of the dependencies less of a headache. Not sure if that system would suit the needs of this repo, but it could be worth a look as well.
auto_translate = os.environ.get("AUTO_TRANSLATE")

def translate(text):
Personally I'd remove the translation feature from this PR so that the great core improvements can be reviewed separately, since this feature might be a bit controversial (requires internet, uses Google Translate, ...).
If you decide to leave it in, though, it would be awesome if you could mention it in the README and add AUTO_TRANSLATE to the example.env.
To be honest, all the changes I have made in this PR are changes I had to make to get privateGPT working for me, and I thought that, just as they are useful for me, they might be useful for someone else. In any case, the translation function, as useful as it is, totally goes against the purpose of this project, so yes, I will remove it.
I was thinking, however, of having the model translate the prompt locally.
For example, asking Vicuna: "If this text is not English, translate it into English." After Vicuna does the translation, you use Vicuna's response to execute the prompt. This would ensure privacy.
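A rough sketch of that idea (hedged: the prompt wording is illustrative, and llm/qa are assumed to be the objects already created in privateGPT.py):

def translate_to_english(llm, text: str) -> str:
    # Ask the local model to translate the query before retrieval, keeping everything offline.
    prompt = (
        "If the following text is not English, translate it into English. "
        "Reply with the translated text only.\n\n" + text
    )
    return llm(prompt).strip()

query = input("\nEnter a query: ")
res = qa(translate_to_english(llm, query))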
I'm not a maintainer, but I think it would be super helpful if you separated out all your changes and created separate PRs. That would make them easier to test/evaluate in isolation and speed up merging! For example, one PR just for performance improvements, one PR for translation, one PR for removing the example text, etc.
I am not yet good with GitHub; what would you suggest I do? Close this PR and open several with individual features?
@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository; each branch of your fork can have its own PR to the original repository. Creating branches is super easy: after cloning, navigate to the repository's directory with cd REPO-NAME, then create a new branch with git checkout -b BRANCH-NAME. Commit, push, create the PR.
OK clear, I will try to do something as soon as I finish work :)
Yeah, tried this fork yesterday... 5x slower on ingest and query -- looks like a lot of work for nada.
I'm sorry, but nothing has been changed in the ingest step, so if you strangely find that this fork is slower at ingest, you have probably misconfigured something :) It would also be useful to know which platform/drivers/hardware you ran this fork on; then your comment would be more useful... thanks :D
@DanielusG Hi! I've implemented OpenAI and KoboldAPI endpoints in the same way, plus some advanced translation features (Google Translate and OneRingTranslator, which supports 7 different translators, even offline). Feel free to use my code.
@janvarev Hi! Actually, the goal of my modifications was the performance increase; then I found myself adding those extra improvements like translation. I assume you implemented them better than I did, since I just coded them on the fly. Thank you for your work!
@DanielusG On my GPU I notice very fast loading of the LLM, like 1 second, but waiting for the database similarity search took more than 52 seconds (+ Google Translate), plus searching for the answer with k = 4, no MLOCK used, on an ingest of 104 pdf/epub files. The model I used is WizardLM-7B-uncensored.ggmlv3.q8_0.bin, and you should know my setup:
I think with your setup you could run the model without using llama.cpp, loading a 7b model directly onto your GPU. This project doesn't currently provide a way to load a model like that, but I can try to do something in my fork. The slowness you are experiencing is definitely because llama.cpp must first process the context and then write the output; the search should take very little time, that is not the problem. Are you sure you installed torch with CUDA enabled? If you have not installed torch with CUDA, the vector search is done on the CPU and it is VERY slow. With my GPU (RTX 2060), a search over 10 documents takes about 2-3 seconds.
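A quick hedged check for that last point (assuming torch and langchain's HuggingFaceEmbeddings are installed, as in this repo):

import torch
from langchain.embeddings import HuggingFaceEmbeddings

print("CUDA available:", torch.cuda.is_available())  # False means vector search runs on the CPU

device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name="all-mpnet-base-v2",
    model_kwargs={"device": device},  # run the embedding model on the GPU when possible
)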
Can you please show me how to install it? Do I need to add something to the requirements.txt? I used a conda env and installed there, Python 3.10. I saw you added some explanation to the readme for Windows, but I don't understand what WIP is. I guess I should follow https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html or just run this? pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
This is the right command, but first you have to uninstall the old torch installation. Now, I was trying it and it really seems that llama.cpp doesn't work on Windows with GPU, only with CPU; at most you should try with a Linux distro. Remember that installing torch with CUDA will only speed up the vector search, not the text generation by llama.cpp. But with your hardware I don't think llama.cpp should run slowly. You will surely unlock the full capabilities on a Linux system, however.
@DanielusG yeah, because llama.cpp is specifically designed for CPU. I'll try with dual-boot Debian; installing all of this has some differences. After a few seconds here and there, I can see the coolers on my graphics card start yelling :D Will try with other models too.
@sime2408 My philosophy is: you paid for the whole GPU, use the whole GPU, so go... make it yell, because that's what it was born for! (If you want, you can keep me updated by contacting me on Discord: DanielusG26#2745)
Sent you a request on Discord :) Thanks, man, I am still struggling to make it work at the speed you got. I am orbita24#3506 there. Actually, let's have our own room so other people can join too. I'll show you what I have. By the way, this is my repo: https://github.com/sime2408/scrapalot-chat
Hi Jason,
Can't say I can remember now; my memlock limit is now 64M where it used to be 4M, but for the life of me I can't find where I set it. My current limits.conf is the default.

On Fri, Jun 9, 2023 at 3:23 AM Jasen Mackie wrote:
> ok, installing the latest nvidia toolkit (12.1) has allowed llama-cpp-python to build correctly, seems the ubuntu packages are somewhat out of date. also had to edit /etc/security/limits.conf to raise the memlock limit.
> @darrinh I am unable to edit /etc/security/limits.conf to unlimited. Is that the value you used, and did you save it for the root user? Can you share the contents of your limits.conf file? Thank you!
@imartinez Some research I did together with @DanielusG indicates we can merge this, let's say to a separate branch, or we must add a flag to turn GPU acceleration on/off. With the help of @DanielusG I managed to run it on my Windows PC; I didn't test it on Linux and macOS, but I gathered some experiences I can list here. Note this worked with llama-cpp-python==0.1.61.

GPU acceleration
Most importantly: embeddings_kwargs = {'device': 'cuda'} if gpu_is_enabled else {}

GPU (Windows)
Set the appropriate environment flag, then install the Windows requirements: pip3 install -r requirements_windows.txt
Install the Visual Studio 2019 - 2022 C++ compiler on Windows 10/11:
You can use the included installer batch file to install the required dependencies for GPU acceleration, or:
If the above doesn't work for you, you will have to manually build the llama-cpp-python library with CMake:
Next is what @DanielusG already tested on Linux, I guess:

GPU (Linux)
Set the appropriate environment flag. If you have an Nvidia GPU, you can speed things up by installing the Linux requirements (some libraries might be different per OS, that's why I separated the requirements files):
pip3 install -r requirements_linux.txt
First, you have to uninstall the old torch installation and install the CUDA one:
pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Now, set environment variables and source them:
vim ~/.bashrc
export LLAMA_CUBLAS=1
export LLAMA_CLBLAST=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
source ~/.bashrc

LLAMA
llama.cpp doesn't work easily on Windows with GPU, so you should try with a Linux distro. You should install the latest CUDA toolkit:
conda install -c conda-forge cudatoolkit
pip uninstall llama-cpp-python
If you're already in a conda env, you can uninstall llama-cpp-python like this:
pip3 uninstall llama-cpp-python
Install llama:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.61 --no-cache-dir
Modify the LLM code to accept n_gpu_layers:
llm = LlamaCpp(model_path=model_path, ..., n_gpu_layers=20)
Change the environment variables for the model:
MODEL_TYPE=llamacpp
MODEL_ID_OR_PATH=models/ggml-vic13b-q5_1.bin
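Putting the pieces above together, a hedged Python sketch of the GPU-enabled path (the GPU_IS_ENABLED flag name is illustrative, and keyword support depends on the installed langchain / llama-cpp-python versions; 0.1.61 was used here):

import os
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import LlamaCpp

gpu_is_enabled = os.environ.get("GPU_IS_ENABLED", "false").lower() == "true"  # illustrative flag name

# Embeddings on the GPU when enabled, otherwise default (CPU) behaviour.
embeddings_kwargs = {"device": "cuda"} if gpu_is_enabled else {}
embeddings = HuggingFaceEmbeddings(
    model_name=os.environ.get("EMBEDDINGS_MODEL_NAME", "all-mpnet-base-v2"),
    model_kwargs=embeddings_kwargs,
)

# llama.cpp with part of the model offloaded to the GPU when enabled.
llm = LlamaCpp(
    model_path=os.environ.get("MODEL_ID_OR_PATH", "models/ggml-vic13b-q5_1.bin"),
    n_gpu_layers=20 if gpu_is_enabled else 0,
    verbose=False,
)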