VERY BIG performance improvement and beautiful features #521

Closed
wants to merge 16 commits

Conversation

DanielusG

@DanielusG DanielusG commented May 29, 2023

  • Fixed an issue that made the evaluation of the user input prompt extremely slow; this brings a monstrous increase in performance, about 5-6 times faster.
  • Added a script to install CUDA-accelerated requirements
  • Added the OpenAI model (it may go outside the scope of this repository, so I can remove it if necessary)
  • Added some additional flags in the .env
  • Changed the embeddings model to a better-performing one
  • Bumped some versions, like llama-cpp-python and langchain (WARNING: with the new version of llama.cpp, models are more powerful, but old models are now completely incompatible; if necessary, you can downgrade)
  • Removed the state-of-the-union example file, because someone who doesn't notice it might leave it there and it would interfere with their queries
  • Added auto translation (perhaps it should be removed because it uses an Internet connection)

- Added OpenAI LLM
- Added GPU layer offload for llama.cpp
- This prevents the answer from being buried by the document references
- Added a script to install CUDA acceleration
StolasMartin
StolasMartin previously approved these changes May 29, 2023

@StolasMartin StolasMartin left a comment


This seems pretty good.

@8bitaby

8bitaby commented May 29, 2023

How does adding n_gpu_layers and use_mlock help with performance?

llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)

# Print the relevant sources used for the answer
for document in docs:
    print("\n> " + document.metadata["source"] + ":")
    print(document.page_content)

@sime2408 sime2408 May 29, 2023


Maybe adding some nice colors? Also using deep-translator==1.11.1 with the env flag if someone wants to translate the answer?

import os
from deep_translator import GoogleTranslator  # deep-translator==1.11.1

translate_src = os.environ.get('TRANSLATE_SRC_LANG', "en")
translate_dst = os.environ.get('TRANSLATE_DST_LANG', "fr")
translate_ans = os.environ.get('TRANSLATE_ANSWER', "false") == "true"  # flag name illustrative

for document in docs:
    # Source name in red, then (optionally) the translated content in dim green
    print(f"\n\033[31m Source: {document.metadata['source']} \033[0m")
    if translate_ans:
        document.page_content = GoogleTranslator(source=translate_src, target=translate_dst).translate(document.page_content)
        print(f"\033[32m\033[2m : {document.page_content} \033[0m")

DanielusG (Author)

Yes, I was already thinking about it. I was going to implement it as soon as I had some free time :)

@@ -0,0 +1,7 @@
export LLAMA_CUBLAS=1


So in your case:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1

are not needed?

DanielusG (Author)

I think those flags are for Windows; this script works with Linux. I already tested it :)
Later I will check for Windows.

MODEL_PATH=/path/for/model
#best english embeddings model
#best italian efederici/sentence-it5-base
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2


I think this one uses 768-dimensional embeddings and might not work with most models like ggml-vic13b-q5_1.bin or koala-7B.ggmlv3.q8_0.bin.

DanielusG (Author)

I was also concerned about this, but it seems to work well in my tests. I am using it to pull formulas from my physics book while studying hahaha.
Sometimes it gets things wrong, but I'm not sure that depends on this.
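
For what it's worth, the 768 figure mentioned above can be checked directly with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # 768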

@DanielusG
Author

How does adding n_gpu_layers and use_mlock help with performance?

llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)

If the user has an Nvidia GPU, part of the model will be offloaded to the GPU, which speeds things up. mlock locks the model in RAM so it is not swapped out to disk, so more performance. I added them as options in the .env precisely to let the user choose whether they want these improvements or not.
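
For context, a minimal sketch of how those options can be wired up from the .env, assuming python-dotenv and the langchain LlamaCpp wrapper; the environment variable names MODEL_N_GPU_LAYERS and USE_MLOCK here are illustrative rather than the exact ones in this PR:

import os
from dotenv import load_dotenv
from langchain.llms import LlamaCpp

load_dotenv()

model_path = os.environ.get("MODEL_PATH")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))
# Illustrative flag names: 0 GPU layers / no mlock falls back to the old CPU-only behaviour
n_gpu_layers = int(os.environ.get("MODEL_N_GPU_LAYERS", 0))
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"

llm = LlamaCpp(
    model_path=model_path,
    n_ctx=model_n_ctx,
    verbose=False,
    n_gpu_layers=n_gpu_layers,  # layers offloaded to the GPU (VRAM permitting)
    use_mlock=use_mlock,        # lock the model in RAM so it is not swapped to disk
    top_p=0.9,
    n_batch=1024,               # larger batches mean fewer prompt-evaluation passes
)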

privateGPT.py Outdated
case "GPT4All":
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
case "OpenAI":
llm = OpenAI(model=model_path,openai_api_key=os.environ.get('OPENAI_API_KEY'),streaming=True,callbacks=callbacks, verbose=False)

@sime2408 sime2408 May 29, 2023


Maybe add a check that the key is present in the .env file?

if not os.environ.get('OPENAI_API_KEY'):
    print("Add your OPENAI_API_KEY to .env variable. Get your OpenAI API key from [here](https://platform.openai.com/account/api-keys).\n")
    exit(0)

Also, you'll have to check if the key is valid by wrapping the q / a section:

from openai.error import AuthenticationError  # openai < 1.0

try:
    res = qa(query)
    answer, docs = res['result'], [] if args.hide_source else res['source_documents']
    # .....
    # rest of the code...

except AuthenticationError as e:
    print(f"Warning: An exception occurred. Your OPENAI_API_KEY is invalid: {e.error}")

@imartinez
Collaborator

Thanks for the contribution! Looking forward to merging this! Some comments/questions:

  • Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?

  • I'd leave OpenAI out of this repo for the moment. It doesn't fit well with the very purpose of it.

  • I guess some updates to the readme would be required.

Thanks again!

@DanielusG
Author

Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?

I had also opened an issue about it: #493
Basically, after a while of debugging, I ran the code in debug mode and analysed line by line where the program would get stuck for a long time, and after quite a while I discovered that when the langchain library made the call to LlamaCpp, it passed it the parameter n_batch = 8.

For those who don't know, n_batch indicates how many tokens at a time are processed by the llama context. With it set so low, a context of 1000 tokens meant a loop of 1000/8 = 125 iterations, which is extremely slow and heavy. In fact, the default value used in the main llama.cpp repository is 512. After some experimentation, I noticed that 1024 seems to be a good value.
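
As a rough illustration of that arithmetic (the chunking itself happens inside llama.cpp; the numbers are the ones from the comment above):

# The prompt is evaluated in chunks of n_batch tokens, so a small n_batch
# means many more passes over the model for the same prompt.
prompt_tokens = 1000

for n_batch in (8, 512, 1024):
    chunks = -(-prompt_tokens // n_batch)  # ceiling division
    print(f"n_batch={n_batch:4d} -> {chunks} evaluation pass(es)")

# n_batch=   8 -> 125 evaluation pass(es)
# n_batch= 512 -> 2 evaluation pass(es)
# n_batch=1024 -> 1 evaluation pass(es)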

I'd leave OpenAI out of this repo for the moment. It doesn't fit well with the very purpose of it.

Yes, this is perfectly fine; in fact, I even wrote it above. I simply left it in because I was getting bad performance with llama 13b and wanted to test whether GPT-3.5 was better.

In the end I solved it by using vicuna 13b 1.1 q5_1.

I guess some updates to the readme would be required.

Yes, I think so too. If you want I can do it, but my English is not so good hahah

@DanielusG
Author

@imartinez since I understand that many people may not have sufficient computing power to run this code, if you want I can create a new branch in my fork and keep OpenAI active. What do you think?

- The integration of OpenAI does not reflect the purpose of this project, so it will be removed
README.md Outdated
@darrinh

darrinh commented May 30, 2023

For reference, after running the requirements, I still had to install the following (in a clean environment):

  • python -m pip install python-dotenv
  • pip install tqdm
  • pip install langchain
  • pip install chromadb
  • pip install sentence_transformers
  • pip install llama-cpp-python

the last resulted in:

nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'

Running an i5 / 32 GB RAM / Nvidia Titan 12 GB VRAM.

nvcc --list-gpu-arch:

compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87

@DanielusG
Author

For reference, after running the requirements, I still had to install the following (in a clean environment):

  • python -m pip install python-dotenv
  • pip install tqdm
  • pip install langchain
  • pip install chromadb
  • pip install sentence_transformers
  • pip install llama-cpp-python

the last resulted in:

nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'

Running an i5 / 32 GB RAM / Nvidia Titan 12 GB VRAM.

nvcc --list-gpu-arch:

compute_35 compute_37 compute_50 compute_52 compute_53 compute_60 compute_61 compute_62 compute_70 compute_72 compute_75 compute_80 compute_86 compute_87

Did you use the bash script?

If so, before starting the script you must execute:
source ./venv/bin/activate
to activate the local environment.

@marzim

marzim commented Jun 1, 2023

Haven't tried this one, but I was able to run the original (privateGPT) without problems on my Mac M1 with 8 GB.
Question: what would be the configuration for running this on my Mac M1 with only 8 GB?

@DanielusG
Author

Haven't tried this one, but I was able to run the original (privateGPT) without problems on my Mac M1 with 8 GB. Question: what would be the configuration for running this on my Mac M1 with only 8 GB?

It appears that the way in which llama.cpp loads and processes the model on M1 processors is different than on normal processors. So privateGPT without this pull request should still work fine for you. Unfortunately I don't have a MacBook available to give you any information, sorry

@sime2408

sime2408 commented Jun 2, 2023

Any talk about running CUDA inside Docker? I heard somewhere that it's possible:

Nvidia CUDA in a Docker container:
1. run nvidia-smi on host, it needs to run successfully
2. install nvidia-container-toolkit
3. restart Docker process
4. run a test container like so

docker run --gpus all nvidia/cuda:12.1.1-base-ubi8 nvidia-smi

The output should be the same as in step 1.

Once it works for you, you can

  1. pull this PR (not merged, but working): docker: add support for CUDA in docker ggerganov/llama.cpp#1461
  2. download a model from here (I tried with the smallest one): https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main
  3. to create a Docker image locally, there is a description of how to do it in the PR
  4. start the process like this:
    docker run --rm --gpus all -v ~/Development/LLM/Models/:/models local/llama.cpp:light-cuda -m /models/llama-7b-ggml-v2-q4_0.bin -p "Here's a haiku about a rotten banana" -n 512 --n-gpu-layers 1

(with the path changed to your models directory, obviously)

https://docs.docker.com/config/containers/resource_constraints/

@ACoderLife

Hi @DanielusG, I'm interested if you are keeping an OpenAI branch to try out. It would need the readme updated on that branch to point out MODEL_TYPE=OpenAI and anything else. Thanks.

@JasonGilholme

@sime2408

Any talks to run cuda inside docker? heard somewhere that it's possible

This is definitely possible. I've used the tech you mention to deploy instant-ngp in a restricted environment that ran an older OS. Performance was great, and forwarding the UI out of the container was also possible if you have a need for a GUI.

Interestingly, it's also used by cog to streamline the deployment of ML models via docker containers. It attempts to make the packaging of the dependencies less of a headache. Not sure if that system would suit the needs of this repo, but could be worth a look as well.


auto_translate = os.environ.get("AUTO_TRANSLATE")

def translate(text):


Personally I'd remove the translation feature from this PR so that the great core improvements can be reviewed separately, since this feature might be a bit controversial (requires internet, uses Google Translate, ...).
If you decide to leave it in, though, it would be awesome if you could mention it in the README and add AUTO_TRANSLATE to the example.env.

DanielusG (Author)

To be honest, all the changes I have made in this PR are changes I had to make to get privateGPT working for me. And so I thought that just as they are useful for me, they might be useful for someone else. In any case, as useful as the translation function is, it totally goes against the purpose of this project, so yes, I will remove it.
I was thinking, however, of having the model translate the prompt locally.

For example, asking Vicuna "If this text is not English, translate it into English." After Vicuna does the translation, you use Vicuna's response to execute the prompt. This would ensure privacy.
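
A minimal sketch of that idea (purely illustrative: llm stands for whatever local model object is already loaded, and the prompt wording is made up):

def translate_locally(llm, text: str) -> str:
    # Let the local model itself normalise the query to English,
    # so no external translation service (and no internet) is needed.
    instruction = (
        "If the following text is not in English, translate it into English; "
        "otherwise return it unchanged.\n\n" + text
    )
    return llm(instruction)

# english_query = translate_locally(llm, query)
# res = qa(english_query)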

@riverar

riverar commented Jun 6, 2023

I'm not a maintainer but I think it would be super helpful if you separate out all your changes and create separate PRs. That'll make it easier to test/evaluate in isolation and speed up merging!

For example, one PR just for performance improvements. One PR for translation. One PR for removing the example text. etc. etc.

@DanielusG
Author

I'm not a maintainer but I think it would be super helpful if you separate out all your changes and create separate PRs. That'll make it easier to test/evaluate in isolation and speed up merging!

For example, one PR just for performance improvements. One PR for translation. One PR for removing the example text. etc. etc.

I am not yet good with GitHub; what would you suggest I do? Close this PR and open several with individual features?
How do I remove the merged changes from my master branch?
Thanks for your patience 🥲

@sime2408

sime2408 commented Jun 6, 2023

@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository. Each branch of your forked repository can have its own PR to the original repository

Creating branches is super easy:

After cloning, navigate to the repository's directory using the command cd REPO-NAME and then create a new branch using the command git checkout -b BRANCH-NAME.

Commit, push, create PR

@DanielusG
Author

@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository. Each branch of your forked repository can have its own PR to the original repository

Creating branches is super easy:

After cloning, navigate to the repository's directory using the command cd REPO-NAME and then create a new branch using the command git checkout -b BRANCH-NAME.

Commit, push, create PR

OK clear, I will try to do something as soon as I finish work :)

@quantumalchemy

Yeah, tried this fork yesterday... 5x slower on ingest and query -- looks like a lot of work for nada.

@DanielusG
Author

Yeah, tried this fork yesterday... 5x slower on ingest and query -- looks like a lot of work for nada.

I'm sorry, but nothing has been changed in the ingest, so if you strangely find that this fork is slower at ingest, you are obviously misconfiguring something :)

It would also be useful to know which platform/drivers/hardware you ran this fork on; then maybe your comment would be more useful... thanks :D

@DanielusG DanielusG mentioned this pull request Jun 6, 2023
@DanielusG
Author

@riverar @sime2408 I have created a new PR with only the performance changes; let me know what you think (#649)

@janvarev

janvarev commented Jun 7, 2023

@DanielusG Hi!
Maybe you'd be interested in my PR too: #325

I've implemented OpenAI and KoboldAPI endpoints in the same way, and also some advanced translation stuff (Google Translate and OneRingTranslator, for 7 different translators, even offline). Feel free to use my code.

@DanielusG
Author

@DanielusG Hi! Maybe you'd be interested in my PR too: #325

I've implemented OpenAI and KoboldAPI endpoints in the same way, and also some advanced translation stuff (Google Translate and OneRingTranslator, for 7 different translators, even offline). Feel free to use my code.

@janvarev Hi! Actually, the goal of my modifications was a performance increase; then I found myself adding improvements like translation along the way. I assume you implemented them better than I did; I just coded them on the fly.

Thank you for your work!

@sime2408

sime2408 commented Jun 7, 2023

@riverar @sime2408 I have created a new PR with only the performance changes; let me know what you think (#649)

@DanielusG On my GPU I notice very fast loading of the LLM, like 1 second, but waiting for the database similarity search (+ Google Translate) plus searching for the answer with k = 4 took more than 52 seconds, with no MLOCK used, on an ingest of 104 pdf/epub files. The model I used is WizardLM-7B-uncensored.ggmlv3.q8_0.bin, and for context here is my setup:

  • MSI nVidia GeForce RTX4090 Ventus 3X OC, 24 GB GDDR6X, DLSS 3
  • motherboard: Asus ROG STRIX B660-G Gaming WIFI, socket 1700, DDR5, HDMI,DP, Wi-Fi 6, BT 5.2
  • INTEL Core i9 12900KF
  • RAM KINGSTON 64 GB (2x32 GB) DDR5, 5600 MHz, DIMM, Fury Beast, CL40
  • cooler: ARCTIC Freezer 34 eSports DUO Black/Grey
  • power supply: GIGABYTE UD1000GM PG5, 1000 W, ultra-durable, PCIe Gen 5.0,
  • SSD disk 1 TB, SAMSUNG 870 QVO, 2.5", SATA III, 4bit MLC V-NAND, MZ-77Q1T0BW
  • (OS installed) SSD disk 500 GB, SAMSUNG 980, M.2 2280, PCIe 3.0 x4 NVMe, 3-bit MLC V-NAND, MZ-V8V500BW
  • OS: Win 11

@DanielusG
Author

DanielusG commented Jun 7, 2023

@riverar @sime2408 I have created a new PR with only the performance changes; let me know what you think (#649)

@DanielusG On my GPU I notice very fast loading of the LLM, like 1 second, but waiting for the database similarity search (+ Google Translate) plus searching for the answer with k = 4 took more than 52 seconds, with no MLOCK used, on an ingest of 104 pdf/epub files. The model I used is WizardLM-7B-uncensored.ggmlv3.q8_0.bin, and for context here is my setup:

  • MSI nVidia GeForce RTX4090 Ventus 3X OC, 24 GB GDDR6X, DLSS 3
  • motherboard: Asus ROG STRIX B660-G Gaming WIFI, socket 1700, DDR5, HDMI,DP, Wi-Fi 6, BT 5.2
  • INTEL Core i9 12900KF
  • RAM KINGSTON 64 GB (2x32 GB) DDR5, 5600 MHz, DIMM, Fury Beast, CL40
  • cooler: ARCTIC Freezer 34 eSports DUO Black/Grey
  • power supply: GIGABYTE UD1000GM PG5, 1000 W, ultra-durable, PCIe Gen 5.0,
  • SSD disk 1 TB, SAMSUNG 870 QVO, 2.5", SATA III, 4bit MLC V-NAND, MZ-77Q1T0BW
  • (OS installed) SSD disk 500 GB, SAMSUNG 980, M.2 2280, PCIe 3.0 x4 NVMe, 3-bit MLC V-NAND, MZ-V8V500BW
  • OS: Win 11

I think with your setup you could run the model without using llama.cpp, loading a 7b model directly on your GPU. Currently this project doesn't provide a way to load a model like that, but I can try to do something in my fork.

The slowness you are experiencing is definitely due to the fact that llama.cpp must first process the context and then write the output. The search should take very little time; that is not the problem.

Are you sure you installed torch with CUDA enabled? If you have not installed torch with CUDA, the vector search will be done on the CPU and it is VERY slow. With my GPU (an RTX 2060), a search over 10 documents takes about 2-3 seconds.
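
A quick way to verify this is the standard PyTorch check; if it prints False, the embeddings and similarity search fall back to the CPU:

import torch

print(torch.version.cuda)          # CUDA version torch was built against (None on CPU-only builds)
print(torch.cuda.is_available())   # must be True for the vector search to use the GPU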

@sime2408

sime2408 commented Jun 7, 2023

Are you sure you installed torch with CUDA enabled? If you have not installed torch with CUDA, the vector search will be done on the CPU and it is VERY slow. With my GPU (an RTX 2060), a search over 10 documents takes about 2-3 seconds.

Can you please show me how to install it? Do I need to add something to requirements.txt?

I used a conda env and installed there, Python 3.10. I saw you added some explanation to the readme for Windows, but I don't understand what WIP is?

I guess I should follow https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html

or just run this? pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

@DanielusG
Author

or just run this? pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

This is the right command, but first you have to uninstall the old torch installation by typing pip uninstall torch

I was just trying it, and it really seems that llama.cpp doesn't work on Windows with GPU, only with CPU; if anything, you should try a Linux distro.

Remember that installing torch with CUDA will only speed up the vector search, not the text generation by llama.cpp. But with that hardware I don't think llama.cpp should run slowly. Surely, however, you will unlock the full capabilities on a Linux system.

@sime2408

sime2408 commented Jun 7, 2023

@DanielusG yeah, because llama.cpp is specifically designed for the CPU. I'll try with dual-boot Debian; installing all of this has some differences. After a few seconds here and there, I can see the coolers on my graphics card start yelling :D Will try with other models too.

@DanielusG
Author

@DanielusG yeah, because llama.cpp is specifically designed for the CPU. I'll try with dual-boot Debian; installing all of this has some differences. After a few seconds here and there, I can see the coolers on my graphics card start yelling :D Will try with other models too.

@sime2408 My philosophy is, you paid for the whole GPU, use the whole GPU, so go... make it yell because that's what it was born for! (If you want you can keep me updated by contacting me on discord: DanielusG26#2745)

@sime2408

sime2408 commented Jun 8, 2023

@sime2408 My philosophy is, you paid for the whole GPU, use the whole GPU, so go... make it yell because that's what it was born for! (If you want you can keep me updated by contacting me on discord: DanielusG26#2745)

Sent you a request on Discord :) Thanks, man. I am still struggling to get it working at the speed you did.

I am orbita24#3506 there. Actually, let's have our own room so other people can join too: I'll show you what I have:

https://discord.gg/SaYbrNse

By the way, this is my repo: https://github.com/sime2408/scrapalot-chat

@darrinh

darrinh commented Jun 9, 2023 via email

@sime2408

sime2408 commented Jun 11, 2023

@imartinez Some research I did together with @DanielusG indicates we could merge this, let's say, into a separate branch, or we would need a flag to turn GPU acceleration on/off. With the help of @DanielusG I managed to run it on my Windows PC; I didn't test on Linux and macOS, but I gathered some experience I can list here. Note this worked for llama-cpp-python==0.1.61.

GPU acceleration

Most importantly, the GPU_IS_ENABLED variable must be set to true. Add this to HuggingFaceEmbeddings:

embeddings_kwargs = {'device': 'cuda'} if gpu_is_enabled else {}
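
For context, a minimal sketch of how that dict can be passed through, assuming the GPU_IS_ENABLED flag mentioned above and the langchain HuggingFaceEmbeddings wrapper around sentence-transformers:

import os
from langchain.embeddings import HuggingFaceEmbeddings

gpu_is_enabled = os.environ.get("GPU_IS_ENABLED", "false").lower() == "true"
embeddings_kwargs = {'device': 'cuda'} if gpu_is_enabled else {}

embeddings = HuggingFaceEmbeddings(
    model_name=os.environ.get("EMBEDDINGS_MODEL_NAME", "all-mpnet-base-v2"),
    model_kwargs=embeddings_kwargs,  # forwarded to the underlying SentenceTransformer
)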

GPU (Windows)

Set OS_RUNNING_ENVIRONMENT=windows inside .env file

pip3 install -r requirements_windows.txt

Install the Visual Studio 2019-2022 C++ compiler on Windows 10/11:

  1. Install Visual Studio.
  2. Make sure the following components are selected:
    • Universal Windows Platform development
    • C++ CMake tools for Windows
  3. Download the MinGW installer from the MinGW website.
  4. Run the installer and select the gcc component.

You can use the included installer batch file to install the required dependencies for GPU acceleration, or:

  1. Find your card driver here NVIDIA Driver Downloads

  2. Install NVidia CUDA 11.8

  3. Install llama-cpp-python package with cuBLAS enabled. Run the code below in the directory you want to build the package in.

    • Powershell:
    $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; pip3 install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    • Bash:
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  4. Enable GPU acceleration in .env file by setting GPU_IS_ENABLED to true

  5. Run ingest.py and privateGPT.py as usual

If the above doesn't work for you, you will have to manually build the llama-cpp-python library with CMake:

  1. Get the repo: git clone https://github.com/abetlen/llama-cpp-python.git,
    • switch to the tag this application is using, from the requirements-*.txt file
    • uninstall your local llama-cpp-python: pip3 uninstall llama-cpp-python
  2. Open llama-cpp-python/vendor/llama.cpp/CMakeLists.txt in a text editor and add
    set(LLAMA_CUBLAS 1) on line 178, before the if (LLAMA_CUBLAS) line.
  3. Install CMake
  4. Go to cd llama-cpp-python and perform the actions:
    • perform git submodule update --init --recursive
    • mkdir build and cd build
  5. Build llama-cpp-python yourself:
    cmake -G "Visual Studio 16 2019" -A x64 -D CUDAToolkit_ROOT="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8" .. 
  6. Change directory back to this project and install llama-cpp-python from the folder you built, e.g. pip3 install ../llama-cpp-python/

The next part is something @DanielusG already tested on Linux, I guess:

GPU (Linux):

Set OS_RUNNING_ENVIRONMENT=linux inside .env file

If you have an Nvidia GPU, you can speed things up by installing the llama-cpp-python version with CUDA
by setting this flag: export LLAMA_CUBLAS=1

(some libraries might be different per OS, that's why I separated requirements files)

pip3 install -r requirements_linux.txt

First, you have to uninstall the old torch installation and install a CUDA-enabled one:

pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Now, set environment variables and source them:

vim ~/.bashrc
export LLAMA_CUBLAS=1
export LLAMA_CLBLAST=1 
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
source ~/.bashrc

LLAMA

llama.cpp doesn't work easily on Windows with GPU, so you should try with a Linux distro.
Installing torch with CUDA will only speed up the vector search, not the text generation by llama.cpp.

You should install the latest CUDA toolkit:

conda install -c conda-forge cudatoolkit

If you're already in the conda env, you can uninstall llama-cpp-python like this:

pip3 uninstall llama-cpp-python

Install llama:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.61 --no-cache-dir

Modify LLM code to accept n_gpu_layers:

llm = LlamaCpp(model_path=model_path, ..., n_gpu_layers=20)

Change environment variable model:

MODEL_TYPE=llamacpp
MODEL_ID_OR_PATH=models/ggml-vic13b-q5_1.bin

@imartinez imartinez added the primordial Related to the primordial version of PrivateGPT, which is now frozen in favour of the new PrivateGPT label Oct 19, 2023
@imartinez imartinez closed this Dec 4, 2023