GPU question #217
Replies: 14 comments 19 replies
-
Unfortunately not. The current implementation works with CPU only. I am trying to make this work on GPU too. So far, the first few steps I can provide are:
Update: I have successfully run the model on my GPU. Planning to push the commits.
-
@maozdemir please ping me at jakub.zboina@comtegra.pl when you push a new commit with GPU support. Thanks :)
-
I was working on CUDA support yesterday with no luck. Glad to hear you had some success. Waiting for the commit as well.
-
https://github.com/maozdemir/privateGPT-colab/blob/main/privateGPT-colab.ipynb Set n_gpu_layers=500 for Colab in the LlamaCpp and LlamaCppEmbeddings calls; also, don't use GPT4All, since it won't run on GPU.
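The change described above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the model path and `n_ctx` value are placeholders, and the parameter names follow LangChain's llama-cpp wrapper as it existed at the time.

```python
# Sketch of the GPU-offload settings described above.
# The model path and n_ctx below are placeholders, not taken from the notebook.
gpu_kwargs = {
    "n_gpu_layers": 500,  # larger than any model's layer count, so all layers are offloaded
    "n_batch": 512,       # assumption: a commonly used batch size for GPU prompt processing
}

# In the notebook, these would be passed to both constructors, e.g.:
# llm = LlamaCpp(model_path="models/ggml-model.bin", n_ctx=1024, **gpu_kwargs)
# embeddings = LlamaCppEmbeddings(model_path="models/ggml-model.bin", **gpu_kwargs)
```

Setting `n_gpu_layers` higher than the model's actual layer count is harmless; llama.cpp simply offloads every layer it has.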
-
Hi all, I'm on Windows, but I finally got inference with GPU working! (These tips assume you already have a working version of this project and just want to start using the GPU instead of the CPU for inference.)
This issue was quite helpful to me if you aren't able to get it working with the tips above: abetlen/llama-cpp-python#250
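The key step from that issue is rebuilding llama-cpp-python with cuBLAS enabled. A sketch, assuming the CUDA toolkit is already installed (PowerShell users set the variables with `$env:` instead of `export`); the flags match the llama-cpp-python build instructions at the time of writing, so check the project's README for the current equivalents:

```shell
# Reinstall llama-cpp-python with CUDA (cuBLAS) acceleration enabled.
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```

After reinstalling, llama.cpp's startup log should report `BLAS = 1` when the GPU build is active.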
-
In case you guys are curious, our h2oGPT is focused more on GPU, has a full UI, and is otherwise like privateGPT: https://github.com/h2oai/h2ogpt
-
Check #425
-
What about using MPS for training and the Neural Engine for inference on Apple Silicon?
-
In case someone gets stuck here (like me) with this error:
-
With a 3060 Ti it's for some reason much slower than on my old i5-7400: single-core takes about 5 minutes and quad-core nearly 4 minutes, but on the 3060 Ti it takes 10+ minutes. Does anyone know why this is happening?
-
For some reason, when I made these changes, I'm getting this error and I can't fix it:
-
Hello, @maozdemir! First of all, congratulations on the effort to provide GPU support. Second, I'm starting to use CUDA, and I've just downloaded the CUDA toolkit for my old GTX 750 Ti. NVIDIA currently provides version 12.2 of the toolkit and no longer 11.8, but your install requires CUDA 11.8. Do you think it can run on 12.2 too? Thanks for any help you can offer.
-
You can get it here:
-
What about support for running on a Habana Labs Gaudi-based box?
-
I'm curious to set up this model myself. I have two 3090s and 128 GB of RAM on an i9, all liquid-cooled. Would the GPUs be relevant here, or are they only used for training models?
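For a dual-GPU setup like the one above, llama-cpp-python can also split the model across both cards via its `tensor_split` parameter. A hypothetical sketch: the parameter exists in llama-cpp-python, but the split ratio, layer count, and model path below are placeholder assumptions, not values from this thread.

```python
# Sketch: splitting inference across two GPUs with llama-cpp-python.
# All values are illustrative; the model path is a placeholder.
dual_gpu_kwargs = {
    "n_gpu_layers": 500,         # offload every layer to the GPUs
    "tensor_split": [0.5, 0.5],  # even split across the two cards
}

# Usage (not run here):
# from llama_cpp import Llama
# llm = Llama(model_path="models/ggml-model.bin", **dual_gpu_kwargs)
```

The proportions in `tensor_split` control how much of the model each device holds; an uneven split (e.g. `[0.7, 0.3]`) can help when the cards have different amounts of free VRAM.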