This is a fork of https://github.com/su77ungr/CASALIOY, which is itself a fork of https://github.com/imartinez/privateGPT/, with a FastAPI layer added similar to https://github.com/menloparklab/privateGPT-app/. Planned changes:
- fix container build workflow
- add tests
- add more documentation
- add more features
- add more models
- add more data
The documentation below is from the forked repo. The container is not built correctly and poetry install might not work, so install everything with:
pip install -r requirements.txt
To start the FastAPI server:
uvicorn main:app --reload
To regenerate requirements.txt:
pipreqs --ignore bin,etc,include,lib,lib64,models,source_documents,.github --encoding utf-8
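Once the server is running you can call it over HTTP. The route below is a hypothetical /query endpoint used purely for illustration; check main.py for the actual paths and payload shape:

import requests

# Hypothetical endpoint and payload -- check main.py for the real route names.
resp = requests.post(
    "http://127.0.0.1:8000/query",   # uvicorn's default host/port
    json={"query": "What is this document about?"},
    timeout=300,                      # local LLM inference can be slow
)
resp.raise_for_status()
print(resp.json())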
First install all requirements:
python -m pip install poetry
python -m poetry config virtualenvs.in-project true
python -m poetry install
. .venv/bin/activate
python -m pip install --force streamlit sentence_transformers # Temporary bandaid fix, waiting for streamlit >=1.23
pre-commit install
If you want GPU support for llama-cpp:
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force llama-cpp-python
Edit the example.env to fit your models and rename it to .env
# Generic
MODEL_N_CTX=1024
TEXT_EMBEDDINGS_MODEL=sentence-transformers/all-MiniLM-L6-v2
TEXT_EMBEDDINGS_MODEL_TYPE=HF # LlamaCpp or HF
USE_MLOCK=true
# Ingestion
PERSIST_DIRECTORY=db
DOCUMENTS_DIRECTORY=source_documents
INGEST_CHUNK_SIZE=500
INGEST_CHUNK_OVERLAP=50
# Generation
MODEL_TYPE=LlamaCpp # GPT4All or LlamaCpp
MODEL_PATH=eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
MODEL_TEMP=0.8
MODEL_STOP=[STOP]
CHAIN_TYPE=stuff
N_RETRIEVE_DOCUMENTS=100 # How many documents to retrieve from the db
N_FORWARD_DOCUMENTS=6 # How many documents to forward to the LLM, chosen among those retrieved
# The repeat penalty helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) penalizes repetitions more strongly, while a lower value (e.g., 0.9) is more lenient. The default value is 1.1.
MODEL_REPEAT_PENALTY=1.3
MODEL_TOP_K=50
MODEL_TOP_P=1
MODEL_NO_REPEAT_NGRAM_SIZE=6
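For reference, these are plain environment variables; a minimal sketch of reading them at startup, assuming python-dotenv (the repo's load_env.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1024))
model_temp = float(os.environ.get("MODEL_TEMP", 0.8))
persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
model_type = os.environ.get("MODEL_TYPE", "LlamaCpp")
model_path = os.environ.get("MODEL_PATH")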
The directory layout should look like this:
└── repo
├── startLLM.py
├── casalioy
│ └── ingest.py, load_env.py, startLLM.py, gui.py, ...
├── source_documents
│ └── sample.csv
│ └── ...
├── models
│ ├── ggml-vic7b-q5_1.bin
│ └── ...
└── .env, convert.py, Dockerfile
👇 Update your installation!
git pull && poetry install
To automatically ingest different data types (.txt, .pdf, .csv, .epub, .html, .docx, .pptx, .eml, .msg), place them inside source_documents (sample files to run tests with are already included there) and run:
python retrive/ingest.py # optional <path_to_your_data_directory>
Optional: use the y flag to purge the existing vectorstore and initialize a fresh instance:
python retrive/ingest.py # optional <path_to_your_data_directory> y
This spins up a local Qdrant namespace inside the db folder containing the local vectorstore. It will take some time, depending on the size of your documents. You can ingest as many documents as you want by running ingest repeatedly; all of them will be accumulated in the local embeddings database. To remove the dataset, simply delete the db folder.
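For context, ingestion boils down to: load each document, split it into chunks of INGEST_CHUNK_SIZE characters with INGEST_CHUNK_OVERLAP overlap, embed the chunks, and write them into the local Qdrant store under PERSIST_DIRECTORY. A minimal sketch of that flow using the LangChain wrappers (the collection name and sample file are placeholders, and the real ingest.py handles many more loaders):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

docs = TextLoader("source_documents/sample.txt").load()  # placeholder file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50  # INGEST_CHUNK_SIZE / INGEST_CHUNK_OVERLAP
).split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# path= makes qdrant-client run in local, file-based mode inside the db folder
Qdrant.from_documents(chunks, embeddings, path="db", collection_name="documents")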
In order to ask a question, run a command like:
python retrive/startLLM.py
Then wait for the script to prompt for your input:
> Enter a query:
Hit enter. You'll need to wait 20-30 seconds (depending on your machine) while the LLM model consumes the prompt and prepares the answer. Once done, it will print the answer and the 4 sources it used as context from your documents; you can then ask another question without re-running the script, just wait for the prompt again.
Note: you could turn off your internet connection, and the script inference would still work. No data gets out of your local environment.
Type exit to finish the script.
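Under the hood this is a retrieval-augmented QA loop: embed the query, fetch candidate chunks from the Qdrant store, and stuff the top ones into the LLM prompt (CHAIN_TYPE=stuff). A simplified sketch with the LangChain wrappers, collapsing N_RETRIEVE_DOCUMENTS/N_FORWARD_DOCUMENTS into a single k and not claiming to mirror startLLM.py exactly:

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import LlamaCpp
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(path="db")  # same local store written by ingest
store = Qdrant(client=client, collection_name="documents", embeddings=embeddings)

llm = LlamaCpp(model_path="models/ggml-vic7b-q5_1.bin", n_ctx=1024, temperature=0.8)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                    # CHAIN_TYPE
    retriever=store.as_retriever(search_kwargs={"k": 6}),  # roughly N_FORWARD_DOCUMENTS
    return_source_documents=True,
)

while True:
    query = input("\nEnter a query: ")
    if query == "exit":
        break
    result = qa(query)
    print(result["result"])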
To use the Streamlit GUI instead, run:
streamlit run retrive/gui.py
see DISCLAIMER.md
Most importantly, the GPU_IS_ENABLED variable must be set to true. Add this to HuggingFaceEmbeddings:
embeddings_kwargs = {'device': 'cuda'} if gpu_is_enabled else {}
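For example, the kwargs can be forwarded to the LangChain HuggingFaceEmbeddings wrapper via model_kwargs (a sketch; gpu_is_enabled here stands for however the code reads GPU_IS_ENABLED):

import os
from langchain.embeddings import HuggingFaceEmbeddings

gpu_is_enabled = os.environ.get("GPU_IS_ENABLED", "false").lower() == "true"
embeddings_kwargs = {"device": "cuda"} if gpu_is_enabled else {}
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # TEXT_EMBEDDINGS_MODEL
    model_kwargs=embeddings_kwargs,  # sentence-transformers picks the device from here
)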
Set OS_RUNNING_ENVIRONMENT=windows inside the .env file, then install the Windows requirements:
pip3 install -r requirements_windows.txt
Install the Visual Studio 2019 - 2022 C++ compiler on Windows 10/11:
- Install Visual Studio.
- Make sure the following components are selected:
  - Universal Windows Platform development
  - C++ CMake tools for Windows
- Download the MinGW installer from the MinGW website.
- Run the installer and select the gcc component.
You can use the included installer batch file to install the required dependencies for GPU acceleration, or:
- Find your card driver here: NVIDIA Driver Downloads
- Install NVIDIA CUDA 11.8
- Install the llama-cpp-python package with cuBLAS enabled. Run the code below in the directory you want to build the package in.
  - PowerShell:
    $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; pip3 install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  - Bash:
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
- Enable GPU acceleration in the .env file by setting GPU_IS_ENABLED to true
- Run ingest.py and privateGPT.py as usual
If the above doesn't work for you, you will have to manually build the llama-cpp-python library with CMake:
- Get the repo: git clone https://github.com/abetlen/llama-cpp-python.git
  - switch to the tag this application is using (see the requirements-*.txt file)
  - uninstall your local llama-cpp-python: pip3 uninstall llama-cpp-python
- Open llama-cpp-python/vendor/llama.cpp/CMakeLists.txt in a text editor and add set(LLAMA_CUBLAS 1) at line 178, before the if (LLAMA_CUBLAS) line.
- Install CMake
- Go to the repo (cd llama-cpp-python) and perform the following:
  - git submodule update --init --recursive
  - mkdir build and cd build
- Build llama-cpp-python yourself:
  cmake -G "Visual Studio 16 2019" -A x64 -D CUDAToolkit_ROOT="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8" ..
- Change directory back to this project and install llama-cpp-python from the folder you built, e.g.:
  pip3 install ../llama-cpp-python/
The following was tested by @DanielusG on Linux (as far as I know):
Set OS_RUNNING_ENVIRONMENT=linux inside the .env file.
If you have an Nvidia GPU, you can speed things up by installing the llama-cpp-python version built with CUDA, by setting these flags: export LLAMA_CUBLAS=1
(Some libraries might differ per OS; that's why the requirements files are separated.)
pip3 install -r requirements_linux.txt
First, uninstall the old torch installation and install a proper CUDA-enabled version:
pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
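You can quickly confirm the CUDA wheel is the one in use:

import torch

print(torch.__version__)          # should end in +cu118 for the CUDA 11.8 wheel
print(torch.cuda.is_available())  # True if the GPU and driver are visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))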
Now, set environment variables and source them:
vim ~/.bashrc
export LLAMA_CUBLAS=1
export LLAMA_CLBLAST=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
source ~/.bashrc
llama.cpp doesn't work easily on Windows with GPU, so you should try a Linux distro. Installing torch with CUDA will only speed up the vector search, not the text generation done by llama.cpp.
You should install the latest CUDA toolkit:
conda install -c conda-forge cudatoolkit
If you're already in the conda env, you can uninstall llama-cpp-python like this:
pip3 uninstall llama-cpp-python
Then install llama-cpp-python with cuBLAS enabled:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.61 --no-cache-dir
Modify the LLM code to accept n_gpu_layers:
llm = LlamaCpp(model_path=model_path, ..., n_gpu_layers=20)
Change the model environment variables:
MODEL_TYPE=llamacpp
MODEL_ID_OR_PATH=models/ggml-vic13b-q5_1.bin
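Putting the two pieces together, the model construction might look roughly like this; a sketch only, with illustrative defaults rather than the repo's exact code:

import os
from langchain.llms import LlamaCpp

model_path = os.environ.get("MODEL_ID_OR_PATH", "models/ggml-vic13b-q5_1.bin")

# n_gpu_layers controls how many transformer layers llama.cpp offloads to the GPU;
# raise it until you run out of VRAM.
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=int(os.environ.get("MODEL_N_CTX", 1024)),
    temperature=float(os.environ.get("MODEL_TEMP", 0.8)),
    n_gpu_layers=20,
)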