LocalRQA is a toolkit for building and running your own private "ChatGPT" that is an expert on your specific documents. It uses the Retrieval-Augmented Generation (RAG) technique to provide answers that are factually grounded in a knowledge base you provide, ensuring all processing is handled locally to maintain data privacy.
- Retrieval-Augmented Generation (RAG): Ensures answers are based on provided documents, reducing factual errors and hallucinations.
- 100% Local & Private: Your documents, questions, and the AI model's processing never leave your machine.
- GPU Accelerated: Designed to run on local NVIDIA GPUs for high-performance inference.
- Interactive UI: Comes with a Gradio-based web interface for easy interaction and demonstration.
The original installation process is not compatible with modern Windows environments. The following is a definitive guide that includes all necessary fixes.
Before you begin, ensure you have installed the following:
- Git for Windows: Download here
- Python 3.10 (64-bit): Download here.
- CRITICAL: During installation, you MUST check the box "Add Python 3.10 to PATH".
- Visual Studio Build Tools: Download here.
- CRITICAL: During installation, you MUST select the "Desktop development with C++" workload.
Open your terminal (Windows PowerShell or Command Prompt).
# Clone the project from GitHub
git clone https://github.com/jasonyux/LocalRQA.git
# Navigate into the project folder
cd LocalRQA
# Create a Python 3.10 virtual environment
py -3.10 -m venv rtx_env
# Activate the environment
.\rtx_env\Scripts\activate

You must make these changes before installing dependencies.
A. Edit setup.py:
- Open the setup.py file.
- Inside the install_requires=[ ... ] list, find and delete the entire lines for 'deepspeed', 'faiss-gpu', and 'flash_attn'.
- Save the file.
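For orientation only, the trimmed list might look roughly like the sketch below. Everything here apart from the three deleted names is a placeholder; the real setup.py lists the project's full dependency set and metadata.

```python
# Hypothetical sketch of setup.py after the edit -- the real file contains the
# project's full dependency list; only the three problematic packages are removed.
from setuptools import setup, find_packages

setup(
    name="local_rqa",                # placeholder metadata
    packages=find_packages(),
    install_requires=[
        "transformers",              # illustrative remaining entries
        "gradio",
        "langchain",
        # "deepspeed",   <- line deleted
        # "faiss-gpu",   <- line deleted
        # "flash_attn",  <- line deleted
    ],
)
```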
B. Check huggingface.py:
- Open the file local_rqa\qa_llms\huggingface.py.
- Ensure the line model = model.cuda() (around line 58) is active (it should NOT have a # in front of it).
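Because model = model.cuda() moves the model onto the GPU unconditionally, it will fail at launch time if PyTorch cannot see your card. Once PyTorch is installed in the step below, a quick sanity check like this (a minimal sketch, run inside the activated rtx_env) confirms the GPU is visible:

```python
import torch

# .cuda() in huggingface.py assumes a CUDA-capable device is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```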
C. Create Runner Scripts:
- In the main LocalRQA folder, create the following three new files:

  run_controller.py:

      import subprocess
      import sys
      subprocess.run([sys.executable, 'local_rqa/serve/controller.py'] + sys.argv[1:])

  run_worker.py:

      import subprocess
      import sys
      subprocess.run([sys.executable, 'local_rqa/serve/model_worker.py'] + sys.argv[1:])

  run_web.py:

      import subprocess
      import sys
      subprocess.run([sys.executable, 'local_rqa/serve/gradio_web_server.py'] + sys.argv[1:])
D. Prepare the Database:
- Run these commands in your activated terminal:

# Create the folder the code expects
mkdir example\databricks\database_fixed
# Copy and rename the database file
copy example\databricks\database\databricks.pkl example\databricks\database_fixed\documents.pkl
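To confirm the copy worked, a quick check like the sketch below can be run from the repo root. It only assumes documents.pkl is an ordinary pickle file; if the pickle stores custom document classes, run it after the installation step below so those classes can be imported, and adjust the summary lines to whatever object type it actually contains.

```python
import pickle
from pathlib import Path

db_path = Path("example/databricks/database_fixed/documents.pkl")

# Load the copied database file and print a rough summary of its contents.
with db_path.open("rb") as f:
    docs = pickle.load(f)

print("Loaded object of type:", type(docs).__name__)
try:
    print("Number of items:", len(docs))
except TypeError:
    pass  # the pickled object may not be a sized container
```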
Run these commands one by one in your activated (rtx_env) terminal.
# 1. Install the GPU version of PyTorch for CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# 2. Install the stable CPU version of FAISS for Windows
pip install faiss-cpu
# 3. Install missing dependencies
pip install -U langchain-community uvicorn
# 4. Install the project itself and all other requirements
pip install -e .

After a successful installation, open three separate terminals. In each one, navigate to your LocalRQA folder and activate the environment (.\rtx_env\Scripts\activate).
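Before launching anything in those terminals, an optional quick check confirms the key packages installed above resolve inside rtx_env (a minimal sketch):

```python
# Post-install sanity check: these imports should all succeed inside the
# activated rtx_env if the steps above completed without errors.
import faiss
import torch
import uvicorn

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss", getattr(faiss, "__version__", "imported OK"))
print("uvicorn", uvicorn.__version__)
```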
- In Terminal 1 (Launch the Controller):
python run_controller.py
- In Terminal 2 (Launch the Model Worker): Note: The first time you run this, it will download the language model (approx. 2 GB). This command is tailored for GPUs with low VRAM (e.g., RTX 3050).
python run_worker.py --qa_model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --database_path example/databricks/database_fixed --load_8bit
Wait for this to finish loading. It will say "Uvicorn running on..."
- In Terminal 3 (Launch the Web UI): Wait for Terminal 2 to be ready, then run this.
python run_web.py --example "What is LocalRQA?"
Finally, open your web browser and navigate to http://localhost:7860 to use the application.
The system uses a two-step "Open-Book Exam" process:
- Retrieval: When you ask a question, the system first searches its knowledge base (the documents.pkl file) to find the most relevant text chunks.
- Generation: It then gives your question and these retrieved chunks to the language model (TinyLlama), which generates an answer based only on the provided facts.
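As a toy illustration of that two-step flow (not LocalRQA's actual code: the keyword-overlap retriever and the prompt template below are simple stand-ins for the real FAISS vector search and the TinyLlama call):

```python
import re

# Toy illustration of retrieve-then-generate. LocalRQA's real pipeline uses a
# FAISS vector index for retrieval and a language model for generation.
knowledge_base = [
    "LocalRQA is a toolkit for building retrieval-augmented QA systems.",
    "Databricks is a unified analytics platform.",
    "The retriever returns the most relevant text chunks for a question.",
]

def tokenize(text: str) -> set[str]:
    # Lowercase word tokens; good enough for a toy example.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by naive keyword overlap with the question.
    q_words = tokenize(question)
    return sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)[:k]

def build_prompt(question: str, context: list[str]) -> str:
    # In the real system, a prompt like this is handed to the language model,
    # which must answer using only the retrieved chunks.
    lines = ["Answer using only the context below.", ""]
    lines += [f"- {chunk}" for chunk in context]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

question = "What is LocalRQA?"
chunks = retrieve(question, knowledge_base)
print(build_prompt(question, chunks))
```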