Badge source
"Talk to your PDF" is an interactive app that allows users to upload PDF documents and ask questions about their content. Utilizing advanced NLP techniques like retrieval augmented generation (RAG), the app extracts text from the PDFs, generates embeddings of the PDF text and question, and provides relevant responses to user queries without the need for manual document navigation. This app is built from scratch, no framework (like LangChain) used.
- PDF text extraction and embedding generation for efficient information retrieval.
- User query processing with intent detection to ensure relevant and safe interactions.
- Database storage of PDF embeddings for quick access and response generation.
- Integration with Google Drive for secure and convenient file management.
- Clean and intuitive interface
- Python 3.11+
- Streamlit for the web interface
- OpenAI for embeddings and intent detection (using the moderation model) and response generation (using GPT-4 turbo)
- PostgreSQL with Supabase for database management
- pgvector PostgreSQL extension to handle vectors embedding
- Google Drive API for the PDF storage
- Requests and Tempfile for handling HTTP requests and temporary file management
- SQLAlchemy for connecting to the PostgreSQL database using Python to run SQL queries
Indexing
-
Loader: This component is responsible for loading the knowledge base into the system. The knowledge base contains a vast collection of information that the RAG system will draw from to generate responses.
-
Documents: After the knowledge base is loaded, it's organized into documents. These documents are then processed and made ready for the next steps. They could be anything from wiki articles to textbooks or any other structured collection of text. In our case we are using text extracted from the PDF.
-
Splitter: The splitter takes these documents and breaks them down into smaller, more manageable pieces known as document snippets. These snippets are easier to work with when matching the information to a query.
-
Embedding Machine: The document snippets are then fed into an embedding machine, which converts the text into embeddings. These embeddings are high-dimensional vectors (3072 for this app) that represent the semantic meaning of the snippets.
-
Embeddings: The result is a set of embeddings, each corresponding to a document snippet from the knowledge base.
-
Vector Database: The embeddings are stored in a vector database(PostgreSQL with pgvector extension). This database allows for efficient similarity searches; for example, when the system needs to find the snippet most closely related to a given query, it can perform a "closeness" query.
Retrieval
-
Question: This represents the user's query/question regarding the PDF, such as "How do I do X?"
-
Embedding Machine: The same embedding machine used during indexing processes the question to convert it into an embedding.
-
Embeddings: The question embedding is then used to search the vector database for the most relevant document snippets.
-
Vector Database: The system queries the vector database with the embedding of the question to find the closest matching document snippets.
-
Relevant Snippets: The most relevant snippets are retrieved from the database. These are the snippets that have the closest embeddings to the question's embedding.
Augmented Answer Generation
-
LLM (Large Language Model): The retrieved snippets are fed into a large language model along with the original question.
-
Answer: By considering both the user's question and the information contained within the relevant snippets, the language model generates an informed, well structured and contextually relevant answer, like "To do X...".
The user begins by uploading a PDF document through the application's interface. The backend receives this file and prepares it for text processing.
The PreRunProcessor
class takes charge by extracting text from the PDF. It then generates embeddings for the extracted text using the OpenAI API. These embeddings are numerical representations that encapsulate the meaning of the text segments, making them suitable for comparison and retrieval tasks.
When the user submits a query about the PDF's content, the IntentService
class performs two critical functions:
- Malicious Intent Detection: Utilizes OpenAI's moderation model to ensure the query does not contain inappropriate content.
- Relatedness Check: Converts the user's query into embeddings and checks if the query is relevant to the content of the PDF. This is done by comparing the query embeddings against the stored text embeddings from the PDF.
Upon establishing the relevance of the query to the PDF content, the InformationRetrievalService
class employs a cosine similarity search to find the most pertinent text segment that matches the query embeddings. Cosine similarity is a metric used to measure how similar the embeddings are, thus identifying the segment of the PDF that best addresses the user's query.
Finally, the ResponseService
class uses the information retrieved to generate a response to the user's query. It leverages OpenAI's ChatCompletion API to formulate an answer that is not only relevant but also articulated in a clear and informative manner.
This guide will walk you through setting up the "Talk to Your PDF" project from start to finish. Follow these steps to clone the repository, set up your environment, and start interacting with your PDFs in a new way.
-
Pre-Requisites
Before starting, ensure you have the following:
- Openai API account: Needed for the OpenAI API key.
- Python: Make sure Python is installed on your machine.
- Google Cloud Platform (GCP) Account: Needed for Google Drive API setup.
- Supabase Account: Required for database setup.
-
Clone the repository
Clone the "Talk to Your PDF" repository and navigate into the project directory
git clone https://github.com/semasuka/Talk-to-your-PDF cd Talk-to-your-PDF/
-
Create a Virtual Environment
Isolate your project's Python dependencies by creating a virtual environment
python -m venv venv source venv/bin/activate # Unix/macOS venv\Scripts\activate # Windows
-
Install dependencies
Install the necessary Python packages for the project
pip install -r requirements.txt
-
Environment Variables Setup
Add to your
venv
file in the project directory to securely store your environment variables. Initially, it will look like thisGOOGLE_APPLICATION_CREDENTIALS=path/to/your/google-credentials.json SUPABASE_POSTGRES_URL=your_supabase_postgres_url_here OPENAI_API_KEY=your_openai_api_key_here
-
Google drive API setup
-
Create a Project in Google Cloud Platform (GCP):
- Go to the Google Cloud Console.
- Click on "Select a project" at the top of the dashboard and then "New Project".
- Enter a project name and click "Create".
-
Enable Google Drive API:
- In the dashboard of your new project, navigate to "APIs & Services" > "Dashboard".
- Click on "+ ENABLE APIS AND SERVICES".
- Search for "Google Drive API" and enable it for your project.
-
Create Credentials:
- Go to "APIs & Services" > "Credentials".
- Click on "Create credentials" and select "Service account".
- Fill out the service account details and grant it a role of "Editor" (or a more restricted role if you know the specific permissions you need).
- Click "Create" and then "Done".
-
Generate a Key for the Service Account:
- Find your new service account in the "Credentials" page and click on it.
- Go to the "Keys" tab.
- Click on "Add Key" and choose "Create new key".
- Select "JSON" as the key type and click "Create".
- The JSON key file will be automatically downloaded. This file contains your API key.
- Important: Update the GOOGLE_APPLICATION_CREDENTIALS in your .env file with the path to this downloaded JSON file.
-
Configure the Python Script:
- Open the JSON key file with a text editor and find the
client_email
andprivate_key
fields. - In your Python script or environment, set the
client_email
andprivate_key
as environment variables, or directly insert them into thecredentials_info
dictionary within theupload_to_google_drive
function.
- Open the JSON key file with a text editor and find the
-
Set Environment Variables:
- For security reasons, it's best practice to use environment variables for sensitive information like API keys.
- You can set environment variables in your system, or use a
.env
file and load them withload_dotenv()
in your script. - Set
GOOGLE_APPLICATION_CREDENTIALS
as an environment variable pointing to the path of the downloaded JSON key file.
-
Install Required Libraries:
- Make sure that all required Python libraries, including
oauth2client
,PyDrive
, andgoogle-auth
, are installed in your environment. You can usually install these usingpip install
.
- Make sure that all required Python libraries, including
-
Use the Credentials in the Script:
- In the script, ensure that the
upload_to_google_drive
function is correctly configured to authenticate with Google using the service account credentials.
- In the script, ensure that the
-
Test the Setup:
- Run your script in a secure, local development environment first to ensure the Google Drive API is being called correctly and the file is being uploaded.
- Make sure to handle the
credentials_dict
properly without exposing your private keys in the source code if you're going to share the script or use it in a public setting.
-
-
Supabase database setup
-
Sign Up or Log In to Supabase
- Navigate to Supabase and sign in or create a new account.
- Once logged in, create a new project by clicking on the 'New Project' button.
-
Project Configuration
- Fill in the necessary details for your project, such as the name and password.
- Wait for the project to be fully deployed. This might take a couple of minutes.
-
Database Setup
- Once the project is ready, go to the SQL editor in the Supabase dashboard of your project.
- You can run SQL commands here to set up your database schemas and extensions.
-
Install pgvector Extension
- Execute the following SQL command to add the
pgvector
extension to your database:CREATE EXTENSION IF NOT EXISTS vector;
- This command sets up the database to handle vector data types which are required for the embeddings.
- Execute the following SQL command to add the
-
Create Table with Vector Column
- Create a table that will store your embeddings with a vector column by running:
CREATE TABLE pdf_holder ( id SERIAL PRIMARY KEY, text TEXT, embedding VECTOR(3072) );
- This table will be used to store the text from the PDFs and their corresponding embeddings.
- Create a table that will store your embeddings with a vector column by running:
-
Retrieve Supabase Credentials
- After setting up your table, go to the 'Settings' > 'Database' section of your project's dashboard.
- Here you can find the
SUPABASE_POSTGRES_URL
which is required for connecting to the database from your application.
-
Set Up Environment Variables
- In your application environment, set the
SUPABASE_POSTGRES_URL
as an environment variable. - This is essential for your application to connect and interact with your Supabase database securely.
- In your application environment, set the
-
Integrate with the Application
- Use the
create_engine
function from thesqlalchemy
library to connect to your Supabase database using the URL from the previous step. - Utilize the database session to execute your operations, such as storing and retrieving vector data.
- Use the
-
Test the Database Connection
- Before proceeding, ensure that your application can connect to the Supabase database and perform operations.
- Try inserting a dummy record or retrieving data from the
pdf_holder
table as a test.
-
Deploy Your Application
- After confirming the database connection, you can proceed to deploy your application.
- Monitor the application logs to ensure there are no connection issues with the Supabase database.
-
-
OpenAI API Setup
- Obtain your OpenAI API key from the OpenAI API platform.
- Fill in the
OPENAI_API_KEY
in your.env
file with your key.
-
Run the app locally
Go to the directory that contains the app and run
streamlit run streamlit_app.py
-
Testing and Validation
To verify your setup:
- Perform a test upload to Google Drive through the app.
- Ensure the app can process queries about the uploaded PDFs correctly.
-
Set up deploy and set up the Streamlit secrets
If the app is running fine locally you can deploy it on Streamlit share and add your OpenAI API key, Supabase URL, and Google drive credentials (JSON file from your service account) to the TOML placeholder. Your TOML key-value pairs should look like this
SUPABASE_POSTGRES_URL="postgresql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" OPENAI_API_KEY="<OPENAI_API_KEY_PLACEHOLDER>" [google_credentials] type = "service_account" project_id = "<PROJECT_ID_PLACEHOLDER>" private_key_id = "<PRIVATE_KEY_ID_PLACEHOLDER>" private_key = """ -----BEGIN PRIVATE KEY----- <PRIVATE_KEY_CONTENTS_PLACEHOLDER> -----END PRIVATE KEY----- """ client_email = "<CLIENT_EMAIL_PLACEHOLDER>" client_id = "<CLIENT_ID_PLACEHOLDER>" auth_uri = "https://accounts.google.com/o/oauth2/auth" token_uri = "https://oauth2.googleapis.com/token" auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs" client_x509_cert_url = "<CLIENT_X509_CERT_URL_PLACEHOLDER>" universe_domain = "googleapis.com"
-
Deploying and using the Application
Now the app should be deployed to the Streamlit share link, upload PDF files and explore the application's features by asking questions related to the PDF content.
Video to gif tool
├── assets
│ ├── gif_streamlit.gif <- gif file used in the README.
│ ├── how_it_works.png <- a diagram picture to show how the app works and how different services interact with each other.
│ ├── rag_explained.jpeg <- a picture with RAG explanation.
│ ├── talk_to_your_pdf_banner.png <- banner image used in the README.
│
│
├── notebook_repository
│ ├── information_retrieval_service.ipynb <- contains all the logic for information retrieval.
│ ├── intent_service.ipynb <- contains all the logic for question intent check.
│ ├── main.ipynb <- the final file that calls all the other notebooks that holds all the services.
│ ├── pre_run_service.ipynb <- contains all the pre-run processes.
│ ├── response_service.ipynb <- contains all the logic for response process.
│
│
├── .gitignore <- used to ignore certain folder and files that won't be commit to git.
│
│
├── LICENSE <- license file.
│
│
├── README.md <- this readme file.
│
│
├── requirements.txt <- list of all the dependencies (used for Streamlit).
│
│
├── streamlit_app.py <- file with the all codes and Streamlit component for rendering the interface.
Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change. Please ensure to update tests as appropriate.
MIT License
Copyright (c) 2024 Stern Semasuka
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Learn more about MIT license