This project provides a Flask-based web service for uploading PDF documents, extracting text using OCR, and setting up a question-answering (QA) system using language models and embeddings. The service processes uploaded PDFs, stores embeddings in a Chroma database, and allows users to query the processed documents.
- PDF Upload: Upload PDF documents through a REST API endpoint.
- Text Extraction: Extract text from PDF files using PaddleOCR.
- Text Splitting: Split extracted text into manageable chunks.
- Embedding: Convert text chunks into embeddings using a HuggingFace embedding model.
- Database Storage: Store embeddings in a Chroma vector database.
- Question Answering: Query the processed documents using a custom QA chain.
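The splitting and retrieval steps in the list above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: `split_text`, `toy_embed`, `cosine`, and `top_chunk` are hypothetical names, the bag-of-words vector stands in for a real HuggingFace embedding, and the in-memory list stands in for the Chroma database.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunk(chunks, query):
    """Return the chunk most similar to the query (stand-in for a vector store lookup)."""
    q = toy_embed(query)
    return max(chunks, key=lambda c: cosine(toy_embed(c), q))
```

In the real service the retrieved chunks would then be passed to the language model as context for the answer; the sketch stops at retrieval.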
- Python 3.7 or higher
- Required Python packages (see requirements.txt)
- Clone the repository:
  git clone https://github.com/gautamraj8044/PDF-Document-Processing-and-QA-Bot
- Navigate to the project directory:
  cd Chat-Bot
- Install the required Python packages:
  pip install -r requirements.txt
- Set up Poppler: Download and install Poppler, and update the poppler_path variable in the code to point to your Poppler installation directory.
- Configure Model Paths: Update the local_llm variable with the path to your local language model file.
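Because both settings are filesystem paths, a quick check before starting the server can catch a wrong value early. The helper below is a minimal sketch: `check_paths` is a hypothetical function that is not part of the project, and it assumes that `poppler_path` should be a directory and `local_llm` a single model file.

```python
from pathlib import Path


def check_paths(poppler_path, local_llm):
    """Return a list of problems; an empty list means both paths look usable.

    Assumption: poppler_path is a directory and local_llm is a file.
    """
    problems = []
    if not Path(poppler_path).is_dir():
        problems.append(f"poppler_path is not a directory: {poppler_path}")
    if not Path(local_llm).is_file():
        problems.append(f"local_llm is not a file: {local_llm}")
    return problems
```

Calling `check_paths(poppler_path, local_llm)` with your own values and printing the result before running the app shows which of the two settings still needs to be updated.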
- Start the Flask server:
  python app.py
- Access the API:
  - Upload a PDF: POST request to /upload with a file attachment.
  - Ask a Question: POST request to /ask with the question in the form data.
- Endpoint: /upload
- Method: POST
- Request: Form-data with the PDF attached under the key file.
- Response: JSON message indicating the status of the upload and processing.
- Endpoint: /ask
- Method: POST
- Request: Form-data with the key query containing the question.
- Response: JSON with the answer to the question.
curl -X POST http://localhost:5000/upload -F "file=@path_to_your_pdf.pdf"
curl -X POST http://localhost:5000/ask -F "query=What is the main topic of the document?"
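The same two requests can be made from Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand; `encode_multipart` and `post` are hypothetical helper names, and the field names `file` and `query` and the base URL are taken from the curl commands above. It assumes both endpoints return JSON, as described in the endpoint list.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:5000"  # same host and port as the curl examples


def encode_multipart(fields=None, files=None):
    """Encode form fields and files as multipart/form-data.

    fields: dict mapping field name -> str value
    files:  dict mapping field name -> (filename, bytes content)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (fields or {}).items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    for name, (filename, data) in (files or {}).items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            b"Content-Type: application/pdf",
            b"",
            data,
        ]
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def post(path, fields=None, files=None):
    """POST a multipart form to the service and decode the JSON response."""
    body, content_type = encode_multipart(fields, files)
    request = urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post("/upload", files={"file": ("document.pdf", pdf_bytes)})` mirrors the first curl command, where `pdf_bytes` holds the contents of your PDF, and `post("/ask", fields={"query": "What is the main topic of the document?"})` mirrors the second.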
This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to submit issues or pull requests. Please follow the project's coding style and guidelines.
For any questions or issues, please contact gautamraj8044@gmail.com.