A Python-based RAG system that processes PDF and DOCX files, creates embeddings for document chunks, uses FAISS for efficient similarity search, and generates responses based on retrieved contexts.
**Document Processing**
- Support for PDF and DOCX files
- Configurable text chunking with overlap
- Automatic handling of document boundaries
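The overlapping-chunk behaviour above can be sketched as follows. `chunk_text` is an illustrative helper, not the library's actual API; the real `DocumentProcessor` may split by tokens or characters rather than whitespace words:

```python
def chunk_text(text, chunk_size=256, chunk_overlap=64):
    """Split text into fixed-size word chunks with overlap.

    Illustrative sketch only: each chunk starts (chunk_size - chunk_overlap)
    words after the previous one, so consecutive chunks share
    chunk_overlap words of context.
    """
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; avoid tiny trailing fragments
    return chunks

chunks = chunk_text("word " * 600, chunk_size=256, chunk_overlap=64)
print(len(chunks))  # 3 chunks: starts at word 0, 192, and 384
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in the neighbouring chunk.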
**Embedding Creation**
- Uses SentenceTransformer models
- Efficient caching system for embeddings
- Support for different embedding models
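The caching idea can be sketched with a file-backed store keyed by a hash of the text. Everything here is an assumption for illustration: `embed` is a stand-in for a real model call (such as `SentenceTransformer('all-mpnet-base-v2').encode`), and the class name is hypothetical:

```python
import hashlib
import json
import os

def embed(texts):
    # Stand-in for a real embedding model; returns toy 2-d vectors
    # so the caching logic can be demonstrated without model weights.
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

class EmbeddingCache:
    """Sketch of a file-backed embedding cache keyed by SHA-256 of the text."""
    def __init__(self, cache_dir="cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".json")

    def get_embeddings(self, texts):
        # Embed only the texts we have not seen before...
        misses = [t for t in texts if not os.path.exists(self._path(t))]
        for t, vec in zip(misses, embed(misses)):
            with open(self._path(t), "w") as f:
                json.dump(vec, f)
        # ...then serve everything from the cache.
        results = []
        for t in texts:
            with open(self._path(t)) as f:
                results.append(json.load(f))
        return results

cache = EmbeddingCache("cache_demo")
vecs = cache.get_embeddings(["hello world", "hello world"])
print(vecs[0] == vecs[1])  # True: identical text yields identical embedding
```

Repeated queries and re-indexing runs then skip the expensive model call entirely.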
**Similarity Search**
- FAISS-based vector search
- Multiple index types (Flat, IVF, HNSW)
- Optimizable search parameters
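A flat index performs exact nearest-neighbour search: it scores the query against every stored vector and keeps the top k. The NumPy sketch below shows that behaviour (it is a stand-in for what FAISS does internally, not FAISS itself):

```python
import numpy as np

def flat_search(index_vectors, query, k=3):
    """Exact top-k search, as a flat index performs it.

    Uses inner product on L2-normalised vectors, which equals
    cosine similarity.
    """
    index_norm = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = index_norm @ q                # similarity to every stored vector
    top = np.argsort(-scores)[:k]          # indices of the k best matches
    return top, scores[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8)).astype("float32")
ids, scores = flat_search(vectors, vectors[42], k=3)
print(int(ids[0]))  # 42: the query vector is its own nearest neighbour
```

IVF and HNSW trade a little of this exactness for much lower search cost by visiting only a subset of the stored vectors.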
**Response Generation**
- Context-based response generation
- Multiple context combination strategies
- Configurable context length
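The two combination strategies can be sketched as below; the function name, signature, and truncation behaviour are illustrative assumptions, not the library's actual internals:

```python
def combine_contexts(contexts, scores, strategy="concatenate", max_length=1000):
    """Sketch of the two context-combination strategies.

    contexts/scores are the retrieved chunks and their similarity scores.
    """
    if strategy == "best_match":
        # Keep only the single highest-scoring context.
        best = max(range(len(scores)), key=scores.__getitem__)
        combined = contexts[best]
    else:  # "concatenate"
        # Join all retrieved contexts into one passage.
        combined = "\n\n".join(contexts)
    return combined[:max_length]  # enforce the configurable context length

ctxs = ["alpha context", "beta context", "gamma context"]
print(combine_contexts(ctxs, [0.2, 0.9, 0.5], strategy="best_match"))
# best_match keeps only "beta context"
```

`concatenate` gives the generator more evidence at the cost of a longer prompt; `best_match` keeps the prompt short when one chunk clearly dominates.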
**Optimization Features**
- Hyperparameter tuning
- Index parameter optimization
- Performance metrics tracking
- Caching system
- Batch processing support
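The hyperparameter tuning listed above can be sketched as an exhaustive grid search. This is a toy illustration of the idea, with a made-up objective; it is not the system's actual `optimize_retrieval` implementation:

```python
from itertools import product

def grid_search(evaluate, param_grid):
    """Try every parameter combination and keep the best-scoring one.

    `evaluate` is an assumed scoring callback (e.g. retrieval quality
    on a set of held-out queries).
    """
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective: prefer mid-sized chunks with moderate overlap.
objective = lambda p: -abs(p["chunk_size"] - 256) - abs(p["chunk_overlap"] - 64)
params, score = grid_search(objective, {
    "chunk_size": [128, 256, 512],
    "chunk_overlap": [0, 64, 128],
})
print(params)  # {'chunk_size': 256, 'chunk_overlap': 64}
```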
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd rag-system
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
```python
from rag_system import RAGSystem

# Initialize the system
rag = RAGSystem(
    embedding_model='all-mpnet-base-v2',
    chunk_size=256,
    chunk_overlap=64,
    cache_dir='cache'
)

# Add documents
rag.add_documents('path/to/documents')

# Query the system
response = rag.query("What is the main topic of the documents?")
print(response)
```
The system can process both PDF and DOCX files:
```python
# Process all documents in a directory
rag.add_documents('path/to/documents')

# Process a single document
rag.add_document('path/to/document.pdf')
```
```python
# Simple query
response = rag.query("What are the key findings?")

# Query with metadata
response = rag.query(
    "What methodology was used?",
    k=5,  # Number of contexts to retrieve
    include_metadata=True
)
```
The system supports different FAISS index types:
```python
# Flat index (exact search)
rag = RAGSystem(index_type='flat')

# IVF index (approximate search with clustering)
rag = RAGSystem(index_type='ivf')

# HNSW index (graph-based approximate search)
rag = RAGSystem(index_type='hnsw')
```
Choose how to combine multiple relevant contexts:
```python
# Concatenate all relevant contexts
rag = RAGSystem(context_strategy='concatenate')

# Use only the best matching context
rag = RAGSystem(context_strategy='best_match')
```
```python
# Optimize retrieval parameters
best_params = rag.optimize_retrieval([
    "What are the key findings?",
    "What methodology was used?",
    "What are the main conclusions?"
])
```
```python
# Save the index
rag.save_index('path/to/index')

# Load the index
rag.load_index('path/to/index')
```
- `DocumentProcessor`: Handles document loading and text chunking
- `EmbeddingEngine`: Creates and manages embeddings
- `RetrievalEngine`: Handles the FAISS index and similarity search
- `ResponseGenerator`: Generates responses from retrieved contexts
- `RAGSystem`: Main class that orchestrates all components
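The components above might be wired together as in this minimal sketch. The internal method names (`process`, `embed`, `add`, `search`, `generate`) are illustrative assumptions, not the library's actual API:

```python
class RAGSystem:
    """Illustrative orchestration skeleton for the components above."""
    def __init__(self, processor, embedder, retriever, generator):
        self.processor = processor    # DocumentProcessor
        self.embedder = embedder      # EmbeddingEngine
        self.retriever = retriever    # RetrievalEngine
        self.generator = generator    # ResponseGenerator

    def add_documents(self, path):
        chunks = self.processor.process(path)      # load + chunk
        vectors = self.embedder.embed(chunks)      # chunk -> vector
        self.retriever.add(vectors, chunks)        # index for search

    def query(self, question, k=5):
        qvec = self.embedder.embed([question])[0]
        contexts = self.retriever.search(qvec, k)  # top-k chunks
        return self.generator.generate(question, contexts)

# Tiny stand-in components to show the data flow end to end:
class Stub:
    def __init__(self, **fns): self.__dict__.update(fns)

rag = RAGSystem(
    processor=Stub(process=lambda path: ["chunk one", "chunk two"]),
    embedder=Stub(embed=lambda texts: [[float(len(t))] for t in texts]),
    retriever=Stub(add=lambda v, c: None, search=lambda q, k: ["chunk one"]),
    generator=Stub(generate=lambda q, ctx: f"Answer based on: {ctx[0]}"),
)
rag.add_documents("docs/")
print(rag.query("What is this?"))  # Answer based on: chunk one
```

Each component hides one concern, so an index type or embedding model can be swapped without touching the rest of the pipeline.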
**Caching**
- Enable caching by providing a cache directory
- Embeddings and indices are cached for reuse
- Reduces computation time for repeated operations
**Chunking Strategy**
- Adjust chunk size based on your documents
- Use appropriate overlap for context continuity
- Consider semantic chunking for better results
**Index Selection**
- Flat index: Best for small datasets (<100K documents)
- IVF index: Good for medium datasets
- HNSW index: Best for large datasets
**Parameter Tuning**
- Use optimize_retrieval() for index parameters
- Adjust batch size for large document collections
- Configure context length based on your needs
A comprehensive Jupyter notebook (`rag_system.ipynb`) is included that demonstrates:
- System setup and configuration
- Document processing
- Querying and response generation
- System optimization
- Advanced usage examples
The system includes a Gradio-based web interface for easy interaction. To launch the interface:
```bash
python gradio_interface.py
```
The interface provides three main tabs:
**Upload Documents**
- Upload individual PDF/DOCX files
- Process entire directories of documents
- View processing status
**Configuration**
- Adjust chunk size and overlap
- Select index type (Flat, IVF, HNSW)
- Choose context strategy
- Set maximum context length
**Query**
- Enter questions
- Adjust number of contexts to retrieve
- View answers with source documents
- Toggle metadata display
The interface automatically manages document processing, embedding creation, and query handling through an intuitive UI.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.