A modular RAG (Retrieval-Augmented Generation) application for querying invoice documents using hierarchical chunking and semantic search.
RAG/
├── app.py # Main Streamlit application (refactored)
├── app_original.py # Original monolithic version (backup)
├── config.py # Configuration settings and prompts
├── database_manager.py # ChromaDB operations and collection management
├── document_processor.py # PDF processing and hierarchical chunking
├── ui_components.py # Reusable Streamlit UI components
├── utils.py # Utility functions and helpers
├── callbacks.py # Custom callback handlers for progress tracking
├── chroma_db/ # ChromaDB persistent storage
└── README.md # This file
- `app.py`: Main application entry point
  - `InvoiceQAApp` class orchestrates the entire application
  - Clean separation between UI rendering and business logic
  - Modular tab-based interface
- `config.py`: Centralized configuration
  - `AppConfig`: Application settings and constants
  - `PromptTemplates`: LLM prompt templates
  - `FilterConfig`: Database query filters
  - `UIMessages`: Standardized UI text and labels
- `database_manager.py`: Database operations
  - `DatabaseManager` class handles all ChromaDB operations
  - Collection management and querying
  - Hierarchical retrieval system
  - Collection statistics and analysis
- `document_processor.py`: Document processing
  - `DocumentProcessor` class handles PDF processing
  - Hierarchical chunking (document + section levels)
  - Semantic chunking with progress tracking
  - Vector store creation with batch processing
- `ui_components.py`: Reusable UI components
  - `CollectionSelector`: Collection picker interface
  - `QuickQuestions`: Pre-defined question buttons
  - `CollectionOverview`: Collections management tab
  - `QueryInterface`: Query processing and response display
- `utils.py`: Utility functions
  - File handling utilities
  - Metadata formatting helpers
  - Progress tracking utilities
  - UI display helpers
- `callbacks.py`: Custom callback handlers
  - `StreamlitProgressCallback`: Real-time progress tracking
  - `LoggingCallback`: Chain execution logging
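The hierarchical chunking listed under `document_processor.py` (document and section levels) can be pictured with a small standalone sketch. Everything here is an assumption for illustration: the `Chunk` type, the `hierarchical_chunks` function, and the metadata keys are hypothetical, and fixed-size splitting stands in for the semantic chunking the real module performs.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical container for a piece of text plus its metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def hierarchical_chunks(doc_id: str, text: str, section_size: int = 512) -> list[Chunk]:
    """Return one document-level chunk followed by section-level chunks.

    Fixed-size character windows are used here as a stand-in for
    semantic splitting.
    """
    # Level 1: the whole document, so broad questions can match it.
    chunks = [Chunk(text, {"doc_id": doc_id, "level": "document"})]

    # Level 2: smaller sections that point back to their parent document.
    for index, start in enumerate(range(0, len(text), section_size)):
        chunks.append(
            Chunk(
                text[start:start + section_size],
                {"doc_id": doc_id, "level": "section", "section_index": index},
            )
        )
    return chunks
```

Tagging each chunk with a `level` field is one way to let a retriever query the document level first and then narrow to sections of the matching documents.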
- Query without upload: Use existing collections without uploading new files
- Collection selector: Easy switching between tenant/document type combinations
- Progress tracking: Detailed progress indicators for all operations
- Hierarchical chunking: Document and section-level chunks for better retrieval
- Duplicate detection: Prevents processing the same document twice
- Collection management: Overview and statistics for all collections
- Modular architecture: Clean separation of concerns
- Type hints: Better code documentation and IDE support
- Error handling: Robust error management throughout
- Configuration management: Centralized settings and prompts
- Reusable components: DRY principle implementation
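Duplicate detection of the kind listed above is commonly built on a content hash. The sketch below is a minimal illustration under that assumption. `file_hash` and `DuplicateTracker` are hypothetical names, not the project's API (`utils.py` exports `generate_file_hash`, but its signature is not shown here), and an in-memory set stands in for whatever persistent store the application uses.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


class DuplicateTracker:
    """Hypothetical helper that remembers which file contents were seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, data: bytes) -> bool:
        """Return True the first time given contents appear, False afterwards."""
        digest = file_hash(data)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Hashing the bytes rather than the filename means a renamed copy of the same invoice is still recognized as a duplicate.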
# Database settings
CHROMA_PERSIST_DIR = "./chroma_db"
# Document processing
SEMANTIC_CHUNK_THRESHOLD = 75
RECURSIVE_CHUNK_SIZE = 512
BATCH_SIZE = 50
# LLM settings
DEFAULT_MODEL = "deepseek-r1:8b"

The invoice analysis prompt is centralized in `PromptTemplates.INVOICE_QA_PROMPT` and can be modified for different use cases.
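As a rough picture of what centralized settings and prompts might look like, here is a minimal sketch. It is not the contents of `config.py`: using a frozen dataclass is an assumption, and the prompt wording is a placeholder, since the real `INVOICE_QA_PROMPT` text is not reproduced in this README. Only the setting names and values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    """Settings from this README, grouped in one immutable object (assumed layout)."""
    CHROMA_PERSIST_DIR: str = "./chroma_db"
    SEMANTIC_CHUNK_THRESHOLD: int = 75
    RECURSIVE_CHUNK_SIZE: int = 512
    BATCH_SIZE: int = 50
    DEFAULT_MODEL: str = "deepseek-r1:8b"


class PromptTemplates:
    # Placeholder wording only; the project's actual prompt is not shown here.
    INVOICE_QA_PROMPT = (
        "Answer the question using only the invoice context.\n"
        "Context: {context}\n"
        "Question: {question}"
    )


config = AppConfig()

# Fill the template with illustrative values.
prompt = PromptTemplates.INVOICE_QA_PROMPT.format(
    context="(retrieved invoice text goes here)",
    question="What is the total amount due?",
)
```

Keeping the prompt as a single template string means a different use case only needs a new constant, not changes to the query code.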
# Run the application
streamlit run app.py

# Process documents
from document_processor import DocumentProcessor
processor = DocumentProcessor()
db, documents = processor.process_documents_with_progress(file, tenant_id, doc_type)
# Manage collections
from database_manager import DatabaseManager
db_manager = DatabaseManager()
collections = db_manager.get_available_collections()
# Use UI components
from ui_components import CollectionSelector
selector = CollectionSelector(collections)
tenant_id, doc_type = selector.render()

- Separated concerns into focused modules
- Clear interfaces between components
- Easy to test individual components
- Easy to add new document types
- Pluggable UI components
- Configurable processing parameters
- Components can be used independently
- Utility functions are module-agnostic
- Clean API design
- Efficient batch processing
- Progress tracking for better UX
- Optimized database operations
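The batch processing and progress tracking mentioned above can be sketched without any external library. The helper names `iter_batches` and `process_in_batches` are hypothetical. The default batch size of 50 mirrors `BATCH_SIZE` from the configuration, and the `on_progress` callback is an assumed stand-in for the project's Streamlit progress handlers.

```python
from collections.abc import Callable, Iterator, Sequence
from typing import Optional, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 50) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    items: Sequence[T],
    handle_batch: Callable[[Sequence[T]], None],
    batch_size: int = 50,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Send items to handle_batch in chunks and report (done, total) after each."""
    total = len(items)
    done = 0
    for batch in iter_batches(items, batch_size):
        handle_batch(batch)
        done += len(batch)
        if on_progress is not None:
            on_progress(done, total)
    return done
```

In a setup like this, `handle_batch` would be the call that adds chunks to the vector store, and `on_progress` would update a progress bar.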
app.py
├── config.py
├── database_manager.py
│ └── utils.py
├── document_processor.py
│ └── utils.py
├── ui_components.py
│ ├── utils.py
│ └── database_manager.py
└── callbacks.py
    └── config.py
Each module can be tested independently:
# Test document processing
from document_processor import DocumentProcessor
processor = DocumentProcessor()
# Test database operations
from database_manager import DatabaseManager
db_manager = DatabaseManager()
# Test utilities
from utils import generate_file_hash, create_collection_name

When adding new features:
- Follow the modular pattern: Add functionality to the appropriate module
- Update configuration: Add new settings to `config.py`
- Create reusable components: Add UI components to `ui_components.py`
- Add utilities: Common functions go in `utils.py`
- Document changes: Update this README and add docstrings
- All functionality is preserved
- Configuration is now centralized
- UI components are reusable
- Better error handling and progress tracking
- The original `app_original.py` is kept as a backup
- Import structure changed (now uses custom modules)
- Some internal function names changed
- Configuration moved to `config.py`
This refactored version maintains all original functionality while providing a much cleaner, more maintainable, and extensible codebase.