Invoice Q&A Assistant - Refactored Version

A modular RAG (Retrieval-Augmented Generation) application for querying invoice documents using hierarchical chunking and semantic search.

📁 Project Structure

RAG/
├── app.py                   # Main Streamlit application (refactored)
├── app_original.py          # Original monolithic version (backup)
├── config.py                # Configuration settings and prompts
├── database_manager.py      # ChromaDB operations and collection management
├── document_processor.py    # PDF processing and hierarchical chunking
├── ui_components.py         # Reusable Streamlit UI components
├── utils.py                 # Utility functions and helpers
├── callbacks.py             # Custom callback handlers for progress tracking
├── chroma_db/               # ChromaDB persistent storage
└── README.md                # This file

🏗️ Architecture Overview

Core Modules

app.py - Main application entry point
- InvoiceQAApp class orchestrates the entire application
- Clean separation between UI rendering and business logic
- Modular tab-based interface

config.py - Centralized configuration
- AppConfig: Application settings and constants
- PromptTemplates: LLM prompt templates
- FilterConfig: Database query filters
- UIMessages: Standardized UI text and labels

database_manager.py - Database operations
- DatabaseManager class handles all ChromaDB operations
- Collection management and querying
- Hierarchical retrieval system
- Collection statistics and analysis

document_processor.py - Document processing
- DocumentProcessor class handles PDF processing
- Hierarchical chunking (document + section levels)
- Semantic chunking with progress tracking
- Vector store creation with batch processing

ui_components.py - Reusable UI components
- CollectionSelector: Collection picker interface
- QuickQuestions: Pre-defined question buttons
- CollectionOverview: Collections management tab
- QueryInterface: Query processing and response display

utils.py - Utility functions
- File handling utilities
- Metadata formatting helpers
- Progress tracking utilities
- UI display helpers

callbacks.py - Custom callback handlers
- StreamlitProgressCallback: Real-time progress tracking
- LoggingCallback: Chain execution logging
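
The two-level chunking listed under document_processor.py can be pictured with a small sketch. This is not the project's implementation: the Chunk dataclass, the hierarchical_chunks function, and the fixed-size split (standing in for the semantic chunker) are all assumptions made for illustration. It only shows the idea of one document-level chunk plus section-level chunks that point back to it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative only."""
    text: str
    level: str                       # "document" or "section"
    chunk_id: str
    parent_id: Optional[str] = None  # section chunks point at their document chunk


def hierarchical_chunks(doc_id: str, text: str, section_size: int = 512) -> List[Chunk]:
    """Return one document-level chunk followed by section-level chunks.

    A fixed-size character split stands in for the semantic chunking the
    README describes; only the parent/child structure is the point here.
    """
    chunks = [Chunk(text=text, level="document", chunk_id=doc_id)]
    for i, start in enumerate(range(0, len(text), section_size)):
        chunks.append(
            Chunk(
                text=text[start:start + section_size],
                level="section",
                chunk_id=f"{doc_id}-s{i}",
                parent_id=doc_id,
            )
        )
    return chunks
```

Keeping a parent_id on each section chunk is one way a retriever could match on a section and then pull in the surrounding document-level context.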

🚀 Features

Enhanced Functionality

- Query without upload: Use existing collections without uploading new files
- Collection selector: Easy switching between tenant/document type combinations
- Progress tracking: Detailed progress indicators for all operations
- Hierarchical chunking: Document and section-level chunks for better retrieval
- Duplicate detection: Prevents processing the same document twice
- Collection management: Overview and statistics for all collections
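
Duplicate detection of the kind listed above is commonly done by hashing file contents. The sketch below assumes a SHA-256 content hash and an in-memory set of seen digests. The names file_hash and DuplicateTracker are hypothetical, and the project's own generate_file_hash in utils.py may work differently.

```python
import hashlib
from typing import Set


def file_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file bytes (assumed hashing scheme)."""
    return hashlib.sha256(data).hexdigest()


class DuplicateTracker:
    """Hypothetical helper that remembers which file contents were processed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_new(self, data: bytes) -> bool:
        """Return True the first time given content is seen, False afterwards."""
        digest = file_hash(data)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

In a persistent setup the digest would more likely be stored with the chunk metadata so that the check survives restarts; the in-memory set here just keeps the example short.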

Improved Code Quality

- Modular architecture: Clean separation of concerns
- Type hints: Better code documentation and IDE support
- Error handling: Robust error management throughout
- Configuration management: Centralized settings and prompts
- Reusable components: DRY principle implementation

🔧 Configuration

Key Settings (config.py)

# Database settings
CHROMA_PERSIST_DIR = "./chroma_db"

# Document processing
SEMANTIC_CHUNK_THRESHOLD = 75
RECURSIVE_CHUNK_SIZE = 512
BATCH_SIZE = 50

# LLM settings
DEFAULT_MODEL = "deepseek-r1:8b"

Customizable Prompts

The invoice analysis prompt is centralized in PromptTemplates.INVOICE_QA_PROMPT and can be easily modified for different use cases.
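
As a rough illustration of what a centralized prompt can look like, the sketch below defines a template with two placeholders and a helper that fills them. The wording of the template, the placeholder names context and question, and the build_prompt helper are assumptions. The actual PromptTemplates.INVOICE_QA_PROMPT in config.py is not reproduced here.

```python
class PromptTemplates:
    """Illustrative stand-in for the class of the same name in config.py."""

    # Hypothetical template text; placeholders are filled with str.format.
    INVOICE_QA_PROMPT = (
        "You are an assistant answering questions about invoices.\n"
        "Use only the context below. If the answer is not in the context, say so.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    )


def build_prompt(context: str, question: str) -> str:
    """Fill the template with retrieved context and the user's question."""
    return PromptTemplates.INVOICE_QA_PROMPT.format(context=context, question=question)
```

Adapting the assistant to another document type would then mostly mean editing the template string in one place.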

🎯 Usage

Basic Usage

# Run the application (app.py is a Streamlit app, so launch it with the Streamlit CLI)
streamlit run app.py

Using Individual Modules

# Process documents
# `file`, `tenant_id`, and `doc_type` must be supplied by the caller
from document_processor import DocumentProcessor
processor = DocumentProcessor()
db, documents = processor.process_documents_with_progress(file, tenant_id, doc_type)

# Manage collections
from database_manager import DatabaseManager
db_manager = DatabaseManager()
collections = db_manager.get_available_collections()

# Use UI components
from ui_components import CollectionSelector
selector = CollectionSelector(collections)
tenant_id, doc_type = selector.render()

🔍 Key Improvements

1. Maintainability

- Separated concerns into focused modules
- Clear interfaces between components
- Easy to test individual components

2. Extensibility

- Easy to add new document types
- Pluggable UI components
- Configurable processing parameters

3. Reusability

- Components can be used independently
- Utility functions are module-agnostic
- Clean API design

4. Performance

- Efficient batch processing
- Progress tracking for better UX
- Optimized database operations
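
Batch processing with progress reporting can be sketched as follows. The batched helper and the progress arithmetic are illustrative assumptions. Only the batch size of 50 comes from the settings shown in config.py, and the real vector-store insertion code is not shown.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

BATCH_SIZE = 50  # mirrors the BATCH_SIZE setting listed under Configuration


def batched(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[Tuple[List[T], float]]:
    """Yield (batch, progress) pairs, where progress is the fraction done
    once that batch has been handled."""
    if size <= 0:
        raise ValueError("size must be positive")
    total = len(items)
    for start in range(0, total, size):
        batch = list(items[start:start + size])
        done = min(start + size, total)
        yield batch, done / total
```

A caller could pass each batch to the vector store and forward the progress value to a progress bar, so that large documents are neither sent in a single request nor processed silently.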

📊 Module Dependencies

app.py
├── config.py
├── database_manager.py
│   └── utils.py
├── document_processor.py
│   └── utils.py
├── ui_components.py
│   ├── utils.py
│   └── database_manager.py
└── callbacks.py
    └── config.py

🧪 Testing

Each module can be tested independently:

# Test document processing
from document_processor import DocumentProcessor
processor = DocumentProcessor()

# Test database operations
from database_manager import DatabaseManager
db_manager = DatabaseManager()

# Test utilities
from utils import generate_file_hash, create_collection_name

🤝 Contributing

When adding new features:

- Follow the modular pattern: Add functionality to the appropriate module
- Update configuration: Add new settings to config.py
- Create reusable components: Add UI components to ui_components.py
- Add utilities: Common functions go in utils.py
- Document changes: Update this README and add docstrings

📝 Migration Notes

From Original Version

- All functionality is preserved
- Configuration is now centralized
- UI components are reusable
- Error handling and progress tracking are improved
- The original app_original.py is kept as a backup

Breaking Changes

- Import structure changed (now uses custom modules)
- Some internal function names changed
- Configuration moved to config.py

This refactored version maintains all original functionality while providing a much cleaner, more maintainable, and extensible codebase.

JonathanAriass/RAG
