Enhance data_types.py with vector database optimized pydantic models #211

@manavgup

📋 Overview

Enhance the pydantic data models in vectordbs/data_types.py to provide vector database optimized structures that eliminate manual parsing and improve type safety across all vector database implementations.

🎯 Goals

  • Eliminate manual dict parsing in vector DB implementations
  • Provide type-safe request/response models
  • Enable better error handling with structured responses
  • Standardize serialization/deserialization patterns
  • Improve developer experience with IDE completion and validation

🔧 Technical Specifications

New Pydantic Models to Add

1. EmbeddedChunk Model

class EmbeddedChunk(DocumentChunk):
    """Chunk guaranteed to have embeddings for vector DB storage."""
    embeddings: List[float]  # Required, not optional

    @classmethod
    def from_chunk(cls, chunk: DocumentChunk) -> 'EmbeddedChunk':
        """Convert DocumentChunk to EmbeddedChunk with validation."""

    def to_vector_metadata(self) -> Dict[str, Any]:
        """Serialize metadata for vector DB storage."""

    def to_vector_db(self) -> Dict[str, Any]:
        """Complete serialization for vector DB insertion."""

2. Request/Response Models

class DocumentIngestionRequest(BaseModel):
    """Request for adding documents to vector DB."""
    collection_name: str
    documents: List[Document]
    batch_size: Optional[int] = 100

    def get_embedded_chunks(self) -> List[EmbeddedChunk]:
        """Extract all chunks that have embeddings."""

class VectorSearchRequest(BaseModel):
    """Standardized search request."""
    collection_name: str
    query: Union[str, List[float]]  # Text or embedding vector
    filters: Optional[DocumentMetadataFilter] = None
    limit: int = 10
    include_metadata: bool = True

class CollectionConfig(BaseModel):
    """Vector DB collection configuration."""
    name: str
    dimension: int
    metric: VectorMetric = VectorMetric.COSINE
    cloud_provider: Optional[str] = None
    region: Optional[str] = None

    def validate_for_vector_db(self, db_type: str) -> None:
        """Validate configuration for specific vector DB type."""

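As a rough illustration of how `validate_for_vector_db` might catch configuration errors early, here is a minimal sketch. The `VectorMetric` members, the `REQUIRED_FIELDS` table, and the `managed_cloud_db` backend name are all hypothetical placeholders; real per-backend rules would have to come from each vector DB's documentation.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class VectorMetric(str, Enum):
    """Stand-in for the VectorMetric enum referenced in the spec (members assumed)."""
    COSINE = "cosine"
    EUCLIDEAN = "euclidean"
    DOT_PRODUCT = "dot_product"


# Hypothetical rule table: which optional fields a given backend requires.
REQUIRED_FIELDS = {"managed_cloud_db": ("cloud_provider", "region")}


class CollectionConfig(BaseModel):
    """Vector DB collection configuration."""
    name: str = Field(min_length=1)
    dimension: int = Field(gt=0)  # rejected at construction time if <= 0
    metric: VectorMetric = VectorMetric.COSINE
    cloud_provider: Optional[str] = None
    region: Optional[str] = None

    def validate_for_vector_db(self, db_type: str) -> None:
        """Raise ValueError if fields required by `db_type` are missing."""
        required = REQUIRED_FIELDS.get(db_type, ())
        missing = [name for name in required if getattr(self, name) is None]
        if missing:
            raise ValueError(f"{db_type} requires: {', '.join(missing)}")


config = CollectionConfig(name="docs", dimension=384)
config.validate_for_vector_db("local_db")  # no extra requirements, passes
```

Field-level constraints (`gt=0`, `min_length=1`) fail at construction, while backend-specific checks stay in the explicit method so one config object can be checked against several backends.
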
3. Generic Response Models

T = TypeVar("T")

class VectorDBResponse(BaseModel, Generic[T]):
    """Standardized response wrapper for all vector DB operations."""
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    # The factory methods need names distinct from the `success` and `error`
    # fields: a method and a field cannot share a name on the same class.
    @classmethod
    def create_success(cls, data: T, metadata: Optional[Dict[str, Any]] = None) -> 'VectorDBResponse[T]':
        """Create success response."""

    @classmethod
    def create_error(cls, message: str, metadata: Optional[Dict[str, Any]] = None) -> 'VectorDBResponse[T]':
        """Create error response."""

# Type aliases for common responses
DocumentIngestionResponse = VectorDBResponse[List[str]]
CollectionResponse = VectorDBResponse[str]
SearchResponse = VectorDBResponse[List[QueryResult]]
HealthCheckResponse = VectorDBResponse[Dict[str, Any]]
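To make the generic wrapper concrete, the following is a minimal runnable sketch using Pydantic v2 generics. The factory names `create_success` and `create_error` are an assumption, chosen so they do not collide with the `success` and `error` fields; the method bodies are illustrative, not a settled design.

```python
from typing import Any, Dict, Generic, List, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class VectorDBResponse(BaseModel, Generic[T]):
    """Standardized response wrapper for all vector DB operations."""
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def create_success(
        cls, data: T, metadata: Optional[Dict[str, Any]] = None
    ) -> "VectorDBResponse[T]":
        """Build a response carrying a payload and no error."""
        return cls(success=True, data=data, metadata=metadata)

    @classmethod
    def create_error(
        cls, message: str, metadata: Optional[Dict[str, Any]] = None
    ) -> "VectorDBResponse[T]":
        """Build a response carrying an error message and no payload."""
        return cls(success=False, error=message, metadata=metadata)


# Parametrizing the generic gives each alias its own validated payload type.
DocumentIngestionResponse = VectorDBResponse[List[str]]

ok = DocumentIngestionResponse.create_success(["doc-1", "doc-2"])
failed = DocumentIngestionResponse.create_error("collection not found")
```

Callers can then branch on `response.success` instead of inspecting raw dicts, and `data` is validated against the alias's type parameter.
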

✅ Acceptance Criteria

Functional Requirements

  • All new pydantic models validate correctly with proper type hints
  • EmbeddedChunk enforces non-null embeddings at creation time
  • Request models provide convenient methods for data extraction
  • Response models support both success and error states
  • Serialization methods produce vector DB compatible formats
  • Validation methods catch configuration errors early
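The first few requirements above could be met along the following lines. This is only a sketch: the `DocumentChunk` and `Document` classes are simplified stand-ins for the existing models (their real fields may differ), and the `id` / `values` / `metadata` keys in `to_vector_db` are an assumed target format rather than any particular backend's schema.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class DocumentChunk(BaseModel):
    """Simplified stand-in for the existing DocumentChunk model."""
    chunk_id: str
    text: str
    embeddings: Optional[List[float]] = None


class Document(BaseModel):
    """Simplified stand-in for the existing Document model."""
    document_id: str
    chunks: List[DocumentChunk] = Field(default_factory=list)


class EmbeddedChunk(DocumentChunk):
    """Chunk whose embeddings are required and non-empty."""
    embeddings: List[float] = Field(min_length=1)

    @classmethod
    def from_chunk(cls, chunk: DocumentChunk) -> "EmbeddedChunk":
        # Validation fails here if embeddings are None or empty.
        return cls(**chunk.model_dump())

    def to_vector_db(self) -> Dict[str, Any]:
        # Key names are an assumption, not a specific backend's schema.
        return {
            "id": self.chunk_id,
            "values": self.embeddings,
            "metadata": {"text": self.text},
        }


class DocumentIngestionRequest(BaseModel):
    """Request for adding documents to a vector DB."""
    collection_name: str
    documents: List[Document]
    batch_size: int = Field(default=100, gt=0)

    def get_embedded_chunks(self) -> List[EmbeddedChunk]:
        """Return only the chunks that already carry embeddings."""
        return [
            EmbeddedChunk.from_chunk(chunk)
            for document in self.documents
            for chunk in document.chunks
            if chunk.embeddings
        ]
```

A vector DB implementation could then call `request.get_embedded_chunks()` and map `to_vector_db()` over the result, with no manual dict parsing on its side.
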

Technical Requirements

  • All models inherit from BaseModel with proper type annotations
  • Use pydantic v2 features (Field, computed fields, validators)
  • Include comprehensive docstrings with examples
  • Support JSON serialization/deserialization
  • Maintain backward compatibility with existing Document/DocumentChunk usage
  • Add proper __repr__ methods for debugging
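As one possible reading of the Pydantic v2 items above, the sketch below applies `Field` constraints, a `field_validator`, and a `computed_field` to `VectorSearchRequest`, then round-trips it through JSON. The `is_vector_query` property and the empty-query rule are illustrative assumptions, and the `filters` field is omitted because `DocumentMetadataFilter` is not defined here.

```python
from typing import List, Union

from pydantic import BaseModel, Field, computed_field, field_validator


class VectorSearchRequest(BaseModel):
    """Standardized search request (filters omitted in this sketch)."""
    collection_name: str = Field(min_length=1)
    query: Union[str, List[float]]  # text or embedding vector
    limit: int = Field(default=10, gt=0)
    include_metadata: bool = True

    @field_validator("query")
    @classmethod
    def query_not_empty(cls, value: Union[str, List[float]]) -> Union[str, List[float]]:
        # Applies to both variants: empty string or empty vector.
        if len(value) == 0:
            raise ValueError("query must not be empty")
        return value

    @computed_field
    @property
    def is_vector_query(self) -> bool:
        """True when the query is already an embedding vector."""
        return isinstance(self.query, list)


request = VectorSearchRequest(collection_name="docs", query=[0.1, 0.2, 0.3])

# JSON round trip: the computed field is emitted on output and, with the
# default model config, ignored as an unknown key on input.
payload = request.model_dump_json()
restored = VectorSearchRequest.model_validate_json(payload)
```
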

Testing Requirements

  • Unit tests for all new models and methods
  • Validation testing for edge cases and error conditions
  • Serialization/deserialization round-trip tests
  • Performance tests for large document batches
  • Integration tests with existing codebase

🔄 Implementation Details

File Changes

  • vectordbs/data_types.py - Primary implementation
  • tests/unit/test_data_types.py - Comprehensive test coverage
  • tests/integration/test_vector_models.py - Integration testing

Dependencies

  • Pydantic v2
  • Python 3.12+ type hints
  • No breaking changes to existing models

Migration Strategy

  • All new models are additive - existing code continues to work
  • Provide utility functions to convert between old and new patterns
  • Gradual adoption in downstream code

🧪 Testing Strategy

Unit Tests

def test_embedded_chunk_validation():
    """Test EmbeddedChunk requires embeddings."""

def test_ingestion_request_chunk_filtering():
    """Test DocumentIngestionRequest filters embedded chunks correctly."""

def test_response_model_serialization():
    """Test VectorDBResponse serializes properly."""

def test_collection_config_validation():
    """Test CollectionConfig validates for different vector DBs."""

📊 Success Metrics

  • Zero manual dict parsing in vector DB implementations after adoption
  • 100% type coverage with mypy
  • <100ms serialization time for 1000 document chunks
  • Backward compatibility maintained for all existing usages
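The serialization target could be checked with a small timing helper along these lines. This is a sketch under assumptions: the `Chunk` model is a stand-in for the real chunk model, the 384-dimension default is arbitrary, and a real performance test would repeat the measurement and run on representative data.

```python
import time
from typing import List

from pydantic import BaseModel


class Chunk(BaseModel):
    """Stand-in chunk model used only for timing."""
    chunk_id: str
    text: str
    embeddings: List[float]


def time_serialization(n_chunks: int = 1000, dimension: int = 384) -> float:
    """Return the milliseconds taken to serialize `n_chunks` chunks to dicts."""
    chunks = [
        Chunk(chunk_id=f"c{i}", text="lorem ipsum", embeddings=[0.0] * dimension)
        for i in range(n_chunks)
    ]
    start = time.perf_counter()
    payloads = [chunk.model_dump() for chunk in chunks]
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert len(payloads) == n_chunks  # sanity check: nothing was dropped
    return elapsed_ms


if __name__ == "__main__":
    print(f"1000 chunks serialized in {time_serialization():.1f} ms")
```

Only the `model_dump()` calls sit inside the timed region, so model construction cost does not count against the serialization budget.
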

🔗 Related Issues

  • Depends on: None (foundational work)
  • Blocks: Enhanced VectorStore Base Class (TBD)
  • Blocks: Vector DB Implementation Refactoring (TBD)

Priority: High
Estimated Effort: Medium (3-5 days)
Risk Level: Low (additive changes only)


Labels: enhancement, vector-db
