Status: Closed
Labels: enhancement, vector-db
📋 Overview
Enhance the Pydantic data models in vectordbs/data_types.py with structures optimized for vector databases, so that vector database implementations no longer parse dicts by hand and gain type safety across the board.
🎯 Goals
- Eliminate manual dict parsing in vector DB implementations
- Provide type-safe request/response models
- Enable better error handling with structured responses
- Standardize serialization/deserialization patterns
- Improve developer experience with IDE completion and validation
🔧 Technical Specifications
New Pydantic Models to Add
1. EmbeddedChunk Model
class EmbeddedChunk(DocumentChunk):
    """Chunk guaranteed to have embeddings for vector DB storage"""

    embeddings: List[float]  # Required, not optional

    @classmethod
    def from_chunk(cls, chunk: DocumentChunk) -> 'EmbeddedChunk':
        """Convert DocumentChunk to EmbeddedChunk with validation"""

    def to_vector_metadata(self) -> Dict[str, Any]:
        """Serialize metadata for vector DB storage"""

    def to_vector_db(self) -> Dict[str, Any]:
        """Complete serialization for vector DB insertion"""

2. Request/Response Models
class DocumentIngestionRequest(BaseModel):
    """Request for adding documents to vector DB"""

    collection_name: str
    documents: List[Document]
    batch_size: Optional[int] = 100

    def get_embedded_chunks(self) -> List[EmbeddedChunk]:
        """Extract all chunks that have embeddings"""

class VectorSearchRequest(BaseModel):
    """Standardized search request"""

    collection_name: str
    query: Union[str, List[float]]  # Text or embedding vector
    filters: Optional[DocumentMetadataFilter] = None
    limit: int = 10
    include_metadata: bool = True

class CollectionConfig(BaseModel):
    """Vector DB collection configuration"""

    name: str
    dimension: int
    metric: VectorMetric = VectorMetric.COSINE
    cloud_provider: Optional[str] = None
    region: Optional[str] = None

    def validate_for_vector_db(self, db_type: str) -> None:
        """Validate configuration for specific vector DB type"""

3. Generic Response Models
from typing import Any, Dict, Generic, List, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")  # Payload type carried by a response

class VectorDBResponse(BaseModel, Generic[T]):
    """Standardized response wrapper for all vector DB operations"""

    success: bool
    data: Optional[T] = None
    error: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    # The factory names differ from the `success` and `error` fields above:
    # a classmethod with the same name would overwrite the field definition.
    @classmethod
    def create_success(cls, data: T, metadata: Optional[Dict[str, Any]] = None) -> 'VectorDBResponse[T]':
        """Create success response"""

    @classmethod
    def create_error(cls, message: str, metadata: Optional[Dict[str, Any]] = None) -> 'VectorDBResponse[T]':
        """Create error response"""

# Type aliases for common responses
DocumentIngestionResponse = VectorDBResponse[List[str]]
CollectionResponse = VectorDBResponse[str]
SearchResponse = VectorDBResponse[List[QueryResult]]
HealthCheckResponse = VectorDBResponse[Dict[str, Any]]

✅ Acceptance Criteria
Functional Requirements
- All new Pydantic models validate correctly with proper type hints
- EmbeddedChunk enforces non-null embeddings at creation time
- Request models provide convenient methods for data extraction
- Response models support both success and error states
- Serialization methods produce vector DB compatible formats
- Validation methods catch configuration errors early
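To make the "non-null embeddings at creation time" requirement concrete, here is a minimal sketch of how EmbeddedChunk could enforce it. The DocumentChunk below is a stand-in: its field names (chunk_id, text, metadata) are assumptions, not the real model in vectordbs/data_types.py. The output keys of to_vector_db (id, values, metadata) and the rule that an empty embedding list is rejected are likewise illustrative choices.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field, ValidationError


class DocumentChunk(BaseModel):
    """Stand-in for the existing model; these field names are assumptions."""

    chunk_id: str
    text: str
    embeddings: Optional[List[float]] = None
    metadata: Dict[str, Any] = Field(default_factory=dict)


class EmbeddedChunk(DocumentChunk):
    """Chunk guaranteed to have embeddings for vector DB storage."""

    # Narrow the parent's optional field to a required, non-empty list.
    # Rejecting empty lists (min_length=1) is an assumption of this sketch.
    embeddings: List[float] = Field(..., min_length=1)

    @classmethod
    def from_chunk(cls, chunk: DocumentChunk) -> "EmbeddedChunk":
        """Re-validate an existing chunk; raises ValidationError without embeddings."""
        return cls.model_validate(chunk.model_dump())

    def to_vector_metadata(self) -> Dict[str, Any]:
        """Flatten the chunk's non-vector fields into one metadata dict."""
        return {"chunk_id": self.chunk_id, "text": self.text, **self.metadata}

    def to_vector_db(self) -> Dict[str, Any]:
        """Build an insertion record; the key names are illustrative only."""
        return {
            "id": self.chunk_id,
            "values": self.embeddings,
            "metadata": self.to_vector_metadata(),
        }


plain = DocumentChunk(chunk_id="c1", text="hello", embeddings=[0.1, 0.2])
embedded = EmbeddedChunk.from_chunk(plain)

try:
    EmbeddedChunk.from_chunk(DocumentChunk(chunk_id="c2", text="no vector"))
    rejected = False
except ValidationError:
    rejected = True
```

Because the check lives in the model's field definition, any code path that constructs an EmbeddedChunk gets the same guarantee, not only from_chunk.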
Technical Requirements
- All models inherit from BaseModel with proper type annotations
- Use Pydantic v2 features (Field, computed fields, validators)
- Include comprehensive docstrings with examples
- Support JSON serialization/deserialization
- Maintain backward compatibility with existing Document/DocumentChunk usage
- Add proper __repr__ methods for debugging
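As a rough illustration of the generic wrapper and the JSON round-trip requirement, the sketch below implements VectorDBResponse with Pydantic v2 generics. The factory names create_success and create_error are hypothetical: they are chosen so they do not collide with the success and error fields, since a classmethod with the same name as a field would replace that field on the class.

```python
from typing import Any, Dict, Generic, List, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class VectorDBResponse(BaseModel, Generic[T]):
    """Standardized response wrapper for all vector DB operations."""

    success: bool
    data: Optional[T] = None
    error: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def create_success(
        cls, data: T, metadata: Optional[Dict[str, Any]] = None
    ) -> "VectorDBResponse[T]":
        """Build a response carrying data and no error."""
        return cls(success=True, data=data, metadata=metadata)

    @classmethod
    def create_error(
        cls, message: str, metadata: Optional[Dict[str, Any]] = None
    ) -> "VectorDBResponse[T]":
        """Build a response carrying an error message and no data."""
        return cls(success=False, error=message, metadata=metadata)


# Parametrizing the wrapper makes Pydantic validate `data` as List[str].
DocumentIngestionResponse = VectorDBResponse[List[str]]

resp = DocumentIngestionResponse.create_success(["doc-1", "doc-2"])
payload = resp.model_dump_json()
restored = DocumentIngestionResponse.model_validate_json(payload)

failed = DocumentIngestionResponse.create_error("collection not found")
```

Callers can branch on resp.success and read either data or error, instead of inspecting raw dicts returned by each vector DB client.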
Testing Requirements
- Unit tests for all new models and methods
- Validation testing for edge cases and error conditions
- Serialization/deserialization round-trip tests
- Performance tests for large document batches
- Integration tests with existing codebase
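The validation and edge-case tests could look roughly like the sketch below, which exercises a simplified CollectionConfig. Several details are assumptions made for illustration: the extra VectorMetric members, the dimension > 0 constraint, and the rule that a db_type of "pinecone" requires cloud_provider and region. The real per-database rules would come from the implementation, not from this example.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class VectorMetric(str, Enum):
    """Stand-in enum; only COSINE appears in the spec, the rest are assumptions."""

    COSINE = "cosine"
    EUCLIDEAN = "euclidean"
    DOT_PRODUCT = "dot_product"


class CollectionConfig(BaseModel):
    """Vector DB collection configuration (simplified for the test sketch)."""

    name: str
    dimension: int = Field(..., gt=0)  # positive-dimension rule is an assumption
    metric: VectorMetric = VectorMetric.COSINE
    cloud_provider: Optional[str] = None
    region: Optional[str] = None

    def validate_for_vector_db(self, db_type: str) -> None:
        # Illustrative rule only: require cloud placement for "pinecone".
        if db_type == "pinecone" and not (self.cloud_provider and self.region):
            raise ValueError("pinecone collections need cloud_provider and region")


def test_rejects_non_positive_dimension() -> None:
    try:
        CollectionConfig(name="docs", dimension=0)
    except ValidationError:
        return
    raise AssertionError("dimension=0 should be rejected")


def test_db_specific_validation() -> None:
    config = CollectionConfig(name="docs", dimension=768)
    config.validate_for_vector_db("milvus")  # no placement needed in this sketch
    try:
        config.validate_for_vector_db("pinecone")
    except ValueError:
        return
    raise AssertionError("missing cloud_provider/region should be rejected")


def test_config_round_trip() -> None:
    config = CollectionConfig(name="docs", dimension=768, metric=VectorMetric.EUCLIDEAN)
    assert CollectionConfig.model_validate_json(config.model_dump_json()) == config
```

The tests use plain try/except so they run without any test framework; under pytest the same checks would typically use pytest.raises.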
🔄 Implementation Details
File Changes
- vectordbs/data_types.py - Primary implementation
- tests/unit/test_data_types.py - Comprehensive test coverage
- tests/integration/test_vector_models.py - Integration testing
Dependencies
- Pydantic v2
- Python 3.12+ type hints
- No breaking changes to existing models
Migration Strategy
- All new models are additive - existing code continues to work
- Provide utility functions to convert between old and new patterns
- Gradual adoption in downstream code
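One possible shape for the conversion utilities is a pair of thin helpers around Pydantic's own validation and dumping. The helper names search_request_from_dict and search_request_to_dict are hypothetical, and filters is typed as a plain dict here as a stand-in for DocumentMetadataFilter.

```python
from typing import Any, Dict, List, Optional, Union

from pydantic import BaseModel


class VectorSearchRequest(BaseModel):
    """Standardized search request (filters simplified to a dict for this sketch)."""

    collection_name: str
    query: Union[str, List[float]]  # Text or embedding vector
    filters: Optional[Dict[str, Any]] = None
    limit: int = 10
    include_metadata: bool = True


def search_request_from_dict(params: Dict[str, Any]) -> VectorSearchRequest:
    """Hypothetical helper: turn an old dict-style call into a typed request."""
    return VectorSearchRequest.model_validate(params)


def search_request_to_dict(request: VectorSearchRequest) -> Dict[str, Any]:
    """Hypothetical helper: hand a typed request to code that still expects a dict."""
    return request.model_dump(exclude_none=True)


legacy_params = {"collection_name": "docs", "query": "what is a vector index?"}
request = search_request_from_dict(legacy_params)
as_dict = search_request_to_dict(request)
```

Existing call sites can keep passing dicts while new code adopts the typed model, which fits the gradual adoption described above.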
🧪 Testing Strategy
Unit Tests
def test_embedded_chunk_validation():
    """Test EmbeddedChunk requires embeddings"""

def test_ingestion_request_chunk_filtering():
    """Test DocumentIngestionRequest filters embedded chunks correctly"""

def test_response_model_serialization():
    """Test VectorDBResponse serializes properly"""

def test_collection_config_validation():
    """Test CollectionConfig validates for different vector DBs"""

📊 Success Metrics
- Zero manual dict parsing in vector DB implementations after adoption
- 100% type coverage with mypy
- <100ms serialization time for 1000 document chunks
- Backward compatibility maintained for all existing usages
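A quick way to sanity-check the serialization target is to time to_vector_db over 1000 chunks. The sketch below uses a simplified EmbeddedChunk with assumed field names and an assumed embedding size of 384. It only reports the elapsed time, and the result will vary by machine, so treat it as a smoke test and not a benchmark.

```python
import time
from typing import Any, Dict, List, Tuple

from pydantic import BaseModel


class EmbeddedChunk(BaseModel):
    """Simplified stand-in; field names and record layout are assumptions."""

    chunk_id: str
    text: str
    embeddings: List[float]

    def to_vector_db(self) -> Dict[str, Any]:
        return {
            "id": self.chunk_id,
            "values": self.embeddings,
            "metadata": {"text": self.text},
        }


def time_serialization(chunks: List[EmbeddedChunk]) -> Tuple[List[Dict[str, Any]], float]:
    """Serialize every chunk and return the records plus elapsed milliseconds."""
    start = time.perf_counter()
    records = [chunk.to_vector_db() for chunk in chunks]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return records, elapsed_ms


# 384 dimensions is an arbitrary choice for the sketch.
chunks = [
    EmbeddedChunk(chunk_id=f"c{i}", text="lorem ipsum", embeddings=[0.0] * 384)
    for i in range(1000)
]
records, elapsed_ms = time_serialization(chunks)
print(f"serialized {len(records)} chunks in {elapsed_ms:.2f} ms")
```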
🔗 Related Issues
- Depends on: None (foundational work)
- Blocks: Enhanced VectorStore Base Class (TBD)
- Blocks: Vector DB Implementation Refactoring (TBD)
Priority: High
Estimated Effort: Medium (3-5 days)
Risk Level: Low (additive changes only)