Commit 5485537

manavgup and claude authored
fix: Improve chunking robustness and type safety (#474)
* fix: Improve chunking robustness and type safety

This commit addresses three critical issues discovered during investigation of poor search result accuracy and chunking behavior:

## 1. Fix Oversized Sentence Handling in Chunking (Issue #1)

- **Problem**: Markdown tables and long sentences caused chunks up to 24,654 chars (exceeding the WatsonX embedding 512-token limit)
- **Root cause**: sentence_based_chunking() added entire sentences regardless of size
- **Fix**: Split oversized sentences at word boundaries before adding to chunks
- **Impact**: Max chunk reduced from 24,654 to 596 chars (~238 tokens)
- **File**: backend/rag_solution/data_ingestion/chunking.py:195-217

## 2. Fix Configuration Consistency Across Chunking Strategies (Issue #2)

- **Problem**: sentence_chunker() multiplied config by 2.5 (assumed tokens), while other strategies used values as characters directly
- **Root cause**: Inconsistent interpretation across chunking strategies
- **Fix**: Standardized ALL strategies to use CHARACTERS, removed the 2.5x multiplier
- **Impact**: Predictable, maintainable configuration across all strategies
- **File**: backend/rag_solution/data_ingestion/chunking.py:409-414

## 3. Fix Type Safety in LLM Model Repository (Issue #3)

- **Problem**: update_model() used duck typing with hasattr() and dict type erasure
- **Root cause**: Poor type safety, no IDE autocomplete, runtime errors possible
- **Fix**: Changed to accept only the LLMModelInput Pydantic type; use model_dump(exclude_unset=True)
- **Impact**: Better type checking, maintainability, IDE support
- **File**: backend/rag_solution/repository/llm_model_repository.py:69-92

## 4. Add Strict Typing Guidelines (New)

- Comprehensive documentation for type safety best practices
- Covers Pydantic models, type hints, mypy configuration
- **File**: docs/development/backend/strict-typing-guidelines.md

## Testing

- Chunking: Validated max chunk size reduced from 24,654 to 596 chars
- Type safety: All mypy checks pass
- Embedding comparison: Tested 8 models (IBM Slate, Granite, E5, MiniLM)

## Related Issues

- Addresses root causes discovered while investigating GitHub #461 (CoT reasoning)
- Created follow-up issues #465-473 for remaining search accuracy problems

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Address PR feedback - type safety, tests, and error handling

This commit addresses all feedback from PR #474:

## 1. Fix Type Mismatch in Service Layer (CRITICAL)

- **Problem**: llm_model_service.py passed dict[str, Any] to a repository expecting LLMModelInput
- **Fix**: Convert dict updates to LLMModelInput Pydantic models in the service layer
- **Files**: backend/rag_solution/services/llm_model_service.py:67, 102-133
- **Impact**: Type-safe service-to-repository communication, prevents runtime errors

## 2. Add Rollback for All Exception Types

- **Problem**: NotFoundError/ValidationError raised without rollback (potential transaction leaks)
- **Fix**: Added explicit rollback for all exception types in update_model()
- **File**: backend/rag_solution/repository/llm_model_repository.py:96-99
- **Impact**: Safer transaction handling, prevents DB inconsistencies

## 3. Handle Empty Strings After .strip()

- **Problem**: Oversized sentence splitting could create empty chunks after stripping
- **Fix**: Added a check to skip empty chunks (lines 215-216)
- **File**: backend/rag_solution/data_ingestion/chunking.py:214-216
- **Impact**: Prevents empty chunks from being stored in the vector DB

## 4. Add Unit Tests for Oversized Sentence Splitting

- **New tests**: 5 comprehensive tests for the Issue #1 fix
  - test_oversized_sentence_splits_at_word_boundaries
  - test_markdown_table_splits_correctly
  - test_very_long_sentence_without_spaces
  - test_normal_sentences_not_affected
  - test_empty_string_chunks_are_filtered
- **File**: backend/tests/unit/test_chunking.py:29-102
- **Coverage**: Edge cases for oversized sentences, markdown tables, whitespace handling

## Testing

- All linting checks pass (ruff; mypy type annotations added)
- All modified files pass type checking
- New unit tests validate oversized sentence handling

Addresses feedback from: #474 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor: Replace dict[str, Any] with strongly-typed Update schemas

## Problem

API contracts used weakly-typed `dict[str, Any]` for partial updates, violating the project's strong typing principles. This created multiple issues:

1. **Type safety**: dict[str, Any] bypasses Pydantic validation
2. **Partial updates**: No clear API contract for which fields are optional
3. **Duplicate queries**: Service layer fetched models, then the repository fetched again
4. **Return type ambiguity**: Methods returned `Output | None` instead of raising exceptions

## Solution

Created strongly-typed Update schemas with all-optional fields for partial updates:

### LLM Models

- Created `LLMModelUpdate` schema (all fields optional)
- Updated `LLMModelRepository.update_model()` to accept `LLMModelUpdate`
- Updated `LLMModelService.update_model()` to use `LLMModelUpdate`
- Updated router to accept `LLMModelUpdate` instead of `dict`
- Changed return types from `Output | None` to `Output` (let NotFoundError propagate)

### LLM Providers

- Created `LLMProviderUpdate` schema (all fields optional)
- Updated `LLMProviderRepository.update_provider()` to accept `LLMProviderUpdate`
- Updated `LLMProviderService.update_provider()` to use `LLMProviderUpdate`
- Updated router to accept `LLMProviderUpdate` instead of `dict`
- Changed return type from `Output | None` to `Output`

### Service Layer Improvements

- Removed duplicate DB fetches (repository now handles a single query)
- Eliminated manual field merging (Pydantic's `exclude_unset=True` handles it)
- Simplified service methods from 30 lines to 1-3 lines
- All methods now use proper exception propagation instead of returning None

## Impact

- **Type safety**: Full Pydantic validation on all partial updates
- **Performance**: Eliminated duplicate queries (2 DB calls → 1 DB call)
- **Maintainability**: Update schemas automatically track model changes
- **API clarity**: Clear contract for partial updates via Update schemas

## Files Changed

- rag_solution/schemas/llm_model_schema.py: Added LLMModelUpdate
- rag_solution/schemas/llm_provider_schema.py: Added LLMProviderUpdate
- rag_solution/repository/llm_model_repository.py: Use LLMModelUpdate
- rag_solution/repository/llm_provider_repository.py: Use LLMProviderUpdate
- rag_solution/services/llm_model_service.py: Simplified with LLMModelUpdate
- rag_solution/services/llm_provider_service.py: Simplified with LLMProviderUpdate
- rag_solution/router/llm_provider_router.py: Use Update schemas

Addresses PR feedback: #474 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* security: Remove debug statements that expose API keys to stdout

CRITICAL SECURITY FIX - This is a blocking issue that must be merged.

## Problem

Lines 71 and 73 in llm_provider_schema.py contained debug print statements that would expose API keys to stdout in plain text:

```python
print(f"DEBUG: Converting API key '{v}' to SecretStr")
print(f"DEBUG: API key is not a string: {type(v)} = {v}")
```

This violates the principle of keeping secrets secure and could lead to:

- API keys appearing in application logs
- Secrets exposed in container stdout
- Credentials leaked in CI/CD pipeline logs
- Security audit failures

## Solution

- Removed all debug print statements from convert_api_key_to_secret_str()
- Added proper type hints: `def convert_api_key_to_secret_str(cls, v: str | SecretStr) -> SecretStr`
- Added a comprehensive docstring explaining function behavior
- Verified with mypy (all checks pass)

## Impact

- Eliminates API key exposure risk
- Fixes the mypy type checking error (the function was missing a type annotation)
- Maintains all existing functionality (SecretStr conversion)
- No breaking changes to API or behavior

Addresses blocking concern from PR review: #474 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
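The arithmetic behind the chunk-size claims above is easy to make explicit. A minimal sketch, assuming the commit's stated 512-token WatsonX limit and its conservative 2.5-characters-per-token ratio (`fits_embedding_limit` is an illustrative helper, not a function from the codebase):

```python
CHARS_PER_TOKEN = 2.5  # conservative ratio assumed in the commit message
TOKEN_LIMIT = 512      # WatsonX / IBM Slate embedding input limit


def fits_embedding_limit(chunk: str) -> bool:
    """Estimate whether a chunk stays under the 512-token embedding limit."""
    return len(chunk) / CHARS_PER_TOKEN <= TOKEN_LIMIT


# The old worst-case 24,654-char chunk estimates to ~9,862 tokens -- far over
# the limit -- while the new 596-char maximum estimates to ~238 tokens.
print(fits_embedding_limit("x" * 24_654))  # False
print(fits_embedding_limit("x" * 596))     # True
```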
1 parent 5c0e487 commit 5485537

File tree

10 files changed (+778, -86 lines)


backend/rag_solution/data_ingestion/chunking.py

Lines changed: 43 additions & 12 deletions
```diff
@@ -192,9 +192,38 @@ def sentence_based_chunking(
     for sentence in sentences:
         sentence_len = len(sentence)
 
-        # Check if adding this sentence would exceed target
-        if current_char_count + sentence_len > target_chars and current_chunk:
-            # Save current chunk
+        # Handle oversized sentences by splitting them
+        if sentence_len > target_chars:
+            # Save current chunk first if not empty
+            if current_chunk:
+                chunk_text = " ".join(current_chunk)
+                chunks.append(chunk_text)
+                current_chunk = []
+                current_char_count = 0
+
+            # Split oversized sentence into target-sized pieces
+            start = 0
+            while start < sentence_len:
+                end = min(start + target_chars, sentence_len)
+                # Try to break at word boundary
+                if end < sentence_len:
+                    last_space = sentence[start:end].rfind(" ")
+                    if last_space > target_chars * 0.5:  # At least 50% full
+                        end = start + last_space
+
+                chunk_piece = sentence[start:end].strip()
+                if chunk_piece:  # Only append non-empty chunks
+                    chunks.append(chunk_piece)
+                start = end
+
+            continue
+
+        # Account for space between sentences when joining
+        space_len = 1 if current_chunk else 0
+
+        # STRICT: Don't add sentence if it would exceed target
+        if current_char_count + space_len + sentence_len > target_chars and current_chunk:
+            # Save current chunk (don't add the sentence that would exceed)
             chunk_text = " ".join(current_chunk)
             chunks.append(chunk_text)
 
@@ -204,17 +233,18 @@ def sentence_based_chunking(
 
             for i in range(len(current_chunk) - 1, -1, -1):
                 sent_len = len(current_chunk[i])
-                if overlap_count + sent_len <= overlap_chars:
+                space = 1 if overlap_chunk else 0
+                if overlap_count + space + sent_len <= overlap_chars:
                     overlap_chunk.insert(0, current_chunk[i])
-                    overlap_count += sent_len
+                    overlap_count += sent_len + space
                 else:
                     break
 
             current_chunk = overlap_chunk
             current_char_count = overlap_count
 
         current_chunk.append(sentence)
-        current_char_count += sentence_len
+        current_char_count += sentence_len + space_len
 
     # Add final chunk if it meets minimum size
     if current_chunk:
@@ -368,19 +398,20 @@ def hierarchical_chunker_wrapper(text: str, settings: Settings = get_settings())
 def sentence_chunker(text: str, settings: Settings = get_settings()) -> list[str]:
     """Sentence-based chunking using settings configuration.
 
-    Uses conservative character-to-token ratio (2.5:1) for IBM Slate safety.
+    All config values (min_chunk_size, max_chunk_size, chunk_overlap) are in CHARACTERS.
+    Conservative char-to-token ratio (2.5:1) provides safety margin for IBM Slate 512-token limit.
 
     Args:
         text: Input text to chunk
-        settings: Configuration settings
+        settings: Configuration settings (all values in characters)
 
     Returns:
         List of sentence-based chunks
     """
-    # Convert config values assuming they're in tokens, multiply by 2.5 for chars
-    target_chars = int(settings.max_chunk_size * 2.5) if settings.max_chunk_size < 1000 else 750
-    overlap_chars = int(settings.chunk_overlap * 2.5) if settings.chunk_overlap < 200 else 100
-    min_chars = int(settings.min_chunk_size * 2.5) if settings.min_chunk_size < 500 else 500
+    # Use config values directly as characters (no conversion needed)
+    target_chars = settings.max_chunk_size
+    overlap_chars = settings.chunk_overlap
+    min_chars = settings.min_chunk_size
 
     return sentence_based_chunking(text, target_chars=target_chars, overlap_chars=overlap_chars, min_chars=min_chars)
```

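The oversized-sentence handling added above can be exercised on its own. The sketch below mirrors the new splitting loop (same word-boundary search, 50%-full heuristic, and empty-piece filter) as a standalone function; it is illustrative, not the actual `sentence_based_chunking()`:

```python
def split_oversized_sentence(sentence: str, target_chars: int) -> list[str]:
    """Split one too-long sentence into pieces of at most target_chars."""
    pieces: list[str] = []
    start = 0
    sentence_len = len(sentence)
    while start < sentence_len:
        end = min(start + target_chars, sentence_len)
        # Prefer a word boundary, but only if the piece stays at least half full
        if end < sentence_len:
            last_space = sentence[start:end].rfind(" ")
            if last_space > target_chars * 0.5:
                end = start + last_space
        piece = sentence[start:end].strip()
        if piece:  # stripping can leave an empty piece; never store those
            pieces.append(piece)
        start = end
    return pieces


# A run of words splits at spaces; a space-free run falls back to hard cuts.
print(split_oversized_sentence("word " * 4, 10))  # ['word word', 'word word']
print(split_oversized_sentence("a" * 25, 10))     # ['aaaaaaaaaa', 'aaaaaaaaaa', 'aaaaa']
```

Every returned piece is at most `target_chars` long, which is what keeps the worst case at 596 chars instead of 24,654.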
backend/rag_solution/repository/llm_model_repository.py

Lines changed: 16 additions & 5 deletions
```diff
@@ -6,7 +6,7 @@
 
 from rag_solution.core.exceptions import AlreadyExistsError, NotFoundError, ValidationError
 from rag_solution.models.llm_model import LLMModel
-from rag_solution.schemas.llm_model_schema import LLMModelInput, LLMModelOutput, ModelType
+from rag_solution.schemas.llm_model_schema import LLMModelInput, LLMModelOutput, LLMModelUpdate, ModelType
 
 
 class LLMModelRepository:
@@ -66,20 +66,29 @@ def get_models_by_type(self, model_type: ModelType) -> list[LLMModelOutput]:
         except Exception:
             raise
 
-    def update_model(self, model_id: UUID4, updates: dict) -> LLMModelOutput:
-        """Updates model details.
+    def update_model(self, model_id: UUID4, updates: LLMModelUpdate) -> LLMModelOutput:
+        """Updates model details with partial updates.
+
+        Args:
+            model_id: ID of the model to update
+            updates: LLMModelUpdate with optional fields for partial updates
 
         Raises:
             NotFoundError: If model not found
+
+        Note:
+            Only updates fields that are explicitly set in the updates object.
+            Uses Pydantic's exclude_unset=True to handle partial updates.
         """
         try:
             # Find the model first
             model = self.session.query(LLMModel).filter_by(id=model_id).first()
             if not model:
                 raise NotFoundError(resource_type="LLMModel", resource_id=str(model_id))
 
-            # Apply updates
-            for key, value in updates.items():
+            # Update only fields that were explicitly set (partial update support)
+            update_data = updates.model_dump(exclude_unset=True)
+            for key, value in update_data.items():
                 setattr(model, key, value)
 
             self.session.commit()
@@ -89,6 +98,8 @@ def update_model(self, model_id: UUID4, updates: dict) -> LLMModelOutput:
             self.session.rollback()
             raise AlreadyExistsError(resource_type="LLMModel", field="id", value=str(model_id)) from e
         except (NotFoundError, AlreadyExistsError, ValidationError):
+            # Rollback for safety even though these are typically raised before DB changes
+            self.session.rollback()
             raise
         except Exception:
             self.session.rollback()
```

backend/rag_solution/repository/llm_provider_repository.py

Lines changed: 20 additions & 8 deletions
```diff
@@ -6,7 +6,7 @@
 
 from rag_solution.core.exceptions import AlreadyExistsError, NotFoundError, ValidationError
 from rag_solution.models.llm_provider import LLMProvider
-from rag_solution.schemas.llm_provider_schema import LLMProviderInput
+from rag_solution.schemas.llm_provider_schema import LLMProviderInput, LLMProviderUpdate
 
 
 class LLMProviderRepository:
@@ -102,22 +102,33 @@ def get_provider_by_name_with_credentials(self, name: str) -> LLMProvider:
         except Exception:
             raise
 
-    def update_provider(self, provider_id: UUID4, updates: dict) -> LLMProvider:
-        """Updates provider details.
+    def update_provider(self, provider_id: UUID4, updates: LLMProviderUpdate) -> LLMProvider:
+        """Updates provider details with partial updates.
+
+        Args:
+            provider_id: ID of the provider to update
+            updates: LLMProviderUpdate with optional fields for partial updates
 
         Raises:
             NotFoundError: If provider not found
+
+        Note:
+            Only updates fields that are explicitly set in the updates object.
+            Uses Pydantic's exclude_unset=True to handle partial updates.
         """
         try:
-            # Handle SecretStr in updates
-            if "api_key" in updates and hasattr(updates["api_key"], "get_secret_value"):
-                updates["api_key"] = updates["api_key"].get_secret_value()
-
             # Find the provider first - this will raise NotFoundError if not found
             provider = self.get_provider_by_id(provider_id)
 
+            # Convert Pydantic model to dict, only including explicitly set fields
+            update_data = updates.model_dump(exclude_unset=True)
+
+            # Handle SecretStr in updates
+            if "api_key" in update_data and hasattr(update_data["api_key"], "get_secret_value"):
+                update_data["api_key"] = update_data["api_key"].get_secret_value()
+
             # Apply updates
-            for key, value in updates.items():
+            for key, value in update_data.items():
                 setattr(provider, key, value)
 
             self.session.commit()
@@ -128,6 +139,7 @@ def update_provider(self, provider_id: UUID4, updates: dict) -> LLMProvider:
             self.session.rollback()
             raise AlreadyExistsError(resource_type="LLMProvider", field="name", value=str(provider_id)) from e
         except (NotFoundError, AlreadyExistsError, ValidationError):
+            self.session.rollback()
             raise
         except Exception:
             self.session.rollback()
```

backend/rag_solution/router/llm_provider_router.py

Lines changed: 17 additions & 18 deletions
```diff
@@ -5,8 +5,8 @@
 from sqlalchemy.orm import Session
 
 from rag_solution.file_management.database import get_db
-from rag_solution.schemas.llm_model_schema import LLMModelInput, LLMModelOutput, ModelType
-from rag_solution.schemas.llm_provider_schema import LLMProviderInput, LLMProviderOutput
+from rag_solution.schemas.llm_model_schema import LLMModelInput, LLMModelOutput, LLMModelUpdate, ModelType
+from rag_solution.schemas.llm_provider_schema import LLMProviderInput, LLMProviderOutput, LLMProviderUpdate
 from rag_solution.services.llm_provider_service import LLMProviderService
 
 router = APIRouter(
@@ -58,15 +58,15 @@ def get_provider(provider_id: UUID4, service: Annotated[LLMProviderService, Depe
 
 @router.put("/{provider_id}", response_model=LLMProviderOutput)
 def update_provider(
-    provider_id: UUID4, updates: dict, service: Annotated[LLMProviderService, Depends(get_service)]
+    provider_id: UUID4, updates: LLMProviderUpdate, service: Annotated[LLMProviderService, Depends(get_service)]
 ) -> LLMProviderOutput:
     """
-    Update a specific LLM Provider.
+    Update a specific LLM Provider with partial updates.
+
+    Accepts LLMProviderUpdate with optional fields for partial updates.
+    Raises 404 if provider not found.
     """
-    provider = service.update_provider(provider_id, updates)
-    if not provider:
-        raise HTTPException(status_code=404, detail="Provider not found")
-    return provider
+    return service.update_provider(provider_id, updates)
 
 
 @router.delete("/{provider_id}")
@@ -122,24 +122,23 @@ def get_models_by_type(
 def get_model_by_id(model_id: UUID4, service: Annotated[LLMProviderService, Depends(get_service)]) -> LLMModelOutput:
     """
     Get a specific Model by ID.
+
+    Raises 404 if model not found.
     """
-    model = service.get_model_by_id(model_id)
-    if not model:
-        raise HTTPException(status_code=404, detail="Model not found")
-    return model
+    return service.get_model_by_id(model_id)
 
 
 @router.put("/models/{model_id}", response_model=LLMModelOutput)
 def update_model(
-    model_id: UUID4, updates: dict, service: Annotated[LLMProviderService, Depends(get_service)]
+    model_id: UUID4, updates: LLMModelUpdate, service: Annotated[LLMProviderService, Depends(get_service)]
 ) -> LLMModelOutput:
     """
-    Update a specific Model.
+    Update a specific Model with partial updates.
+
+    Accepts LLMModelUpdate with optional fields for partial updates.
+    Raises 404 if model not found.
    """
-    model = service.update_model(model_id, updates)
-    if not model:
-        raise HTTPException(status_code=404, detail="Model not found")
-    return model
+    return service.update_model(model_id, updates)
 
 
 @router.delete("/models/{model_id}")
```

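The router simplification rests on one design choice: the service now returns the output or raises, never `None`. A minimal sketch of that contract with stand-in types (names below are illustrative; in the real code a FastAPI exception handler maps `NotFoundError` to a 404 response):

```python
class NotFoundError(Exception):
    """Stand-in for rag_solution.core.exceptions.NotFoundError."""


_models = {"m1": {"name": "granite-13b"}}  # toy in-memory data store


def get_model_by_id(model_id: str) -> dict:
    """Return the model or raise -- the `Output | None` ambiguity is gone."""
    model = _models.get(model_id)
    if model is None:
        raise NotFoundError(model_id)
    return model


# Callers (like the router) no longer need `if not model: raise HTTPException`;
# one exception handler converts NotFoundError to 404 in a single place.
print(get_model_by_id("m1"))  # {'name': 'granite-13b'}
```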
backend/rag_solution/schemas/llm_model_schema.py

Lines changed: 23 additions & 0 deletions
```diff
@@ -27,6 +27,29 @@ class LLMModelInput(BaseModel):
     model_config = ConfigDict(protected_namespaces=())
 
 
+class LLMModelUpdate(BaseModel):
+    """Schema for partial updates to LLM models.
+
+    All fields are optional to support partial updates from API.
+    Use exclude_unset=True when converting to dict to only update provided fields.
+    """
+
+    model_id: str | None = None
+    default_model_id: str | None = None
+    model_type: ModelType | None = None
+    timeout: int | None = None
+    max_retries: int | None = None
+    batch_size: int | None = None
+    retry_delay: float | None = None
+    concurrency_limit: int | None = None
+    stream: bool | None = None
+    rate_limit: int | None = None
+    is_default: bool | None = None
+    is_active: bool | None = None
+
+    model_config = ConfigDict(protected_namespaces=())
+
+
 class LLMModelOutput(BaseModel):
     id: UUID4
     provider_id: UUID4
```

backend/rag_solution/schemas/llm_provider_schema.py

Lines changed: 26 additions & 4 deletions
```diff
@@ -16,6 +16,23 @@ class LLMProviderInput(BaseModel):
     user_id: UUID4 | None = Field(None, description="User ID who owns this provider")
 
 
+class LLMProviderUpdate(BaseModel):
+    """Schema for partial updates to LLM providers.
+
+    All fields are optional to support partial updates from API.
+    Use exclude_unset=True when converting to dict to only update provided fields.
+    """
+
+    name: str | None = None
+    base_url: str | None = None
+    api_key: SecretStr | None = None
+    org_id: str | None = None
+    project_id: str | None = None
+    is_active: bool | None = None
+    is_default: bool | None = None
+    user_id: UUID4 | None = None
+
+
 class LLMProviderOutput(BaseModel):
     """Schema for returning an LLM Provider."""
 
@@ -48,12 +65,17 @@ class LLMProviderConfig(BaseModel):
 
     @field_validator("api_key", mode="before")
     @classmethod
-    def convert_api_key_to_secret_str(cls, v):
-        """Convert string API key to SecretStr."""
+    def convert_api_key_to_secret_str(cls, v: str | SecretStr) -> SecretStr:
+        """Convert string API key to SecretStr.
+
+        Args:
+            v: API key as string or SecretStr
+
+        Returns:
+            SecretStr: Secured API key
+        """
         if isinstance(v, str):
-            print(f"DEBUG: Converting API key '{v}' to SecretStr")
             return SecretStr(v)
-        print(f"DEBUG: API key is not a string: {type(v)} = {v}")
         return v
 
     model_config = ConfigDict(from_attributes=True)
```
