Commit 5485537
fix: Improve chunking robustness and type safety (#474)
* fix: Improve chunking robustness and type safety
This commit addresses three critical issues discovered during investigation
of poor search result accuracy and chunking behavior:
## 1. Fix Oversized Sentence Handling in Chunking (Issue #1)
- **Problem**: Markdown tables and long sentences caused chunks up to 24,654 chars
(exceeding WatsonX embedding 512-token limit)
- **Root cause**: sentence_based_chunking() added entire sentences regardless of size
- **Fix**: Split oversized sentences at word boundaries before adding to chunks
- **Impact**: Max chunk reduced from 24,654 to 596 chars (~238 tokens)
- **File**: backend/rag_solution/data_ingestion/chunking.py:195-217
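The word-boundary split described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in chunking.py: the helper name `split_oversized_sentence` and the `max_chars` parameter are assumptions, and the hard cut for words longer than the limit is one possible choice for input without spaces.

```python
def split_oversized_sentence(sentence: str, max_chars: int) -> list[str]:
    """Split one sentence into pieces of at most max_chars characters.

    Breaks fall on whitespace where possible. A single word longer than
    max_chars is hard-cut, since it has no word boundary to use.
    (Hypothetical helper; the real implementation may differ.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        # Hard-cut words that cannot fit into any piece on their own.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]

        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current piece, start a new one.
            pieces.append(current)
            current = word

    if current:
        pieces.append(current)
    return pieces
```

Because `str.split()` never yields empty strings, this sketch cannot emit empty pieces, so whitespace-only input returns an empty list.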
## 2. Fix Configuration Consistency Across Chunking Strategies (Issue #2)
- **Problem**: sentence_chunker() multiplied config by 2.5 (assumed tokens),
while other strategies used values as characters directly
- **Root cause**: Inconsistent interpretation across chunking strategies
- **Fix**: Standardized ALL strategies to use CHARACTERS, removed 2.5x multiplier
- **Impact**: Predictable, maintainable configuration across all strategies
- **File**: backend/rag_solution/data_ingestion/chunking.py:409-414
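To make the unit question concrete, here is a small sketch of a character-based config with a separate token estimate. The `ChunkingConfig` class and its field names are hypothetical, and the 2.5 characters-per-token ratio is only the rough figure implied by this commit message (596 chars ≈ 238 tokens), not a measured tokenizer value.

```python
from dataclasses import dataclass

# Assumption: rough ratio implied by the commit message, not a tokenizer result.
CHARS_PER_TOKEN = 2.5


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical config in which every size is measured in characters."""

    max_chunk_chars: int
    overlap_chars: int = 0


def estimate_tokens(num_chars: int) -> int:
    """Estimate a token count from a character count."""
    return round(num_chars / CHARS_PER_TOKEN)


def fits_embedding_limit(config: ChunkingConfig, token_limit: int = 512) -> bool:
    """Check whether the largest allowed chunk stays under the token limit."""
    return estimate_tokens(config.max_chunk_chars) <= token_limit
```

Keeping the conversion in one helper means no chunking strategy has to apply its own multiplier to the configured values.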
## 3. Fix Type Safety in LLM Model Repository (Issue #3)
- **Problem**: update_model() used duck-typing with hasattr() and dict type erasure
- **Root cause**: Poor type safety, no IDE autocomplete, runtime errors possible
- **Fix**: Changed to only accept LLMModelInput Pydantic type, use model_dump(exclude_unset=True)
- **Impact**: Better type checking, maintainability, IDE support
- **File**: backend/rag_solution/repository/llm_model_repository.py:69-92
## 4. Add Strict Typing Guidelines (New)
- Comprehensive documentation for type safety best practices
- Covers Pydantic models, type hints, mypy configuration
- **File**: docs/development/backend/strict-typing-guidelines.md
## Testing
- Chunking: Validated max chunk size reduced from 24,654 to 596 chars
- Type safety: All mypy checks pass
- Embedding comparison: Tested 8 models (IBM Slate, Granite, E5, MiniLM)
## Related Issues
- Addresses root causes discovered while investigating GitHub #461 (CoT reasoning)
- Created follow-up issues: #465-473 for remaining search accuracy problems
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Address PR feedback - type safety, tests, and error handling
This commit addresses all feedback from PR #474:
## 1. Fix Type Mismatch in Service Layer (CRITICAL)
- **Problem**: llm_model_service.py passed dict[str, Any] to repository expecting LLMModelInput
- **Fix**: Convert dict updates to LLMModelInput Pydantic models in service layer
- **Files**: backend/rag_solution/services/llm_model_service.py:67, 102-133
- **Impact**: Type-safe service-to-repository communication, prevents runtime errors
## 2. Add Rollback for All Exception Types
- **Problem**: NotFoundError/ValidationError raised without rollback (potential transaction leaks)
- **Fix**: Added explicit rollback for all exception types in update_model()
- **File**: backend/rag_solution/repository/llm_model_repository.py:96-99
- **Impact**: Safer transaction handling, prevents DB inconsistencies
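A minimal sketch of the rollback-on-every-exception pattern, using stand-ins so it runs without a database. `FakeSession`, the two exception classes, and the dict-backed `store` are all hypothetical; the repository's actual session handling and error types are not shown in this commit message.

```python
class NotFoundError(Exception):
    """Stand-in for the project's not-found error (hypothetical)."""


class ValidationError(Exception):
    """Stand-in for the project's validation error (hypothetical)."""


class FakeSession:
    """Minimal session double that only counts commits and rollbacks."""

    def __init__(self) -> None:
        self.commits = 0
        self.rollbacks = 0

    def commit(self) -> None:
        self.commits += 1

    def rollback(self) -> None:
        self.rollbacks += 1


def update_model(
    session: FakeSession,
    store: dict[str, dict[str, object]],
    model_id: str,
    updates: dict[str, object],
) -> dict[str, object]:
    """Apply updates, rolling back on any failure before re-raising."""
    try:
        if model_id not in store:
            raise NotFoundError(f"model {model_id!r} not found")
        if not updates:
            raise ValidationError("no fields to update")
        store[model_id].update(updates)
        session.commit()
        return store[model_id]
    except Exception:
        # Domain errors and unexpected errors alike leave no open transaction.
        session.rollback()
        raise
```

The bare `raise` re-raises the original exception unchanged, so callers still see `NotFoundError` or `ValidationError` rather than a wrapped error.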
## 3. Handle Empty Strings After .strip()
- **Problem**: Oversized sentence splitting could create empty chunks after stripping
- **Fix**: Added a check that skips empty chunks (lines 215-216)
- **File**: backend/rag_solution/data_ingestion/chunking.py:214-216
- **Impact**: Prevents empty chunks from being stored in vector DB
## 4. Add Unit Tests for Oversized Sentence Splitting
- **New Tests**: 5 comprehensive tests for Issue #1 fix
- test_oversized_sentence_splits_at_word_boundaries
- test_markdown_table_splits_correctly
- test_very_long_sentence_without_spaces
- test_normal_sentences_not_affected
- test_empty_string_chunks_are_filtered
- **File**: backend/tests/unit/test_chunking.py:29-102
- **Coverage**: Edge cases for oversized sentences, markdown tables, whitespace handling
## Testing
- All linting checks pass (ruff, mypy); missing type annotations added
- All modified files pass type checking
- New unit tests validate oversized sentence handling
Addresses feedback from: #474 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor: Replace dict[str, Any] with strongly-typed Update schemas
## Problem
API contracts used weakly-typed `dict[str, Any]` for partial updates, violating
the project's strong typing principles. This created multiple issues:
1. **Type Safety**: dict[str, Any] bypasses Pydantic validation
2. **Partial Updates**: No clear API contract for which fields are optional
3. **Duplicate Queries**: Service layer fetched models, then repository fetched again
4. **Return Type Ambiguity**: Methods returned `Output | None` instead of raising exceptions
## Solution
Created strongly-typed Update schemas with all-optional fields for partial updates:
### LLM Models
- Created `LLMModelUpdate` schema (all fields optional)
- Updated `LLMModelRepository.update_model()` to accept `LLMModelUpdate`
- Updated `LLMModelService.update_model()` to use `LLMModelUpdate`
- Updated router to accept `LLMModelUpdate` instead of `dict`
- Changed return types from `Output | None` to `Output` (let NotFoundError propagate)
### LLM Providers
- Created `LLMProviderUpdate` schema (all fields optional)
- Updated `LLMProviderRepository.update_provider()` to accept `LLMProviderUpdate`
- Updated `LLMProviderService.update_provider()` to use `LLMProviderUpdate`
- Updated router to accept `LLMProviderUpdate` instead of `dict`
- Changed return type from `Output | None` to `Output`
### Service Layer Improvements
- Removed duplicate DB fetches (repository now handles single query)
- Eliminated manual field merging (Pydantic's `exclude_unset=True` handles it)
- Simplified service methods from 30 lines to 1-3 lines
- All methods now use proper exception propagation instead of returning None
## Impact
- **Type Safety**: Full Pydantic validation on all partial updates
- **Performance**: Eliminated the duplicate fetch per update (2 DB calls → 1 DB call)
- **Maintainability**: Update schemas automatically track model changes
- **API Clarity**: Clear contract for partial updates via Update schemas
## Files Changed
- rag_solution/schemas/llm_model_schema.py: Added LLMModelUpdate
- rag_solution/schemas/llm_provider_schema.py: Added LLMProviderUpdate
- rag_solution/repository/llm_model_repository.py: Use LLMModelUpdate
- rag_solution/repository/llm_provider_repository.py: Use LLMProviderUpdate
- rag_solution/services/llm_model_service.py: Simplified with LLMModelUpdate
- rag_solution/services/llm_provider_service.py: Simplified with LLMProviderUpdate
- rag_solution/router/llm_provider_router.py: Use Update schemas
Addresses PR feedback: #474 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* security: Remove debug statements that expose API keys to stdout
CRITICAL SECURITY FIX - This resolves a blocking issue and must be merged.
## Problem
Lines 71 and 73 in llm_provider_schema.py contained debug print statements
that would expose API keys to stdout in plain text:
```python
print(f"DEBUG: Converting API key '{v}' to SecretStr")
print(f"DEBUG: API key is not a string: {type(v)} = {v}")
```
This violates the principle of keeping secrets secure and could lead to:
- API keys appearing in application logs
- Secrets exposed in container stdout
- Credentials leaked in CI/CD pipeline logs
- Security audit failures
## Solution
- Removed all debug print statements from convert_api_key_to_secret_str()
- Added proper type hints: `def convert_api_key_to_secret_str(cls, v: str | SecretStr) -> SecretStr`
- Added comprehensive docstring explaining function behavior
- Verified with mypy (all checks pass)
## Impact
- Eliminates API key exposure risk
- Fixes mypy type checking error (Function is missing a type annotation)
- Maintains all existing functionality (SecretStr conversion)
- No breaking changes to API or behavior
Addresses blocking concern from PR review:
#474 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
1 parent 5c0e487 commit 5485537
File tree
10 files changed, +778 -86 lines changed
- backend
  - rag_solution
    - data_ingestion
    - repository
    - router
    - schemas
    - services
  - tests/unit
- docs/development/backend