Commit ac5f4a9
fix: Improve chunking robustness and type safety
This commit fixes three critical issues found while investigating poor
search result accuracy and chunking behavior, and adds typing guidelines:
## 1. Fix Oversized Sentence Handling in Chunking (Issue #1)
- **Problem**: Markdown tables and long sentences produced chunks of up to 24,654 chars
(exceeding the WatsonX embedding model's 512-token limit)
- **Root cause**: sentence_based_chunking() added entire sentences regardless of size
- **Fix**: Split oversized sentences at word boundaries before adding to chunks
- **Impact**: Max chunk reduced from 24,654 to 596 chars (~238 tokens)
- **File**: backend/rag_solution/data_ingestion/chunking.py:195-217
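The word-boundary split described above can be sketched as follows. This is a minimal illustration, not the code in `chunking.py`: the function name `split_oversized_sentence` and the `max_chars` parameter are hypothetical, and the hard-cut fallback for a single word longer than the limit is an assumption the commit message does not mention.

```python
def split_oversized_sentence(sentence: str, max_chars: int) -> list[str]:
    """Split a sentence into pieces of at most max_chars, breaking at whitespace.

    Hypothetical helper for illustration; the real fix may differ in detail.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(sentence) <= max_chars:
        return [sentence]

    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        # Assumption: a single word longer than the limit has no boundary
        # to break on, so it is cut hard at max_chars.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this word would overflow, so close the current piece.
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```

With a helper like this, the chunker can pass every sentence through the split before appending, so a long Markdown table row never ends up in a chunk as one unbroken unit.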
## 2. Fix Configuration Consistency Across Chunking Strategies (Issue #2)
- **Problem**: sentence_chunker() multiplied configured sizes by 2.5 (treating them as tokens),
while other strategies used the same values directly as characters
- **Root cause**: Inconsistent interpretation across chunking strategies
- **Fix**: Standardized ALL strategies to use CHARACTERS, removed 2.5x multiplier
- **Impact**: Predictable, maintainable configuration across all strategies
- **File**: backend/rag_solution/data_ingestion/chunking.py:409-414
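To illustrate what "all strategies use characters" means in practice, here is a small sketch in which two strategies read the same limit with no conversion. The names `ChunkingConfig`, `fixed_size_chunks`, `sentence_chunks`, and `estimate_tokens` are hypothetical and not taken from the repository. The 2.5 chars-per-token ratio is the one implied by the commit message and appears here only for reporting.

```python
import re
from dataclasses import dataclass

# Rough ratio implied by the commit message; used for reporting only,
# never to rescale the configured limits.
CHARS_PER_TOKEN_ESTIMATE = 2.5


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical config: every size is in characters."""
    max_chunk_size: int


def fixed_size_chunks(text: str, config: ChunkingConfig) -> list[str]:
    """Cut the text into consecutive slices of max_chunk_size characters."""
    size = config.max_chunk_size
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, config: ChunkingConfig) -> list[str]:
    """Pack whole sentences into chunks using the same character limit.

    Assumes oversized sentences were already split upstream.
    """
    chunks: list[str] = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= config.max_chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_tokens(chunk: str) -> int:
    """Approximate token count for logging, based on the assumed ratio."""
    return round(len(chunk) / CHARS_PER_TOKEN_ESTIMATE)
```

Because both functions compare against `config.max_chunk_size` directly, changing the configured value has the same effect whichever strategy is selected.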
## 3. Fix Type Safety in LLM Model Repository (Issue #3)
- **Problem**: update_model() used duck-typing with hasattr() and dict type erasure
- **Root cause**: Poor type safety, no IDE autocomplete, runtime errors possible
- **Fix**: update_model() now accepts only the LLMModelInput Pydantic type and builds the update with model_dump(exclude_unset=True)
- **Impact**: Better type checking, maintainability, IDE support
- **File**: backend/rag_solution/repository/llm_model_repository.py:69-92
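The partial-update pattern named in the fix can be sketched like this, assuming Pydantic v2 (where `model_dump` exists). The fields on `LLMModelInput`, the `StoredModel` stand-in for the database row, and the body of `update_model` are all hypothetical; the repository's real schema and persistence code are not shown in this commit message.

```python
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel


class LLMModelInput(BaseModel):
    """Hypothetical fields for illustration only."""
    name: Optional[str] = None
    timeout: Optional[int] = None
    is_active: Optional[bool] = None


@dataclass
class StoredModel:
    """Stand-in for the persisted record (an ORM row in the real code)."""
    name: str
    timeout: int
    is_active: bool


def update_model(stored: StoredModel, updates: LLMModelInput) -> StoredModel:
    """Apply only the fields the caller explicitly set."""
    # exclude_unset=True drops fields that were left at their defaults,
    # so an omitted field never overwrites a stored value.
    changes = updates.model_dump(exclude_unset=True)
    for field, value in changes.items():
        setattr(stored, field, value)
    return stored


record = StoredModel(name="granite", timeout=30, is_active=True)
update_model(record, LLMModelInput(timeout=60))
```

Typing the parameter as `LLMModelInput` lets mypy and the IDE reject a plain dict or an unrelated object at the call site, which the earlier `hasattr()` checks could only catch at runtime.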
## 4. Add Strict Typing Guidelines (New)
- Comprehensive documentation for type safety best practices
- Covers Pydantic models, type hints, mypy configuration
- **File**: docs/development/backend/strict-typing-guidelines.md
## Testing
- Chunking: Validated max chunk size reduced from 24,654 to 596 chars
- Type safety: All mypy checks pass
- Embedding comparison: Tested 8 models (IBM Slate, Granite, E5, MiniLM)
## Related Issues
- Addresses root causes discovered while investigating GitHub #461 (CoT reasoning)
- Created follow-up issues: #465-473 for remaining search accuracy problems
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 5c0e487 commit ac5f4a9
File tree
3 files changed (+475, -15 lines)
- backend/rag_solution
  - data_ingestion
  - repository
- docs/development/backend