Commit 900e8e3
Feature/issue 602 podcast quality improvements (#627)
* feat: Implement podcast quality improvements - dynamic chapters, transcript downloads, and prompt leakage fix (#602)
Implements all three phases of Issue #602 to enhance podcast generation quality:
**Phase 1: Prompt Leakage Prevention**
- Add CoT hardening with XML tag separation (<thinking> and <script>)
- Create PodcastScriptParser with 5-layer fallback parsing (XML → JSON → Markdown → Regex → Full)
- Implement quality scoring (0.0-1.0) with artifact detection
- Add retry logic with quality threshold (min 0.6, max 3 attempts)
- Update PODCAST_SCRIPT_PROMPT with strict rules to prevent meta-information
- Fix 2 failing unit tests by updating mock responses
**Phase 2: Dynamic Chapter Generation**
- Add PodcastChapter schema with title, start_time, end_time, word_count
- Update PodcastScript, PodcastGenerationOutput, and Podcast model with chapters field
- Implement chapter extraction from HOST questions in script_parser.py
- Calculate accurate timestamps based on word counts (±10 sec accuracy @ 150 WPM)
- Add smart title extraction with pattern removal for clean chapter names
- Update podcast_repository.py to store/retrieve chapters as JSON
- Serialize chapters when marking podcasts complete
**Phase 3: Transcript Download**
- Create TranscriptFormatter utility with 2 formats:
- Plain text (.txt): Simple format with metadata header
- Markdown (.md): Formatted with table of contents and chapter timestamps
- Add download endpoint: GET /api/podcasts/{podcast_id}/transcript/download?format=txt|md
- Implement artifact cleaning and time formatting (HH:MM:SS)
- Add authentication and access control
- Return properly formatted downloadable files with correct Content-Disposition headers
**Files Changed:**
- Created: backend/rag_solution/utils/podcast_script_parser.py (374 lines)
- Created: backend/rag_solution/utils/transcript_formatter.py (247 lines)
- Updated: backend/rag_solution/schemas/podcast_schema.py
- Updated: backend/rag_solution/models/podcast.py
- Updated: backend/rag_solution/services/podcast_service.py
- Updated: backend/rag_solution/utils/script_parser.py
- Updated: backend/rag_solution/repository/podcast_repository.py
- Updated: backend/rag_solution/router/podcast_router.py
- Updated: tests/unit/services/test_podcast_service_unit.py
**Testing:**
- Unit tests: 1969/1969 passed (100%)
- Podcast integration tests: 7/7 passed (100%)
- All files pass linting checks (ruff)
- Maintains 90%+ test coverage for podcast service
**Technical Notes:**
- CoT hardening follows industry patterns (Anthropic Claude, OpenAI ReAct)
- Multi-layer fallback ensures robustness
- Chapter timestamps accurate to ±10 seconds
- Backward compatible (chapters default to empty list)
- Clean separation of concerns with utility classes
Closes #602
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: Add database migration for chapters column
Add migration scripts to add chapters JSONB column to podcasts table.
Migration can be applied using:
1. SQL: migrations/add_chapters_to_podcasts.sql
2. Python: poetry run python migrations/apply_chapters_migration.py
3. Docker: docker exec rag_modulo-postgres-1 psql -U rag_modulo_user -d rag_modulo -c "ALTER TABLE podcasts ADD COLUMN IF NOT EXISTS chapters JSONB DEFAULT '[]'::jsonb;"
The chapters column stores dynamic chapter markers with timestamps.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(podcast): Address type safety, code duplication, and linting issues
**Type Safety & Code Duplication (Issues #1, #2)**:
- Add _serialize_chapters() helper method with null check
- Refactor duplicate chapter serialization code (lines 414-424 and 1563-1573)
- Returns empty list if chapters is None (prevents TypeError)
**Linting (Issue #6)**:
- Remove unused chapters parameter from to_txt() method
- Update format_transcript() to not pass chapters to to_txt()
- Plain text format doesn't use chapters (only Markdown does)
Addresses PR #627 review comments.
* fix(podcast): Address critical PR #627 review issues
Fix 3 critical issues identified in PR #627 review:
1. **Migration Script Safety**: Replace autocommit with proper transactions
- Remove `conn.autocommit=True`
- Add explicit commit/rollback in try/except/finally blocks
- Prevents database inconsistency on errors
2. **ReDoS Mitigation**: Add input length validation
- Add MAX_INPUT_LENGTH=100KB constant to PodcastScriptParser
- Validate input length before regex operations
- Raises ValueError if input exceeds limit
- Protects against catastrophic backtracking
3. **Retry Logic Optimization**: Reduce cost and latency
- Reduce max_retries from 3→2 (saves ~30s, $0.01-0.05/retry)
- Add exponential backoff (2^attempt * 1.0s base delay)
- Apply backoff for both quality retries and error recovery
- Better handling of transient failures
Files modified:
- migrations/apply_chapters_migration.py: Transaction safety
- backend/rag_solution/utils/podcast_script_parser.py: ReDoS mitigation
- backend/rag_solution/services/podcast_service.py: Retry optimization
Addresses review comment: #627 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* test(podcast): Add comprehensive unit tests for podcast utilities
Add 76 unit tests covering:
**1. PodcastScriptParser (39 tests)**
- All 5 parsing strategies (XML, JSON, Markdown, Regex, Full Response)
- Quality scoring algorithm (0.0-1.0 confidence)
- Artifact detection (prompt leakage patterns)
- ReDoS mitigation (100KB input length validation)
- Script cleaning and whitespace normalization
- Edge cases (empty input, malformed JSON, non-ASCII chars)
**2. TranscriptFormatter (37 tests)**
- Plain text format (txt) with metadata header
- Markdown format (md) with chapters and TOC
- Time formatting (HH:MM:SS and MM:SS)
- Transcript cleaning (XML tags, metadata removal)
- Edge cases (empty transcripts, special characters, Unicode)
Test files:
- tests/unit/utils/test_podcast_script_parser.py (680 lines)
- tests/unit/utils/test_transcript_formatter.py (470 lines)
Coverage:
- podcast_script_parser.py: 100% coverage
- transcript_formatter.py: 100% coverage
All 76 tests pass in 0.3s.
Addresses PR #627 review comment requirement for comprehensive test coverage.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* test(podcast): Add integration tests for transcript download endpoint
Add 8 comprehensive integration tests for transcript download functionality:
**Test Coverage:**
1. Download transcript in TXT format
2. Download transcript in Markdown format with chapters
3. Handle podcast not found (404)
4. Handle podcast not completed (400)
5. Handle missing transcript field (404)
6. Verify filename generation logic
7. Verify chapter data in Markdown format
8. Verify Markdown format without chapters
**Integration Test Details:**
- Tests complete end-to-end workflow from service to formatter
- Mocked PodcastService with sample completed podcast
- Tests both txt and md format outputs
- Tests error conditions (not found, incomplete, missing transcript)
- Tests chapter handling (with/without chapters)
- Tests filename generation with/without title
**File Modified:**
- tests/integration/test_podcast_generation_integration.py (+300 lines)
All 8 tests pass in 6.4s.
Addresses PR #627 review comment requirement for comprehensive integration test coverage of the download transcript endpoint.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>1 parent aa63716 commit 900e8e3
File tree
8 files changed
+1700
-23
lines changed- backend/rag_solution
- services
- utils
- migrations
- tests
- integration
- unit/utils
8 files changed
+1700
-23
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
15 | 15 | | |
16 | 16 | | |
17 | 17 | | |
| 18 | + | |
18 | 19 | | |
19 | 20 | | |
20 | 21 | | |
| |||
412 | 413 | | |
413 | 414 | | |
414 | 415 | | |
415 | | - | |
416 | | - | |
417 | | - | |
418 | | - | |
419 | | - | |
420 | | - | |
421 | | - | |
422 | | - | |
423 | | - | |
| 416 | + | |
424 | 417 | | |
425 | 418 | | |
426 | 419 | | |
| |||
744 | 737 | | |
745 | 738 | | |
746 | 739 | | |
747 | | - | |
748 | | - | |
| 740 | + | |
| 741 | + | |
749 | 742 | | |
| 743 | + | |
750 | 744 | | |
751 | 745 | | |
752 | 746 | | |
753 | 747 | | |
754 | 748 | | |
755 | 749 | | |
| 750 | + | |
| 751 | + | |
| 752 | + | |
| 753 | + | |
| 754 | + | |
| 755 | + | |
756 | 756 | | |
757 | 757 | | |
758 | 758 | | |
| |||
807 | 807 | | |
808 | 808 | | |
809 | 809 | | |
| 810 | + | |
| 811 | + | |
| 812 | + | |
| 813 | + | |
810 | 814 | | |
811 | 815 | | |
812 | 816 | | |
| |||
1177 | 1181 | | |
1178 | 1182 | | |
1179 | 1183 | | |
| 1184 | + | |
| 1185 | + | |
| 1186 | + | |
| 1187 | + | |
| 1188 | + | |
| 1189 | + | |
| 1190 | + | |
| 1191 | + | |
| 1192 | + | |
| 1193 | + | |
| 1194 | + | |
| 1195 | + | |
| 1196 | + | |
| 1197 | + | |
| 1198 | + | |
| 1199 | + | |
| 1200 | + | |
| 1201 | + | |
| 1202 | + | |
| 1203 | + | |
| 1204 | + | |
| 1205 | + | |
| 1206 | + | |
| 1207 | + | |
1180 | 1208 | | |
1181 | 1209 | | |
1182 | 1210 | | |
| |||
1561 | 1589 | | |
1562 | 1590 | | |
1563 | 1591 | | |
1564 | | - | |
1565 | | - | |
1566 | | - | |
1567 | | - | |
1568 | | - | |
1569 | | - | |
1570 | | - | |
1571 | | - | |
1572 | | - | |
| 1592 | + | |
1573 | 1593 | | |
1574 | 1594 | | |
1575 | 1595 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
49 | 49 | | |
50 | 50 | | |
51 | 51 | | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
52 | 55 | | |
53 | 56 | | |
54 | 57 | | |
| |||
103 | 106 | | |
104 | 107 | | |
105 | 108 | | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
106 | 112 | | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
107 | 126 | | |
108 | 127 | | |
109 | 128 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
86 | 86 | | |
87 | 87 | | |
88 | 88 | | |
89 | | - | |
90 | 89 | | |
91 | 90 | | |
92 | 91 | | |
| |||
104 | 103 | | |
105 | 104 | | |
106 | 105 | | |
107 | | - | |
108 | 106 | | |
109 | 107 | | |
110 | 108 | | |
| |||
250 | 248 | | |
251 | 249 | | |
252 | 250 | | |
253 | | - | |
| 251 | + | |
254 | 252 | | |
255 | 253 | | |
256 | 254 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
0 commit comments