Podcast Quality Improvements: Dynamic Chapters, Transcript Download, and Prompt Leakage Fix

## Overview

Improve podcast quality and user experience by fixing prompt leakage in transcripts, implementing dynamic chapters from Q&A structure, and adding transcript download functionality.

---

## Problem 1: Prompt Leakage in Transcript 🐛

**Current Behavior:**
The podcast transcript includes internal LLM prompts and instructions that should not be visible to users.

**Example from transcript:**
```
Thank you for having me. It was a pleasure to share IBM's comprehensive approach...
[End of script]

This script covers key topics from the provided documents...
Word count: 3,200 (approximately 20 minutes at 160 words/minute)

**Instruction 3 (Most Difficult):** Create a comprehensive podcast script that explores...
```

**Root Cause:**
Similar to the Chain of Thought (CoT) leakage issue fixed in #461, the LLM's reasoning/instructions are bleeding into the final output.

**Proposed Solution:**
Apply the same hardening pattern used for CoT:
1. **Structured output with XML tags**: `<thinking>` and `<script>`
2. **Multi-layer parsing**: 5 fallback strategies (XML → JSON → markers → regex → full response)
3. **Quality scoring**: Confidence assessment (0.0-1.0) with artifact detection
4. **Retry logic**: Up to 3 attempts with quality threshold validation
5. **Enhanced prompts**: System rules + few-shot examples

**Reference:**
- CoT hardening implementation: `docs/features/chain-of-thought-hardening.md`
- Original CoT fix: Issue #461

---

## Problem 2: Hardcoded Chapters 📚

**Current Behavior:**
Podcast chapters are hardcoded placeholder data:
```
00:00 - 01:00 | Introduction & Welcome
01:00 - 02:30 | IBM's Technology Stack Overview
02:30 - 04:00 | Strategic Evolution
04:00 - 06:00 | Future Investments & Focus Areas
```

**Proposed Solution:**
Generate chapters dynamically from the actual podcast Q&A structure:

1. **Extract questions/topics** from the HOST/EXPERT dialogue
2. **Generate timestamps** for each section based on word count
3. **Make chapters clickable** to jump to that part of the audio
4. **Format**: `00:00 - 01:30 | How does IBM's business strategy work?`

**Benefits:**
- Accurate reflection of actual content
- Better user navigation
- Improved accessibility

---

## Problem 3: Missing Transcript Download Button 📥

**Current Behavior:**
Users can view the transcript on the page but cannot download it.

**Proposed Solution:**
Add a **"Download Transcript"** button at the top of the podcast page (next to "Share" and "Hide Transcript").

**Requirements:**
- Download as `.txt` or `.md` format
- Include podcast metadata (title, duration, date)
- Clean transcript without prompt leakage artifacts

---

## Implementation Plan

### Phase 1: Fix Prompt Leakage (Critical)
- [ ] Apply CoT hardening pattern to podcast script generation
- [ ] Implement XML tag separation (`<thinking>` and `<script>`)
- [ ] Add multi-layer parsing with fallback strategies
- [ ] Implement quality scoring and retry logic
- [ ] Add comprehensive testing

### Phase 2: Dynamic Chapters
- [ ] Parse HOST/EXPERT dialogue structure
- [ ] Extract questions/topics from script
- [ ] Calculate timestamps based on word count
- [ ] Update frontend to render dynamic chapters
- [ ] Add clickable chapter navigation

### Phase 3: Transcript Download
- [ ] Add "Download Transcript" button to UI
- [ ] Implement transcript download endpoint
- [ ] Format transcript with metadata
- [ ] Support `.txt` and `.md` formats

---

## Acceptance Criteria

### Prompt Leakage Fix
- [ ] Transcripts contain only the actual podcast dialogue
- [ ] No LLM instructions or meta-commentary visible
- [ ] Quality score ≥ 0.6 (configurable)
- [ ] Retry logic handles failures gracefully

### Dynamic Chapters
- [ ] Chapters reflect actual Q&A structure
- [ ] Timestamps are accurate (±10 seconds)
- [ ] Chapters are clickable and navigate to correct position
- [ ] UI displays chapters in collapsible format

### Transcript Download
- [ ] Button appears on podcast details page
- [ ] Download includes clean transcript + metadata
- [ ] Supports `.txt` and `.md` formats
- [ ] File naming: `{podcast_title}_transcript.{format}`

---

## Testing

### Manual Testing
- [ ] Generate new podcast and verify clean transcript
- [ ] Verify chapters match actual content
- [ ] Test chapter click navigation
- [ ] Download transcript and verify format
- [ ] Test with different podcast lengths (5, 15, 30 min)

### Automated Testing
- [ ] Unit tests for prompt parsing logic
- [ ] Integration tests for chapter generation
- [ ] API tests for transcript download endpoint

---

## References

- CoT Hardening: `docs/features/chain-of-thought-hardening.md`
- CoT Quick Reference: `docs/features/cot-quick-reference.md`
- Original CoT Issue: #461
- Podcast Service: `backend/rag_solution/services/podcast_service.py`
- Podcast Router: `backend/rag_solution/router/podcast_router.py`

---

## Priority

**High** - Affects user experience and podcast quality

## Estimated Effort

- Prompt Leakage Fix: 4-6 hours (reuse CoT patterns)
- Dynamic Chapters: 3-4 hours
- Transcript Download: 2-3 hours

**Total:** ~10-13 hours

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Podcast Quality Improvements: Dynamic Chapters, Transcript Download, and Prompt Leakage Fix #602

Overview

Problem 1: Prompt Leakage in Transcript 🐛

Problem 2: Hardcoded Chapters 📚

Problem 3: Missing Transcript Download Button 📥

Implementation Plan

Phase 1: Fix Prompt Leakage (Critical)

Phase 2: Dynamic Chapters

Phase 3: Transcript Download

Acceptance Criteria

Prompt Leakage Fix

Dynamic Chapters

Transcript Download

Testing

Manual Testing

Automated Testing

References

Priority

Estimated Effort

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Podcast Quality Improvements: Dynamic Chapters, Transcript Download, and Prompt Leakage Fix #602

Description

Overview

Problem 1: Prompt Leakage in Transcript 🐛

Problem 2: Hardcoded Chapters 📚

Problem 3: Missing Transcript Download Button 📥

Implementation Plan

Phase 1: Fix Prompt Leakage (Critical)

Phase 2: Dynamic Chapters

Phase 3: Transcript Download

Acceptance Criteria

Prompt Leakage Fix

Dynamic Chapters

Transcript Download

Testing

Manual Testing

Automated Testing

References

Priority

Estimated Effort

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions