📋 Description
The LLM can generate podcast scripts that are significantly shorter or longer than requested, but the system accepts whatever comes back without validation or retry. This leads to podcasts with wildly incorrect durations.
Example:
- User requests 15-minute podcast (target: 2,250 words at 150 WPM)
- LLM generates only 500 words (ignores instruction)
- System accepts it → Result: 3-minute podcast instead of 15 minutes
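The 2,250-word target above is simply the requested duration multiplied by an assumed speaking rate of 150 words per minute. A minimal illustrative sketch of that calculation (the actual _calculate_target_word_count method in podcast_service.py is not shown in this issue and may differ):
WORDS_PER_MINUTE = 150

def calculate_target_word_count(duration_minutes: int) -> int:
    """Target word count, e.g. 15 minutes -> 15 * 150 = 2,250 words."""
    return duration_minutes * WORDS_PER_MINUTE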
🎯 Goals
- Validate script word count after LLM generation
- Implement a retry mechanism when the word count falls outside the accepted range
- Use adaptive prompts that learn from previous failures
- Track retry attempts for debugging and metrics
📍 Current State
Current Flow:
1. Calculate target word count (e.g., 2,250 for 15 min)
2. Ask LLM to generate script with word count instruction
3. LLM returns script (could be 500 words or 5,000 words!)
4. ❌ Accept whatever comes back - NO VALIDATION
5. Generate audio from potentially wrong-length script
Evidence:
- See backend/tests/unit/test_podcast_duration_control_unit.py:
  - test_llm_generates_too_short_script_no_validation
  - test_llm_generates_too_long_script_no_validation
  - test_no_retry_mechanism_for_short_script
  - test_no_adaptive_prompt_based_on_previous_attempts
✅ Acceptance Criteria
Phase 1: Validation
- Count words in generated script
- Validate against target word count
- If < 80% of target OR > 120% of target, mark as failed
- Log word count mismatch details
Phase 2: Retry Mechanism
- If validation fails, retry up to 3 times
- Track retry count and reasons
- Each retry uses adjusted prompt
- After 3 failures, mark the podcast as FAILED with a failure reason
Phase 3: Adaptive Prompts
- First attempt: Standard prompt with target word count
- If too short: "Previous attempt was only X words. Generate EXACTLY Y words with more detail."
- If too long: "Previous attempt was X words. Generate EXACTLY Y words, be more concise."
- Include previous failure context in retry prompts
Phase 4: Voice Speed Consideration (Future)
- Adjust word count calculation based on voice speed setting
- At 1.5x speed: target_words = duration * 150 * 1.5 (see the sketch after this list)
- Store speed-adjusted target in metadata
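The speed adjustment is just a multiplier on the base calculation. A minimal sketch extending the earlier illustrative helper; the voice_speed parameter is an assumption and does not exist in the current service code:
WORDS_PER_MINUTE = 150

def calculate_target_word_count(duration_minutes: int, voice_speed: float = 1.0) -> int:
    # voice_speed is hypothetical, e.g. 15 min at 1.5x -> 15 * 150 * 1.5 = 3,375 words
    return int(duration_minutes * WORDS_PER_MINUTE * voice_speed)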
🛠️ Implementation Strategy
Script Validation Logic
def _validate_script_word_count(
    self,
    script: str,
    target_word_count: int,
    min_word_count: int,
    max_word_count: int,
) -> tuple[bool, int, str | None]:
    """Validate script word count is within acceptable range.

    Returns:
        (is_valid, actual_count, error_message)
    """
    actual_count = len(script.split())

    if actual_count < min_word_count:
        error = (
            f"Script too short: {actual_count} words "
            f"(need at least {min_word_count}, target {target_word_count})"
        )
        return False, actual_count, error

    if actual_count > max_word_count:
        error = (
            f"Script too long: {actual_count} words "
            f"(max {max_word_count}, target {target_word_count})"
        )
        return False, actual_count, error

    return True, actual_count, None

Retry Logic with Adaptive Prompts
async def _generate_script_with_retry(
    self,
    podcast_input: PodcastGenerationInput,
    rag_results: str,
    max_retries: int = 3,
) -> str:
    """Generate script with validation and retry."""
    target_word_count = self._calculate_target_word_count(podcast_input.duration)
    min_word_count = int(target_word_count * 0.8)  # 80% of target
    max_word_count = int(target_word_count * 1.2)  # 120% of target

    previous_attempts: list[dict] = []

    for attempt in range(max_retries):
        # Generate prompt (adaptive based on previous failures)
        prompt_context = self._build_adaptive_prompt_context(
            previous_attempts, target_word_count
        )

        # Generate script
        script = await self._generate_script(
            podcast_input, rag_results, prompt_context
        )

        # Validate word count
        is_valid, actual_count, error = self._validate_script_word_count(
            script, target_word_count, min_word_count, max_word_count
        )

        if is_valid:
            logger.info(
                f"Script generated successfully: {actual_count} words "
                f"(target: {target_word_count}) on attempt {attempt + 1}"
            )
            return script

        # Track failed attempt
        previous_attempts.append({
            "attempt": attempt + 1,
            "actual_count": actual_count,
            "target_count": target_word_count,
            "error": error,
        })
        logger.warning(f"Attempt {attempt + 1} failed: {error}")

    # All retries failed
    raise PodcastGenerationError(
        f"Failed to generate script with correct length after {max_retries} attempts. "
        f"Last attempt: {previous_attempts[-1]['actual_count']} words "
        f"(target: {target_word_count})"
    )

def _build_adaptive_prompt_context(
    self,
    previous_attempts: list[dict],
    target_word_count: int,
) -> str:
    """Build adaptive prompt context based on previous failures."""
    if not previous_attempts:
        return ""

    last_attempt = previous_attempts[-1]
    last_count = last_attempt["actual_count"]

    if last_count < target_word_count:
        # Previous attempt was too short
        return (
            f"\n\nIMPORTANT: Your previous attempt was only {last_count} words, "
            f"which is too short. This time, generate EXACTLY {target_word_count} words "
            f"by adding more detail, examples, and explanations."
        )
    else:
        # Previous attempt was too long
        return (
            f"\n\nIMPORTANT: Your previous attempt was {last_count} words, "
            f"which is too long. This time, generate EXACTLY {target_word_count} words "
            f"by being more concise and focused."
        )

Schema Updates
Add to PodcastGenerationOutput:
class PodcastGenerationOutput(BaseModel):
    # ... existing fields ...

    # NEW FIELDS:
    script_word_count: int | None = None
    script_generation_attempts: int = 1
    script_validation_warnings: list[str] | None = None

🧪 Testing
Unit Tests
- Test word count calculation
- Test validation (too short, too long, just right)
- Test retry mechanism (succeeds on 2nd attempt; see the sketch after this list)
- Test retry exhaustion (fails after 3 attempts)
- Test adaptive prompt generation
- Test voice speed adjustment (future)
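A minimal sketch of the "succeeds on 2nd attempt" unit test referenced above, assuming a podcast_service pytest fixture that provides the service instance and pytest-asyncio for the async test; the mocked helpers mirror the implementation sketch in this issue rather than the actual code:
import pytest
from unittest.mock import AsyncMock, Mock

@pytest.mark.asyncio
async def test_retry_succeeds_on_second_attempt(podcast_service):
    # First attempt is ~500 words (below the 80% floor), second hits the target.
    short_script = "word " * 500
    good_script = "word " * 2250
    podcast_service._calculate_target_word_count = Mock(return_value=2250)
    podcast_service._generate_script = AsyncMock(
        side_effect=[short_script, good_script]
    )

    script = await podcast_service._generate_script_with_retry(
        Mock(), rag_results="unused", max_retries=3
    )

    assert script == good_script
    assert podcast_service._generate_script.call_count == 2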
Integration Tests
- Generate podcast with LLM that returns short script
- Verify retry happens
- Verify adaptive prompt is used
- Verify success after retry
📊 Metrics to Track
- Script generation attempts histogram (1, 2, 3, >3)
- Word count accuracy distribution
- Retry success rate
- Common failure patterns (always too short? always too long?)
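These metrics could be recorded with any metrics backend; below is a hedged sketch using prometheus_client with illustrative metric names, since the project's actual metrics stack is not specified in this issue:
from prometheus_client import Counter, Histogram

SCRIPT_GENERATION_ATTEMPTS = Histogram(
    "podcast_script_generation_attempts",
    "LLM attempts needed to produce a valid script",
    buckets=(1, 2, 3, float("inf")),
)
WORD_COUNT_ACCURACY = Histogram(
    "podcast_script_word_count_accuracy",
    "Actual word count as a fraction of the target",
    buckets=(0.5, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5, float("inf")),
)
SCRIPT_RETRY_FAILURES = Counter(
    "podcast_script_retry_failures_total",
    "Scripts that failed validation even after all retries",
    ("reason",),  # e.g. "too_short" / "too_long"
)

# Example usage after a successful generation:
# SCRIPT_GENERATION_ATTEMPTS.observe(attempts)
# WORD_COUNT_ACCURACY.observe(actual_count / target_word_count)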
🔧 Configuration
Add to Settings:
# Podcast script validation
podcast_min_word_count_percentage: float = 0.8  # 80% of target
podcast_max_word_count_percentage: float = 1.2  # 120% of target
podcast_max_script_retries: int = 3

🔗 Related Files
- backend/rag_solution/services/podcast_service.py:440-515 (_generate_script)
- backend/rag_solution/schemas/podcast_schema.py
- backend/tests/unit/test_podcast_duration_control_unit.py
- backend/tests/PODCAST_DURATION_CONTROL_ANALYSIS.md
🏷️ Labels
enhancement, podcast, quality, llm, validation, retry-logic
📚 References
- Related to issue #362: "feat: Implement audio duration measurement and quality gates" (audio duration measurement)
- Documented in the test files of PR #360: "feat: Major podcast UI improvements and authentication fixes"
- Analysis in PODCAST_DURATION_CONTROL_ANALYSIS.md
💡 Future Enhancements
- Machine learning to predict optimal word count based on content type
- A/B testing different prompts for word count accuracy
- User feedback loop: "Was this podcast too short/long?"
- Automatic word count adjustment based on historical accuracy