feat: Implement script word count validation with adaptive retry #363

@manavgup

📋 Description

The LLM can generate podcast scripts that are significantly shorter or longer than requested, but the system accepts whatever comes back without validation or retry. This leads to podcasts with wildly incorrect durations.

Example:

  • User requests 15-minute podcast (target: 2,250 words at 150 WPM)
  • LLM generates only 500 words (ignores instruction)
  • System accepts it → Result: 3-minute podcast instead of 15 minutes

🎯 Goals

  1. Validate script word count after LLM generation
  2. Implement a retry mechanism when the word count falls outside the acceptable range
  3. Use adaptive prompts that learn from previous failures
  4. Track retry attempts for debugging and metrics

📍 Current State

Current Flow:

1. Calculate target word count (e.g., 2,250 for 15 min)
2. Ask LLM to generate script with word count instruction
3. LLM returns script (could be 500 words or 5,000 words!)
4. ❌ Accept whatever comes back - NO VALIDATION
5. Generate audio from potentially wrong-length script

Evidence:

  • See backend/tests/unit/test_podcast_duration_control_unit.py:
    • test_llm_generates_too_short_script_no_validation
    • test_llm_generates_too_long_script_no_validation
    • test_no_retry_mechanism_for_short_script
    • test_no_adaptive_prompt_based_on_previous_attempts

✅ Acceptance Criteria

Phase 1: Validation

  • Count words in generated script
  • Validate against target word count
  • If < 80% of target OR > 120% of target, mark as failed
  • Log word count mismatch details

Phase 2: Retry Mechanism

  • If validation fails, retry up to 3 times
  • Track retry count and reasons
  • Each retry uses adjusted prompt
  • After 3 failures, mark podcast as FAILED with reason

Phase 3: Adaptive Prompts

  • First attempt: Standard prompt with target word count
  • If too short: "Previous attempt was only X words. Generate EXACTLY Y words with more detail."
  • If too long: "Previous attempt was X words. Generate EXACTLY Y words, be more concise."
  • Include previous failure context in retry prompts

Phase 4: Voice Speed Consideration (Future)

  • Adjust word count calculation based on voice speed setting
  • At 1.5x speed: target_words = duration * 150 * 1.5
  • Store speed-adjusted target in metadata (see the sketch after this list)
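
The duration math above maps naturally onto the _calculate_target_word_count helper that the retry logic below already calls. A minimal sketch, assuming the 150 WPM baseline from this issue; the voice_speed parameter is the future Phase 4 extension, not current behavior:

WORDS_PER_MINUTE = 150  # baseline speaking rate assumed throughout this issue

def _calculate_target_word_count(
    self,
    duration_minutes: int,
    voice_speed: float = 1.0,
) -> int:
    """Target word count for a requested duration, scaled by playback speed.

    At 1.5x speed the same wall-clock duration fits 1.5x as many words.
    """
    return int(duration_minutes * WORDS_PER_MINUTE * voice_speed)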

🛠️ Implementation Strategy

Script Validation Logic

def _validate_script_word_count(
    self,
    script: str,
    target_word_count: int,
    min_word_count: int,
    max_word_count: int
) -> tuple[bool, int, str | None]:
    """Validate script word count is within acceptable range.
    
    Returns:
        (is_valid, actual_count, error_message)
    """
    actual_count = len(script.split())  # whitespace-delimited word count
    
    if actual_count < min_word_count:
        error = (
            f"Script too short: {actual_count} words "
            f"(need at least {min_word_count}, target {target_word_count})"
        )
        return False, actual_count, error
    
    if actual_count > max_word_count:
        error = (
            f"Script too long: {actual_count} words "
            f"(max {max_word_count}, target {target_word_count})"
        )
        return False, actual_count, error
    
    return True, actual_count, None

Retry Logic with Adaptive Prompts

async def _generate_script_with_retry(
    self,
    podcast_input: PodcastGenerationInput,
    rag_results: str,
    max_retries: int = 3
) -> str:
    """Generate script with validation and retry."""
    
    target_word_count = self._calculate_target_word_count(podcast_input.duration)
    min_word_count = int(target_word_count * 0.8)  # 80% of target
    max_word_count = int(target_word_count * 1.2)  # 120% of target
    
    previous_attempts: list[dict] = []
    
    for attempt in range(max_retries):
        # Generate prompt (adaptive based on previous failures)
        prompt_context = self._build_adaptive_prompt_context(
            previous_attempts, target_word_count
        )
        
        # Generate script
        script = await self._generate_script(
            podcast_input, rag_results, prompt_context
        )
        
        # Validate word count
        is_valid, actual_count, error = self._validate_script_word_count(
            script, target_word_count, min_word_count, max_word_count
        )
        
        if is_valid:
            logger.info(
                f"Script generated successfully: {actual_count} words "
                f"(target: {target_word_count}) on attempt {attempt + 1}"
            )
            return script
        
        # Track failed attempt
        previous_attempts.append({
            "attempt": attempt + 1,
            "actual_count": actual_count,
            "target_count": target_word_count,
            "error": error
        })
        
        logger.warning(f"Attempt {attempt + 1} failed: {error}")
    
    # All retries failed
    raise PodcastGenerationError(
        f"Failed to generate script with correct length after {max_retries} attempts. "
        f"Last attempt: {previous_attempts[-1]['actual_count']} words "
        f"(target: {target_word_count})"
    )

def _build_adaptive_prompt_context(
    self,
    previous_attempts: list[dict],
    target_word_count: int
) -> str:
    """Build adaptive prompt context based on previous failures."""
    
    if not previous_attempts:
        return ""
    
    last_attempt = previous_attempts[-1]
    last_count = last_attempt["actual_count"]
    
    if last_count < target_word_count:
        # Previous attempt was too short
        return (
            f"\n\nIMPORTANT: Your previous attempt was only {last_count} words, "
            f"which is too short. This time, generate EXACTLY {target_word_count} words "
            f"by adding more detail, examples, and explanations."
        )
    else:
        # Previous attempt was too long
        return (
            f"\n\nIMPORTANT: Your previous attempt was {last_count} words, "
            f"which is too long. This time, generate EXACTLY {target_word_count} words "
            f"by being more concise and focused."
        )

Schema Updates

Add to PodcastGenerationOutput:

class PodcastGenerationOutput(BaseModel):
    # ... existing fields ...
    
    # NEW FIELDS:
    script_word_count: int | None = None  # word count of the accepted script
    script_generation_attempts: int = 1  # total LLM calls made (1 = no retries needed)
    script_validation_warnings: list[str] | None = None  # errors from failed attempts, if any
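
Note that _generate_script_with_retry above returns only the script text, so the attempt metadata needs a path into these fields. A minimal sketch, assuming the retry helper is widened to also return its bookkeeping (the tuple shape is illustrative, not part of this issue):

# Hypothetical widened return from _generate_script_with_retry:
#     return script, attempt + 1, [a["error"] for a in previous_attempts]
script, attempts, warnings = await self._generate_script_with_retry(
    podcast_input, rag_results
)
output = PodcastGenerationOutput(
    # ... existing fields ...
    script_word_count=len(script.split()),
    script_generation_attempts=attempts,
    script_validation_warnings=warnings or None,
)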

🧪 Testing

Unit Tests

  • Test word count calculation
  • Test validation (too short, too long, just right; see the pytest sketch after this list)
  • Test retry mechanism (succeeds on 2nd attempt)
  • Test retry exhaustion (fails after 3 attempts)
  • Test adaptive prompt generation
  • Test voice speed adjustment (future)
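
A minimal pytest sketch for the validation cases, calling _validate_script_word_count directly. The import path follows the related-files list below, and the PodcastService class name is an assumption:

import pytest

from rag_solution.services.podcast_service import PodcastService  # class name assumed

@pytest.mark.parametrize(
    ("word_count", "expected_valid"),
    [
        (1799, False),  # just under 80% of 2,250: too short
        (1800, True),   # exactly 80% of target: accepted
        (2250, True),   # exactly on target
        (2700, True),   # exactly 120% of target: accepted
        (2701, False),  # just over 120%: too long
    ],
)
def test_validate_script_word_count(word_count: int, expected_valid: bool) -> None:
    service = PodcastService.__new__(PodcastService)  # skip __init__; validation needs no state
    script = " ".join(["word"] * word_count)
    is_valid, actual, error = service._validate_script_word_count(
        script, target_word_count=2250, min_word_count=1800, max_word_count=2700
    )
    assert actual == word_count
    assert is_valid is expected_valid
    assert (error is None) == expected_valid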

Integration Tests

  • Generate podcast with LLM that returns short script
  • Verify retry happens
  • Verify adaptive prompt is used
  • Verify success after retry (see the sketch after this list)
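
A hedged sketch of the retry-path test, stubbing _generate_script with an AsyncMock that returns a short script on the first call and an on-target script on the second. The podcast_service and podcast_input fixtures are assumptions, and the test assumes pytest-asyncio:

from unittest.mock import AsyncMock

import pytest

@pytest.mark.asyncio
async def test_retry_succeeds_on_second_attempt(podcast_service, podcast_input):
    short_script = " ".join(["word"] * 500)   # fails validation (< 80% of 2,250)
    good_script = " ".join(["word"] * 2250)   # exactly on target
    podcast_service._calculate_target_word_count = lambda duration: 2250
    podcast_service._generate_script = AsyncMock(side_effect=[short_script, good_script])

    script = await podcast_service._generate_script_with_retry(podcast_input, rag_results="")

    assert len(script.split()) == 2250
    assert podcast_service._generate_script.await_count == 2
    # The second call should carry the adaptive "too short" context.
    _, _, prompt_context = podcast_service._generate_script.await_args.args
    assert "too short" in prompt_context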

📊 Metrics to Track

  • Script generation attempts histogram (1, 2, 3, >3; see the sketch after this list)
  • Word count accuracy distribution
  • Retry success rate
  • Common failure patterns (always too short? always too long?)
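
A minimal sketch of how these could be exported, assuming prometheus_client (not specified in this issue; metric names are illustrative):

from prometheus_client import Counter, Histogram

# Buckets match the 1, 2, 3, >3 breakdown above (+Inf is appended automatically).
SCRIPT_GENERATION_ATTEMPTS = Histogram(
    "podcast_script_generation_attempts",
    "LLM attempts needed to produce a script with a valid word count",
    buckets=[1, 2, 3],
)
# Ratio of actual to target word count, to spot always-short / always-long drift.
WORD_COUNT_ACCURACY = Histogram(
    "podcast_script_word_count_ratio",
    "actual_word_count / target_word_count of the accepted script",
    buckets=[0.5, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5],
)
RETRY_EXHAUSTED = Counter(
    "podcast_script_retry_exhausted_total",
    "Podcasts that failed after exhausting all script retries",
)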

🔧 Configuration

Add to Settings:

# Podcast script validation
podcast_min_word_count_percentage: float = 0.8  # 80% of target
podcast_max_word_count_percentage: float = 1.2  # 120% of target
podcast_max_script_retries: int = 3
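
With these settings in place, the hardcoded 0.8 / 1.2 / 3 values in the retry sketch above would be read from configuration instead (the settings access pattern is assumed):

target_word_count = self._calculate_target_word_count(podcast_input.duration)
min_word_count = int(target_word_count * settings.podcast_min_word_count_percentage)
max_word_count = int(target_word_count * settings.podcast_max_word_count_percentage)
max_retries = settings.podcast_max_script_retries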

🔗 Related Files

  • backend/rag_solution/services/podcast_service.py:440-515 (_generate_script)
  • backend/rag_solution/schemas/podcast_schema.py
  • backend/tests/unit/test_podcast_duration_control_unit.py
  • backend/tests/PODCAST_DURATION_CONTROL_ANALYSIS.md

🏷️ Labels

enhancement, podcast, quality, llm, validation, retry-logic

💡 Future Enhancements

  • Machine learning to predict optimal word count based on content type
  • A/B testing different prompts for word count accuracy
  • User feedback loop: "Was this podcast too short/long?"
  • Automatic word count adjustment based on historical accuracy
