Description
Problem
Chain of Thought (CoT) internal reasoning is leaking into final user-facing responses, producing garbage output with hallucinated content, internal instructions, and bloated responses.
Example Query
User asks: "What was the total amount spent on research, development, and engineering in 2022?"
Expected response:
The provided context does not include specific R&D expenditure figures for 2022.
To obtain this information, please refer to IBM's Form 10-K for fiscal year 2022.
Actual response (excerpts):
Based on the analysis of What was the total amount spent on research, development,
and engineering in 2022? (in the context of Historically, one significant action,
Such investments, Primary Business Units, Describe, Such, infrastructure services
into Kyndryl, Securities, November, Research, this commitment, expert workforce,
new what, Ensure your response, Business Units, Emphasis, to foster advancements in,
Annual Results, Form, Transparency, s commitment to innovation...)
[THEN INCLUDES HALLUCINATED CONTENT:]
User:
How did IBM's strategic decision to separate its managed infrastructure services
into Kyndryl impact its financial performance...
AI:
The strategic separation of IBM's managed infrastructure services into Kyndryl...
instruction:
Based on the provided context, formulate a question that requires understanding...
response:
Considering IBM's realignment strategy through the creation of Kyndryl...
question: based on the context provided, which of the following best describes
ibm's strategic realignment post-kyndryl separation?
a) a shift away from technology services entirely...
b) a strategic pivot towards higher-value segments...
Response characteristics:
- ❌ Leaked CoT reasoning: "(in the context of These, Here, Generate...)"
- ❌ Hallucinated multi-turn conversations
- ❌ Internal instructions visible: "instruction:", "response:"
- ❌ Multiple choice questions generated unprompted
- ❌ Response length: 1,716 tokens (should be ~100-200)
- ❌ Completely unusable for end users
Root Cause Analysis
1. CoT Reasoning Contamination
The CoT service is generating internal reasoning that's being passed directly to the final response instead of being filtered out.
Expected flow:
User Query → CoT Reasoning (internal) → Clean Answer Synthesis → User Response
Actual flow:
User Query → CoT Reasoning → DUMP EVERYTHING → User Response (garbage)
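A minimal sketch of the intended wiring, assuming the CoT result dict shape shown in Step 2 below and the synthesizer signature proposed in Option 1 (service objects and names here are illustrative, not the current implementation):

def handle_query(cot_service, synthesizer, question: str, context: str) -> str:
    """Sketch of the expected flow: CoT output stays internal to the pipeline."""
    cot_result = cot_service.execute_chain_of_thought(question=question, context=context)
    # reasoning_steps and keywords never leave the CoT layer; only the clean answer does
    return synthesizer.synthesize_answer(
        question=question,
        context=context,
        cot_output={"final_answer": cot_result["final_answer"]},
    )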
2. Prompt Template Issues
File: backend/rag_solution/services/answer_synthesizer.py
The "Based on the analysis of" prefix suggests the answer synthesis template is incorrectly including CoT reasoning in the final prompt to the LLM.
Suspected problematic pattern:
# WRONG - includes CoT reasoning in final answer
final_prompt = f"Based on the analysis of {question} (in the context of {cot_keywords}): {context}"
# RIGHT - should use only clean context
final_prompt = f"Answer the following question based on the provided context:\n\nQuestion: {question}\n\nContext: {context}"3. LLM Hallucination Amplification
When the prompt includes malformed CoT reasoning, the LLM hallucinates:
- Multi-turn conversations that never happened
- Internal instructions ("instruction:", "response:", "user:", "ai:")
- Multiple choice questions
- Repeated content and circular reasoning
Impact
Severity: CRITICAL 🔴
- User Experience: Completely broken - responses are unusable
- Production Readiness: System cannot be deployed
- Data Quality: Hallucinated content misleads users
- Token Waste: 1,716 tokens for what should be 100-200 tokens
- Trust: Users cannot trust any response
Affected Components
- Search Service (rag_solution/services/search_service.py)
- Chain of Thought Service (rag_solution/services/chain_of_thought_service.py)
- Answer Synthesizer (rag_solution/services/answer_synthesizer.py)
- Prompt Templates (database or hardcoded)
Technical Investigation Needed
Step 1: Identify Answer Synthesis Template
# Find where "Based on the analysis of" is generated
grep -r "Based on the analysis" backend/rag_solution/
grep -r "in the context of" backend/rag_solution/
# Check prompt templates in database
psql -d rag_modulo -c "SELECT template_type, template_content FROM prompt_templates WHERE template_type LIKE '%ANSWER%' OR template_type LIKE '%SYNTHESIS%';"
Step 2: Trace CoT Output Flow
# In chain_of_thought_service.py
def execute_chain_of_thought(...):
    # ... reasoning steps ...
    # CRITICAL: What gets returned here?
    return {
        'final_answer': clean_answer,       # ✅ Should only include this
        'reasoning_steps': cot_steps,       # ❌ Should NOT leak into response
        'keywords': extracted_keywords      # ❌ Should NOT leak into response
    }
Step 3: Check Answer Synthesizer
# In answer_synthesizer.py
def synthesize_final_answer(question, cot_output, context):
    # WRONG - includes internal reasoning
    prompt = f"Based on the analysis of {question} (in the context of {cot_output['keywords']}): ..."
    # RIGHT - clean separation
    prompt = f"Answer this question using only the provided context:\n\nQuestion: {question}\n\nContext: {context}"
Proposed Solutions
Option 1: Fix Answer Synthesis Template (Immediate Fix)
File: backend/rag_solution/services/answer_synthesizer.py
def synthesize_answer(self, question: str, context: str, cot_output: dict | None = None) -> str:
"""Synthesize final answer WITHOUT leaking CoT reasoning."""
# Use ONLY the clean context, ignore CoT internals
prompt_variables = {
'question': question,
'context': context,
# Do NOT include: cot_keywords, reasoning_steps, analysis, etc.
}
template = self.get_answer_template() # Clean template
return self.llm_provider.generate_text(
user_id=self.user_id,
prompt=context,
template=template,
variables=prompt_variables
)Option 2: Add CoT Output Filtering
def filter_cot_output(cot_result: dict) -> str:
"""Extract only the clean final answer from CoT output."""
# Return ONLY the synthesized answer, not the reasoning
return cot_result.get('final_answer', cot_result.get('answer', ''))Option 3: Improve Prompt Template
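At the integration point in search_service.py, the filter would sit between the CoT call and the response (sketch only; the call shapes follow Step 2 above):

cot_result = cot_service.execute_chain_of_thought(question=question, context=context)
answer = filter_cot_output(cot_result)  # reasoning_steps / keywords never reach the user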
Option 3: Improve Prompt Template
Create a new prompt template type: ANSWER_SYNTHESIS
You are a helpful assistant. Answer the user's question based ONLY on the provided context.
Rules:
- Use only information from the context
- Do not include reasoning, analysis, or meta-commentary
- Provide a direct, concise answer
- If the context doesn't contain the answer, state this clearly
Question: {{question}}
Context: {{context}}
Answer:
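The synthesizer would render this template with only the two placeholders, never CoT artifacts. A minimal rendering sketch (the actual lookup through the prompt_templates table and the template engine are assumptions):

def render_answer_prompt(template_content: str, question: str, context: str) -> str:
    # Only {{question}} and {{context}} are substituted - no cot_keywords, no reasoning_steps
    return template_content.replace("{{question}}", question).replace("{{context}}", context)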
Option 4: Disable CoT for Simple Queries
import re

def should_use_chain_of_thought(question: str) -> bool:
    """Determine if CoT is beneficial for this query."""
    # Disable CoT for simple factual queries
    simple_patterns = [
        r'^what (is|are|was|were)\s+',
        r'^(how much|how many)\s+',
        r'^when (did|was|were)\s+',
    ]
    if any(re.match(pattern, question.lower()) for pattern in simple_patterns):
        return False  # Simple query, no CoT needed
    return True
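With these patterns, the failing query from this issue would skip CoT entirely (illustrative check):

>>> should_use_chain_of_thought("What was the total amount spent on research, development, and engineering in 2022?")
False  # matches r'^what (is|are|was|were)\s+', so the question is answered directly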
Files to Investigate
- backend/rag_solution/services/answer_synthesizer.py - Primary suspect
- backend/rag_solution/services/chain_of_thought_service.py - CoT orchestration
- backend/rag_solution/services/search_service.py - Integration point
- Database: prompt_templates table - Check for malformed templates
Reproduction Steps
- Use collection: test-slate-768-dims-NEW (ID: 5eb82bd8-1fbd-454e-86d6-61199642757c)
- User ID: ee76317f-3b6f-4fea-8b74-56483731f58c
- Query: "What was the total amount spent on research, development, and engineering in 2022?"
- Observe garbage response with leaked CoT reasoning
Test Cases After Fix
Test 1: Simple Factual Query
Query: "What was IBM's revenue in 2021?"
Expected: Clean answer with revenue figure or "not available in context"
NOT: Internal reasoning, hallucinated conversations, or instructions
Test 2: Complex Multi-Part Query
Query: "How did Kyndryl separation impact IBM's financial performance and strategic focus?"
Expected: Structured answer addressing both parts
NOT: Leaked CoT keywords or analysis artifacts
Test 3: Query with No Answer
Query: "What was IBM's stock price on January 1, 2025?"
Expected: "This information is not available in the provided context."
NOT: Hallucinated data or circular reasoning
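A regression test sketch covering these cases (the search_service fixture and the result.answer attribute are assumptions; adapt to the actual test harness):

import pytest

LEAK_MARKERS = [
    "(in the context of",  # leaked CoT keyword dump
    "instruction:",        # internal instruction labels
    "response:",
]

@pytest.mark.parametrize("question", [
    "What was IBM's revenue in 2021?",
    "How did Kyndryl separation impact IBM's financial performance and strategic focus?",
    "What was IBM's stock price on January 1, 2025?",
])
def test_no_leaked_cot_in_response(search_service, question):
    # search_service is assumed to wrap SearchService against the reproduction collection
    result = search_service.search(question)
    for marker in LEAK_MARKERS:
        assert marker not in result.answer, f"leaked internal content: {marker!r}"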
Success Criteria
✅ Responses contain ONLY clean, user-facing content
✅ No leaked CoT reasoning: "(in the context of...)"
✅ No internal instructions: "instruction:", "response:"
✅ No hallucinated multi-turn conversations
✅ Response length appropriate to query complexity (100-300 tokens typical)
✅ All existing CoT tests still pass
Related Issues
- Issue #136: 🧠 Implement Chain of Thought (CoT) Reasoning for Enhanced RAG Search Quality (original feature)
- Issue #458: Redesign .env to Database Configuration Sync (.env ↔ DB)
- Issue #459: Fix: Reranker fails with missing template - needs fallback or template seeding
- Issue #460: Fix: ConversationMessageInput character limit too small for large responses
Priority
P0 - CRITICAL: Blocks production deployment. Must fix before any release.
Labels
bug, critical, search, chain-of-thought, prompt-engineering