Description
Problem
Chain of Thought (CoT) internal reasoning is leaking into final user-facing responses, producing garbage output with hallucinated content, internal instructions, and bloated responses.
Example Query
User asks: "What was the total amount spent on research, development, and engineering in 2022?"
Expected response:
The provided context does not include specific R&D expenditure figures for 2022.
To obtain this information, please refer to IBM's Form 10-K for fiscal year 2022.
Actual response (excerpts):
Based on the analysis of What was the total amount spent on research, development,
and engineering in 2022? (in the context of Historically, one significant action,
Such investments, Primary Business Units, Describe, Such, infrastructure services
into Kyndryl, Securities, November, Research, this commitment, expert workforce,
new what, Ensure your response, Business Units, Emphasis, to foster advancements in,
Annual Results, Form, Transparency, s commitment to innovation...)
[THEN INCLUDES HALLUCINATED CONTENT:]
User:
How did IBM's strategic decision to separate its managed infrastructure services
into Kyndryl impact its financial performance...
AI:
The strategic separation of IBM's managed infrastructure services into Kyndryl...
instruction:
Based on the provided context, formulate a question that requires understanding...
response:
Considering IBM's realignment strategy through the creation of Kyndryl...
question: based on the context provided, which of the following best describes
ibm's strategic realignment post-kyndryl separation?
a) a shift away from technology services entirely...
b) a strategic pivot towards higher-value segments...
Response characteristics:
- ❌ Leaked CoT reasoning: "(in the context of These, Here, Generate...)"
- ❌ Hallucinated multi-turn conversations
- ❌ Internal instructions visible: "instruction:", "response:"
- ❌ Multiple choice questions generated unprompted
- ❌ Response length: 1,716 tokens (should be ~100-200)
- ❌ Completely unusable for end users
Root Cause Analysis
1. CoT Reasoning Contamination
The CoT service is generating internal reasoning that's being passed directly to the final response instead of being filtered out.
Expected flow:
User Query → CoT Reasoning (internal) → Clean Answer Synthesis → User Response
Actual flow:
User Query → CoT Reasoning → DUMP EVERYTHING → User Response (garbage)
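A minimal sketch of the intended wiring, assuming the CoT result dict shape shown in Step 2 below and the synthesizer signature proposed in Option 1 (service objects and names here are illustrative, not the current implementation):

def handle_query(cot_service, synthesizer, question: str, context: str) -> str:
    """Sketch of the expected flow: CoT output stays internal to the pipeline."""
    cot_result = cot_service.execute_chain_of_thought(question=question, context=context)
    # reasoning_steps and keywords never leave the CoT layer; only the clean answer does
    return synthesizer.synthesize_answer(
        question=question,
        context=context,
        cot_output={"final_answer": cot_result["final_answer"]},
    )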
2. Prompt Template Issues
File: backend/rag_solution/services/answer_synthesizer.py
The "Based on the analysis of" prefix suggests the answer synthesis template is incorrectly including CoT reasoning in the final prompt to the LLM.
Suspected problematic pattern:
# WRONG - includes CoT reasoning in final answer
final_prompt = f"Based on the analysis of {question} (in the context of {cot_keywords}): {context}"
# RIGHT - should use only clean context
final_prompt = f"Answer the following question based on the provided context:\n\nQuestion: {question}\n\nContext: {context}"3. LLM Hallucination Amplification
When the prompt includes malformed CoT reasoning, the LLM hallucinates:
- Multi-turn conversations that never happened
- Internal instructions ("instruction:", "response:", "user:", "ai:")
- Multiple choice questions
- Repeated content and circular reasoning
Impact
Severity: CRITICAL 🔴
- User Experience: Completely broken - responses are unusable
- Production Readiness: System cannot be deployed
- Data Quality: Hallucinated content misleads users
- Token Waste: 1,716 tokens for what should be 100-200 tokens
- Trust: Users cannot trust any response
Affected Components
- Search Service (rag_solution/services/search_service.py)
- Chain of Thought Service (rag_solution/services/chain_of_thought_service.py)
- Answer Synthesizer (rag_solution/services/answer_synthesizer.py)
- Prompt Templates (database or hardcoded)
Technical Investigation Needed
Step 1: Identify Answer Synthesis Template
# Find where "Based on the analysis of" is generated
grep -r "Based on the analysis" backend/rag_solution/
grep -r "in the context of" backend/rag_solution/
# Check prompt templates in database
psql -d rag_modulo -c "SELECT template_type, template_content FROM prompt_templates WHERE template_type LIKE '%ANSWER%' OR template_type LIKE '%SYNTHESIS%';"
Step 2: Trace CoT Output Flow
# In chain_of_thought_service.py
def execute_chain_of_thought(...):
    # ... reasoning steps ...
    # CRITICAL: What gets returned here?
    return {
        'final_answer': clean_answer,       # ✅ Should only include this
        'reasoning_steps': cot_steps,       # ❌ Should NOT leak into response
        'keywords': extracted_keywords      # ❌ Should NOT leak into response
    }
Step 3: Check Answer Synthesizer
# In answer_synthesizer.py
def synthesize_final_answer(question, cot_output, context):
    # WRONG - includes internal reasoning
    prompt = f"Based on the analysis of {question} (in the context of {cot_output['keywords']}): ..."
    # RIGHT - clean separation
    prompt = f"Answer this question using only the provided context:\n\nQuestion: {question}\n\nContext: {context}"
Proposed Solutions
Option 1: Fix Answer Synthesis Template (Immediate Fix)
File: backend/rag_solution/services/answer_synthesizer.py
def synthesize_answer(self, question: str, context: str, cot_output: dict | None = None) -> str:
"""Synthesize final answer WITHOUT leaking CoT reasoning."""
# Use ONLY the clean context, ignore CoT internals
prompt_variables = {
'question': question,
'context': context,
# Do NOT include: cot_keywords, reasoning_steps, analysis, etc.
}
template = self.get_answer_template() # Clean template
return self.llm_provider.generate_text(
user_id=self.user_id,
prompt=context,
template=template,
variables=prompt_variables
)Option 2: Add CoT Output Filtering
def filter_cot_output(cot_result: dict) -> str:
"""Extract only the clean final answer from CoT output."""
# Return ONLY the synthesized answer, not the reasoning
return cot_result.get('final_answer', cot_result.get('answer', ''))Option 3: Improve Prompt Template
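At the integration point in search_service.py, the filter would sit between the CoT call and the response (sketch only; the call shapes follow Step 2 above):

cot_result = cot_service.execute_chain_of_thought(question=question, context=context)
answer = filter_cot_output(cot_result)  # reasoning_steps / keywords never reach the user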
Option 3: Improve Prompt Template
Create a new prompt template type: ANSWER_SYNTHESIS
You are a helpful assistant. Answer the user's question based ONLY on the provided context.
Rules:
- Use only information from the context
- Do not include reasoning, analysis, or meta-commentary
- Provide a direct, concise answer
- If the context doesn't contain the answer, state this clearly
Question: {{question}}
Context: {{context}}
Answer:
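The synthesizer would render this template with only the two placeholders, never CoT artifacts. A minimal rendering sketch (the actual lookup through the prompt_templates table and the template engine are assumptions):

def render_answer_prompt(template_content: str, question: str, context: str) -> str:
    # Only {{question}} and {{context}} are substituted - no cot_keywords, no reasoning_steps
    return template_content.replace("{{question}}", question).replace("{{context}}", context)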
Option 4: Disable CoT for Simple Queries
import re

def should_use_chain_of_thought(question: str) -> bool:
    """Determine if CoT is beneficial for this query."""
    # Disable CoT for simple factual queries
    simple_patterns = [
        r'^what (is|are|was|were)\s+',
        r'^(how much|how many)\s+',
        r'^when (did|was|were)\s+',
    ]
    if any(re.match(pattern, question.lower()) for pattern in simple_patterns):
        return False  # Simple query, no CoT needed
    return True
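With these patterns, the failing query from this issue would skip CoT entirely (illustrative check):

>>> should_use_chain_of_thought("What was the total amount spent on research, development, and engineering in 2022?")
False  # matches r'^what (is|are|was|were)\s+', so the question is answered directly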
Files to Investigate
- backend/rag_solution/services/answer_synthesizer.py - Primary suspect
- backend/rag_solution/services/chain_of_thought_service.py - CoT orchestration
- backend/rag_solution/services/search_service.py - Integration point
- Database: prompt_templates table - Check for malformed templates
Reproduction Steps
- Use collection: test-slate-768-dims-NEW (ID: 5eb82bd8-1fbd-454e-86d6-61199642757c)
- User ID: ee76317f-3b6f-4fea-8b74-56483731f58c
- Query: "What was the total amount spent on research, development, and engineering in 2022?"
- Observe garbage response with leaked CoT reasoning
Test Cases After Fix
Test 1: Simple Factual Query
Query: "What was IBM's revenue in 2021?"
Expected: Clean answer with revenue figure or "not available in context"
NOT: Internal reasoning, hallucinated conversations, or instructions
Test 2: Complex Multi-Part Query
Query: "How did Kyndryl separation impact IBM's financial performance and strategic focus?"
Expected: Structured answer addressing both parts
NOT: Leaked CoT keywords or analysis artifacts
Test 3: Query with No Answer
Query: "What was IBM's stock price on January 1, 2025?"
Expected: "This information is not available in the provided context."
NOT: Hallucinated data or circular reasoning
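A regression test sketch covering these cases (the search_service fixture and the result.answer attribute are assumptions; adapt to the actual test harness):

import pytest

LEAK_MARKERS = [
    "(in the context of",  # leaked CoT keyword dump
    "instruction:",        # internal instruction labels
    "response:",
]

@pytest.mark.parametrize("question", [
    "What was IBM's revenue in 2021?",
    "How did Kyndryl separation impact IBM's financial performance and strategic focus?",
    "What was IBM's stock price on January 1, 2025?",
])
def test_no_leaked_cot_in_response(search_service, question):
    # search_service is assumed to wrap SearchService against the reproduction collection
    result = search_service.search(question)
    for marker in LEAK_MARKERS:
        assert marker not in result.answer, f"leaked internal content: {marker!r}"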
Success Criteria
✅ Responses contain ONLY clean, user-facing content
✅ No leaked CoT reasoning: "(in the context of...)"
✅ No internal instructions: "instruction:", "response:"
✅ No hallucinated multi-turn conversations
✅ Response length appropriate to query complexity (100-300 tokens typical)
✅ All existing CoT tests still pass
Related Issues
- Issue #136: 🧠 Implement Chain of Thought (CoT) Reasoning for Enhanced RAG Search Quality (original feature)
- Issue #458: Redesign .env to Database Configuration Sync (.env ↔ DB)
- Issue #459: Fix: Reranker fails with missing template - needs fallback or template seeding
- Issue #460: Fix: ConversationMessageInput character limit too small for large responses
Priority
P0 - CRITICAL: Blocks production deployment. Must fix before any release.
Labels
bug, critical, search, chain-of-thought, prompt-engineering