-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Adopt LLM Best Practices: Native Markdown Output Instead of HTML Conversion
Problem Statement
Current Implementation (Post-Issue #655):
LLM → HTML output → html2text conversion → Markdown → Frontend rendering
Issues with Current Approach:
- Wrong output format: Asking LLMs to generate HTML, which they're not optimized for
- Extra processing: Conversion step adds 100-200ms latency and potential errors
- Lost semantic meaning: HTML tags don't convey intent to LLM during generation
- No validation: Cannot validate HTML structure before conversion
- Inconsistency: LLMs sometimes return HTML, sometimes plain text, sometimes mixed
Industry Best Practices
What Leading AI Companies Do
OpenAI (from their documentation):
"For formatted output, request Markdown. Models perform better with Markdown than HTML."
Anthropic (from Claude documentation):
"Claude excels at Markdown formatting. For tables, use pipe-delimited syntax."
LangChain (from best practices):
"Prefer Markdown for LLM outputs. Use Pydantic models for structured data."
Why Markdown Works Better
Modern LLMs are trained on:
- ✅ Markdown (GitHub, StackOverflow, documentation sites)
- ✅ Plain text (books, articles, conversations)
- ✅ Code (with proper syntax highlighting)
They are NOT optimized for HTML generation because:
- ❌ HTML is verbose (
<table><tr><td>vs|---|) - ❌ Requires strict syntax (
<p>text</p>closing tags) - ❌ Semantic tags don't help LLM understand formatting intent
Proposed Solution
Approach: Native Markdown Output (Recommended)
New Flow:
LLM → Native Markdown → Frontend rendering (direct)
Benefits:
- ✅ No conversion step (100-200ms faster)
- ✅ LLM-native format (90-95% consistency vs 60-70% current)
- ✅ Simpler codebase (remove html2text dependency)
- ✅ Better validation (can validate Markdown syntax pre-response)
- ✅ More predictable output
Implementation Plan
Phase 1: Prompt Engineering (No Code Changes)
Timeline: 1-2 hours
Effort: Low
Expected Improvement: 70-80% of queries use proper Markdown
Changes:
- Update default system prompts to explicitly request Markdown
- Add formatting examples to prompt templates
- Test with sample queries
Example Updated System Prompt:
system_prompt = """You are a RAG search assistant.
CRITICAL FORMATTING RULES:
1. ALWAYS use Markdown formatting in your responses
2. NEVER use HTML tags
3. Structure your output as follows:
For tables (revenue, metrics, comparisons):
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data | Data | Data |
For lists:
- Bullet point 1
- Bullet point 2
For emphasis:
**bold** for key terms
*italic* for context
For code/technical terms:
`inline code` orcode block
For headings:
## Section Title
### Subsection
EXAMPLE OUTPUT:
Based on the provided documents, IBM revenue changed as follows:
| Year | Revenue | Change |
|------|---------|--------|
| 2021 | $57.4B | +0.3% |
| 2022 | $60.5B | +5.4% |
| 2023 | $61.9B | +2.3% |
**Key Finding**: Revenue showed consistent growth with strongest performance in 2022.
"""
Files to Update (via UI/API, no deployment needed):
- User prompt templates in PostgreSQL database
- Access via
/api/v1/prompts/endpoints
Phase 2: Remove HTML Conversion (Code Changes)
Timeline: 2-4 hours
Effort: Medium
Expected Improvement: 90-95% consistency, faster processing
Changes Required:
1. Simplify _clean_generated_answer() Method
File: backend/rag_solution/services/search_service.py (lines 353-439)
BEFORE (Current - with HTML conversion):
def _clean_generated_answer(self, answer: str) -> str:
"""Clean generated answer by removing artifacts and duplicates.
Removes:
- " AND " artifacts from query rewriting
- Duplicate consecutive words
- Leading/trailing whitespace
Converts:
- HTML formatting to Markdown (tables, bold, italic, links, lists, etc.)
"""
import re
import html2text
cleaned = answer.strip()
# Convert HTML to Markdown if HTML tags detected
if "<" in cleaned and ">" in cleaned:
html_patterns = [...] # Long list of HTML patterns
has_html = any(re.search(pattern, cleaned, re.IGNORECASE) for pattern in html_patterns)
if has_html:
h = html2text.HTML2Text()
h.body_width = 0
# ... many configuration lines
cleaned = h.handle(cleaned)
# Remove artifacts and duplicates
# ... rest of methodAFTER (Simplified - LLMs output Markdown directly):
def _clean_generated_answer(self, answer: str) -> str:
"""Clean generated answer by removing artifacts only.
Removes:
- " AND " artifacts from query rewriting
- Duplicate consecutive words
- Leading/trailing whitespace
Note: No HTML conversion needed - LLMs output native Markdown.
"""
import re
cleaned = answer.strip()
# Remove " AND " artifacts from query rewriting
cleaned = re.sub(r"\s+AND\s+", " ", cleaned)
cleaned = re.sub(r"\s+AND$", "", cleaned)
# Remove duplicate consecutive words
words = cleaned.split()
deduplicated_words = []
prev_word = None
for word in words:
if not prev_word or word.lower() != prev_word.lower():
deduplicated_words.append(word)
prev_word = word
result = " ".join(deduplicated_words)
result = re.sub(r"\s+", " ", result).strip()
return result
# No HTML conversion - LLM outputs Markdown natively2. Update Default Prompt Templates
File: backend/rag_solution/schemas/prompt_template_schema.py (lines 10-24)
BEFORE (Current):
DEFAULT_STRUCTURED_OUTPUT_TEMPLATE = """Question: {question}
Context Documents:
{context}
Please provide a structured answer with:
1. A clear, concise answer to the question
2. A confidence score (0.0-1.0) based on the quality and relevance of the sources
3. Citations to specific documents that support your answer
"""AFTER (With explicit Markdown instructions):
DEFAULT_STRUCTURED_OUTPUT_TEMPLATE = """Question: {question}
Context Documents:
{context}
RESPONSE FORMAT REQUIREMENTS:
- Use Markdown formatting exclusively (NO HTML)
- For quantitative data (revenue, statistics, comparisons), use Markdown tables
- Use **bold** for key findings
- Use bullet lists for multiple points
- Keep paragraphs concise (3-4 sentences max)
EXAMPLE MARKDOWN TABLE:
| Year | Revenue | Change |
|------|---------|--------|
| 2021 | $57.4B | +0.3% |
| 2022 | $60.5B | +5.4% |
Please provide:
1. Clear Markdown-formatted answer
2. Confidence score (0.0-1.0)
3. Citations with document_id and relevant excerpts
"""3. Remove html2text Dependency
File: pyproject.toml
REMOVE:
html2text = "^2025.4.15" # No longer neededRun:
poetry remove html2text
poetry lockPhase 3: Enable Structured Output by Default (Optional)
Timeline: 4-8 hours
Effort: High
Expected Improvement: 99% consistency, provider-level validation
Changes Required:
1. Enable Structured Output by Default
File: backend/core/config.py
# Change default from False to True
structured_output_enabled: bool = Field(default=True) # Was False2. Enhance Provider Implementations
File: backend/rag_solution/generation/providers/openai.py
def generate_structured_output(self, prompt: str) -> StructuredAnswer:
"""Generate structured output using OpenAI's native JSON schema support."""
response = self.client.chat.completions.create(
model=self.model_name,
messages=[
{
"role": "system",
"content": "Respond with Markdown formatting. Use tables for quantitative data."
},
{"role": "user", "content": prompt}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "rag_answer",
"schema": StructuredAnswer.model_json_schema()
}
}
)
return StructuredAnswer.model_validate_json(response.choices[0].message.content)File: backend/rag_solution/generation/providers/anthropic.py
def generate_structured_output(self, prompt: str) -> StructuredAnswer:
"""Generate structured output using Anthropic's XML tag approach."""
system_prompt = """
Respond using Markdown formatting for the answer content.
Structure your response in the following XML format:
<answer>Your Markdown-formatted answer here</answer>
<confidence>0.95</confidence>
<citations>
<citation>
<document_id>uuid</document_id>
<excerpt>relevant text</excerpt>
</citation>
</citations>
"""
# Implementation for XML parsing into StructuredAnswerBefore vs After Comparison
Current Flow (HTML Approach)
System Prompt:
"Answer the question based on the provided context."
LLM Output (inconsistent):
<p>IBM revenue changed as follows:</p>
<table>
<tr><th>Year</th><th>Revenue</th></tr>
<tr><td>2021</td><td>$57.4B</td></tr>
<tr><td>2022</td><td>$60.5B</td></tr>
</table>Processing:
HTML → html2text (100-200ms) → Markdown → Frontend
Problems:
- Extra latency from conversion
- Potential conversion errors
- LLM not optimized for HTML
- 60-70% consistency
Proposed Flow (Markdown-Native)
System Prompt:
"Answer using Markdown formatting. Use tables for quantitative data."
LLM Output (consistent):
IBM revenue changed as follows:
| Year | Revenue |
|------|---------|
| 2021 | $57.4B |
| 2022 | $60.5B |
| 2023 | $61.9B |
**Key insight**: Revenue grew 7.8% over 3 years.Processing:
Markdown → Frontend (direct rendering, 0ms overhead)
Benefits:
- No conversion latency
- Consistent format
- LLM-native approach
- 90-95% consistency
Performance Comparison
| Metric | Current (HTML) | Proposed (Markdown) | Improvement |
|---|---|---|---|
| LLM Optimization | Low (not trained on HTML) | High (trained on Markdown) | ✅ |
| Consistency | 60-70% | 90-95% | +30-35% |
| Processing Time | +100-200ms (conversion) | 0ms (direct) | -100-200ms |
| Code Complexity | High (html2text + config) | Low (simple cleaning) | ✅ Simpler |
| Dependencies | html2text library | None | ✅ Removed |
| Validation | Post-conversion | Pre-response | ✅ Earlier |
| Error Handling | Complex (HTML parsing) | Simple (text validation) | ✅ Easier |
Testing Plan
Phase 1 Testing (Prompt Engineering)
- Update prompt templates via UI/API
- Test queries:
- "How did IBM revenue change over the years?" (expects table)
- "What are the key features of product X?" (expects bullet list)
- "Explain the architecture" (expects headings + paragraphs)
- Measure:
- Markdown compliance rate (target: 70-80%)
- Format correctness (tables, lists, headings)
- Validate across all LLM providers (OpenAI, Anthropic, WatsonX)
Phase 2 Testing (Code Changes)
- Unit tests:
- Test
_clean_generated_answer()with Markdown input - Verify artifact removal still works
- Ensure no HTML conversion attempted
- Test
- Integration tests:
- Full search flow with Markdown output
- Verify ReactMarkdown rendering
- Test with various formatting (tables, lists, code blocks)
- Regression tests:
- Ensure existing queries still work
- Verify no performance degradation
- Check all three LLM providers
Phase 3 Testing (Structured Output)
- Provider-specific tests:
- OpenAI JSON schema validation
- Anthropic XML parsing
- WatsonX structured response
- Schema validation tests:
- Pydantic model validation
- Error handling for malformed responses
- Confidence score validation
- End-to-end tests:
- Full search with structured output
- Citation attribution
- Metadata extraction
Success Metrics
Phase 1 Success Criteria
- 70-80% of LLM responses use proper Markdown formatting
- Tables render correctly in frontend (ReactMarkdown)
- No degradation in answer quality
- User satisfaction maintained or improved
Phase 2 Success Criteria
- 90-95% Markdown formatting consistency
- 100-200ms latency improvement (no HTML conversion)
- Zero HTML conversion errors
- Code complexity reduced (fewer lines, no html2text)
- All existing tests pass
Phase 3 Success Criteria
- 99% structured output compliance
- Provider-level validation working (OpenAI, Anthropic, WatsonX)
- Zero parsing errors
- Structured citations properly attributed
Rollback Plan
Phase 1 (Prompt Engineering)
- Rollback: Update prompts back to generic version via UI
- Risk: Low (no code changes)
- Time: 5 minutes
Phase 2 (Code Changes)
- Rollback:
git revertcommit, redeploy with html2text - Risk: Medium (code changes)
- Time: 15-30 minutes
Phase 3 (Structured Output)
- Rollback: Set
structured_output_enabled: falsein config - Risk: Medium (provider changes)
- Time: 5 minutes (config), 30 minutes (full revert)
Dependencies
Removals
- ❌
html2textlibrary (no longer needed after Phase 2)
Additions
- None (using existing capabilities)
Affected Services
SearchService(search_service.py)ConversationService(conversation_service.py)PromptTemplateService(prompt_template_service.py)- All LLM providers (openai.py, anthropic.py, watsonx.py)
Related Issues & Documentation
Related Issues
- Performance & UX Improvements: Search Speed, Table Formatting, and Prompt Hot-Reload #655 - Performance & UX Improvements (HTML→Markdown conversion implemented)
Documentation to Update
docs/api/search_api.md- Update prompt examplesdocs/development/backend/index.md- Update service architectureREADME.md- Update formatting capabilities
Industry References
Implementation Timeline
Week 1: Phase 1 (Prompt Engineering)
- Day 1: Update default prompt templates
- Day 2-3: Test with sample queries across providers
- Day 4-5: Iterate based on results, measure consistency
Week 2: Phase 2 (Code Changes)
- Day 1: Simplify
_clean_generated_answer()method - Day 2: Update prompt template constants
- Day 3: Remove html2text dependency, update tests
- Day 4-5: Integration testing, performance validation
Week 3: Phase 3 (Optional - Structured Output)
- Day 1-2: Enhance provider implementations
- Day 3-4: Schema validation and error handling
- Day 5: End-to-end testing and documentation
Conclusion
Key Insight: Stop asking LLMs to do what they're not good at (HTML generation) and leverage what they excel at (Markdown formatting).
Recommended Approach:
- ✅ Start with Phase 1 (Prompt Engineering) - Quick win, no code changes
- ✅ Then Phase 2 (Remove HTML Conversion) - Cleaner code, better performance
- ⏭️ Optional Phase 3 (Structured Output) - Maximum reliability for production scale
Expected Overall Improvement:
- 30-35% better formatting consistency (60% → 90-95%)
- 100-200ms faster response times (no conversion)
- Simpler codebase (remove html2text dependency)
- Better user experience (consistent, well-formatted responses)
Labels
enhancementperformancellm-optimizationmarkdownprompt-engineering
Milestone
- Version 0.9.0
Assignees
TBD