Feature/prompt templates and lmstudio sdk #171
Conversation
Features:
- Prompt template support for embedding models (via --embedding-prompt-template)
- LM Studio SDK integration for automatic context length detection
- Hybrid token limit discovery (Ollama → LM Studio → Registry → Default)
- Client-side token truncation to prevent silent failures
- Automatic persistence of embedding_options to .meta.json

Implementation:
- Added _query_lmstudio_context_limit() with Node.js subprocess bridge
- Modified compute_embeddings_openai() to apply prompt templates before truncation
- Extended CLI with --embedding-prompt-template flag for build and search
- URL detection for LM Studio (port 1234 or lmstudio/lm.studio keywords)
- HTTP→WebSocket URL conversion for SDK compatibility

Tests:
- 60 passing tests across 5 test files
- Comprehensive coverage of prompt templates, LM Studio integration, and token handling
- Parametrized tests for maintainability and clarity
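For illustration, a minimal sketch of how the Ollama → LM Studio → Registry → Default fallback chain could be wired. Apart from the _query_lmstudio_context_limit() bridge named above, every helper and constant name below is a hypothetical stand-in, not the actual identifier in the codebase:

```python
# Illustrative sketch of the Ollama → LM Studio → Registry → Default fallback chain.
# Helper and constant names (other than the discovery order itself) are hypothetical.
from typing import Callable, Optional

DEFAULT_TOKEN_LIMIT = 512                          # conservative fallback (assumed value)
MODEL_TOKEN_REGISTRY = {"nomic-embed-text": 2048}  # example registry entry only


def discover_token_limit(
    model_name: str,
    base_url: str,
    query_ollama: Callable[[str, str], Optional[int]],
    query_lmstudio: Callable[[str, str], Optional[int]],
) -> int:
    """Try each live backend in order, then the registry, then the default."""
    for probe in (query_ollama, query_lmstudio):
        try:
            limit = probe(model_name, base_url)
        except Exception:
            limit = None  # a failing probe must never break the embedding build
        if limit:
            return limit
    return MODEL_TOKEN_REGISTRY.get(model_name, DEFAULT_TOKEN_LIMIT)
```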
Features:
- End-to-end integration tests for prompt template with EmbeddingGemma
- Integration tests for hybrid token limit discovery mechanism
- Tests verify real-world functionality with live services (LM Studio, Ollama)

Fixes:
- LM Studio SDK bridge now uses client.embedding.load() for embedding models
- Fixed NODE_PATH resolution to include npm global modules
- Fixed integration test to use WebSocket URL (ws://) for SDK bridge

Tests:
- test_prompt_template_e2e.py: 8 integration tests covering:
  - Prompt template prepending with LM Studio (EmbeddingGemma)
  - LM Studio SDK bridge for context length detection
  - Ollama dynamic token limit detection
  - Hybrid discovery fallback mechanism (registry, default)
- All tests marked with @pytest.mark.integration for selective execution
- Tests gracefully skip when services are unavailable

Documentation:
- Updated tests/README.md with integration test section
- Added prerequisites and running instructions
- Documented that prompt templates are ONLY for EmbeddingGemma
- Added integration marker to pyproject.toml

Test Results:
- All 8 integration tests passing with live services
- Confirmed prompt templates work correctly with EmbeddingGemma
- Verified LM Studio SDK bridge auto-detects context length (2048)
- Validated hybrid token limit discovery across all backends
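The graceful-skip behavior described above could look roughly like the pattern below. The helper, URL, and test names are illustrative rather than the actual contents of test_prompt_template_e2e.py; only the integration marker and LM Studio's default port 1234 come from this PR:

```python
# Sketch of the skip-when-unavailable pattern; names are illustrative.
import pytest
import requests

LMSTUDIO_MODELS_URL = "http://localhost:1234/v1/models"  # LM Studio's default port


def _service_available(url: str) -> bool:
    try:
        return requests.get(url, timeout=2).ok
    except requests.RequestException:
        return False


@pytest.mark.integration
@pytest.mark.skipif(
    not _service_available(LMSTUDIO_MODELS_URL), reason="LM Studio is not running"
)
def test_prompt_template_prepending_with_lmstudio():
    ...  # would exercise the live EmbeddingGemma endpoint here
```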
Extends prompt template functionality from OpenAI mode to Ollama for backend consistency.

Changes:
- Add provider_options parameter to compute_embeddings_ollama()
- Apply prompt template before token truncation (lines 1005-1011)
- Pass provider_options through compute_embeddings() call chain

Tests:
- test_ollama_embedding_with_prompt_template: Verifies templates work with Ollama
- test_ollama_prompt_template_affects_embeddings: Confirms embeddings differ with/without template
- Both tests pass with live Ollama service (2/2 passing)

Usage: leann build --embedding-mode ollama --embedding-prompt-template "query: " ...
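As a sketch of the ordering guarantee above (template first, truncation second). Aside from provider_options and the --embedding-prompt-template flag, the names here are assumptions, not the real implementation:

```python
# Sketch only: prepend the prompt template, then truncate to the token limit.
# truncate_to_token_limit and the option key are stand-ins for the real helpers.
def prepare_texts(texts, provider_options, token_limit, truncate_to_token_limit):
    template = (provider_options or {}).get("embedding_prompt_template", "")
    templated = [f"{template}{text}" for text in texts]                   # template first...
    return [truncate_to_token_limit(t, token_limit) for t in templated]  # ...then truncate
```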
Problem: SDK bridge called client.embedding.load(), which loaded models into LM Studio memory and bypassed JIT auto-evict settings, causing duplicate model instances to accumulate.

Root cause analysis (from Perplexity research):
- Explicit SDK load() commands are treated as "pinned" models
- JIT auto-evict only applies to models loaded reactively via API requests
- SDK-loaded models remain in memory until explicitly unloaded

Solutions implemented:
1. Add model.unload() after metadata query (line 243)
   - Load model temporarily to get context length
   - Unload immediately to hand control back to JIT system
   - Subsequent API requests trigger JIT load with auto-evict
2. Add token limit caching to prevent repeated SDK calls
   - Cache discovered limits in _token_limit_cache dict (line 48)
   - Key: (model_name, base_url), Value: token_limit
   - Prevents duplicate load/unload cycles within same process
   - Cache shared across all discovery methods (Ollama, SDK, registry)

Tests:
- TestTokenLimitCaching: 5 tests for cache behavior (integrated into test_token_truncation.py)
- Manual testing confirmed no duplicate models in LM Studio after fix
- All existing tests pass

Impact:
- Respects user's LM Studio JIT and auto-evict settings
- Reduces model memory footprint
- Faster subsequent builds (cached limits)
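A rough sketch of the caching behavior described above, assuming a per-process dict keyed by (model_name, base_url) as in the commit message; the function name and exact flow are illustrative:

```python
# Sketch of the per-process cache keyed by (model_name, base_url), as described above.
# The discover() callable stands in for the SDK/Ollama/registry lookup chain.
_token_limit_cache: dict[tuple[str, str], int] = {}


def get_cached_token_limit(model_name, base_url, discover):
    key = (model_name, base_url)
    if key not in _token_limit_cache:
        # discover() may temporarily load the model via the SDK to read its context
        # length, then unload it so LM Studio's JIT auto-evict stays in control.
        _token_limit_cache[key] = discover(model_name, base_url)
    return _token_limit_cache[key]
```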
Added comprehensive documentation for new optional embedding features:

Configuration Guide (docs/configuration-guide.md):
- New section: "Optional Embedding Features"
- Task-Specific Prompt Templates subsection:
  - Explains EmbeddingGemma use case with document/query prompts
  - CLI and Python API examples
  - Clear warnings about compatible vs incompatible models
  - References to GitHub issue #155 and HuggingFace blog
- LM Studio Auto-Detection subsection:
  - Prerequisites (Node.js + @lmstudio/sdk)
  - How auto-detection works (4-step process)
  - Benefits and optional nature clearly stated

FAQ (docs/faq.md):
- FAQ #2: When should I use prompt templates?
  - DO/DON'T guidance with examples
  - Links to detailed configuration guide
- FAQ #3: Why is LM Studio loading multiple copies?
  - Explains the JIT auto-evict fix
  - Troubleshooting steps if still seeing issues
- FAQ #4: Do I need Node.js and @lmstudio/sdk?
  - Clarifies it's completely optional
  - Lists benefits if installed
  - Installation instructions

Cross-references between documents for easy navigation between quick reference and detailed guides.
Task-specific models like EmbeddingGemma require different templates for indexing vs searching. Store both templates at build time and auto-apply query template during search with backward compatibility.
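As a hypothetical illustration of what persisting both templates could look like in .meta.json; the exact key names and schema are assumptions based on the CLI flags, not the real file format:

```python
# Hypothetical shape of the persisted embedding options; exact keys may differ.
import json

meta = {
    "embedding_options": {
        "embedding_prompt_template": "title: none | text: ",       # used at build time
        "query_prompt_template": "task: search result | query: ",  # auto-applied at search time
    }
}
with open("my-index.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```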
Merged redundant no-op tests, removed low-value implementation tests, consolidated parameterized CLI tests, and removed hanging over-mocked test. All tests pass with improved focus on behavioral testing.
Query templates were only applied in the fallback code path, not when using the embedding server (the default path). This meant stored query templates in index metadata were ignored during MCP and CLI searches.

Changes:
- Move template application to before any computation path (searcher_base.py:109-110)
- Add comprehensive tests for both server and fallback paths
- Consolidate tests into test_prompt_template_persistence.py

Tests verify:
- Template applied when using the embedding server
- Template applied in the fallback path
- Consistent behavior between both paths
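A sketch of the fix described above, with the stored query template applied once before either compute path is chosen; class, attribute, and method names are illustrative rather than the real searcher_base.py API:

```python
# Sketch only: apply the stored query template before choosing a compute path.
# Attribute and method names are illustrative, not the real searcher API.
class _SearcherSketch:
    def __init__(self, meta: dict):
        self.meta = meta  # parsed .meta.json contents

    def embed_query(self, query: str) -> list[float]:
        template = self.meta.get("embedding_options", {}).get("query_prompt_template")
        if template:
            query = f"{template}{query}"  # applied once, before any path is chosen
        if self._server_available():
            return self._embed_via_server(query)
        return self._embed_via_fallback(query)

    # The helpers below stand in for the real server / fallback plumbing.
    def _server_available(self) -> bool: ...
    def _embed_via_server(self, text: str) -> list[float]: ...
    def _embed_via_fallback(self, text: str) -> list[float]: ...
```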
- Remove unused imports
- Fix import ordering
- Remove unused variables
- Apply code formatting
Tests were failing in CI because compute_embeddings_openai() checks for OPENAI_API_KEY before using the mocked client. Added a monkeypatch to set a fake API key in the test fixture.
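A minimal sketch of such a fixture, assuming pytest's built-in monkeypatch; the fixture name and placeholder value are illustrative:

```python
# Sketch of the CI fix: give the environment a placeholder key so the check passes.
import pytest


@pytest.fixture(autouse=True)
def fake_openai_key(monkeypatch):
    # compute_embeddings_openai() checks OPENAI_API_KEY before touching the mocked
    # client, so CI needs a placeholder value; the fixture name here is illustrative.
    monkeypatch.setenv("OPENAI_API_KEY", "test-key-not-real")
```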
This is a really good PR, I love this, can't wait to try it myself.
I will merge this first, but @ww2283, could you open a new PR that briefly explains how to use this in the README? Then we can promote it on the main README page.
@yichuan-w that was quick :) I added a brief explanation in configuration-guide.md and faq.md on how to use it. I did a limited amount of testing, and the result is promising: with the template used, the ranking (not the scoring) from the EmbeddingGemma Q4 QAT variant matches Qwen embedding 0.6B Q8. The major difference (or benefit) is speed: text-embedding-embeddinggemma-300m-qat is about 4-5 times faster than text-embedding-qwen3-embedding-0.6b on my Mac M1 Pro when hosted in LM Studio. On my end I have tested quite a handful of embedding models; the goal is to strike a balance between performance and speed, because I use a hook on commit in Claude Code, so the embedding has to be fast. In short, the fastest model so far is text-embedding-embeddinggemma-300m-qat in LM Studio, followed by nomic-embed-text-v2 in Ollama. Yes, the hosting service matters, for unknown reasons. nomic-embed-text-v2 is an MoE model, which is why it is fast, but it has only a 512-token sequence length, which is really not very useful. If your team has an established benchmark, it might be appropriate to test the template-prepending approach first to see whether it is worth advertising in the main README.
That was excellent, good to know. I don't think we have a careful accuracy benchmark, but I agree with your experience! We can say that directly in the README. Right now the repo is totally community driven.
What does this PR do?
Summary
This PR adds two complementary features that enhance LEANN's embedding capabilities:
- Prompt template support for embedding models, so task-specific models like EmbeddingGemma can use different prompts for documents vs. queries
- LM Studio SDK integration for automatic context length detection, backed by hybrid token limit discovery (Ollama → LM Studio → Registry → Default)
Key Features
Prompt Templates
- New --embedding-prompt-template (build) and --query-prompt-template (search) CLI flags
- Templates are applied before token truncation and persisted to .meta.json, so the stored query template is auto-applied at search time
- Supported in both OpenAI-compatible and Ollama embedding modes; intended for task-specific models such as EmbeddingGemma
LM Studio SDK Bridge
- Node.js subprocess bridge (_query_lmstudio_context_limit()) auto-detects a model's context length
- Models are unloaded right after the metadata query so LM Studio's JIT auto-evict settings stay in control
- Discovered limits are cached per (model_name, base_url) to avoid repeated load/unload cycles
Bug Fixes
- Query templates are now applied before any computation path, so they take effect on the embedding server path (the default path) as well as the fallback path
- The SDK bridge no longer leaves pinned duplicate model instances in LM Studio memory
Testing
- 60 unit tests plus 8 end-to-end integration tests (@pytest.mark.integration) covering prompt templates, the LM Studio SDK bridge, Ollama token limit detection, and the hybrid discovery fallback
- Integration tests skip gracefully when live services (LM Studio, Ollama) are unavailable
Documentation
- docs/configuration-guide.md: new "Optional Embedding Features" section covering prompt templates and LM Studio auto-detection
- docs/faq.md: new entries on when to use prompt templates, the LM Studio JIT auto-evict fix, and the optional Node.js + @lmstudio/sdk dependency
- tests/README.md: integration test prerequisites and running instructions
Files Changed
Migration Guide
No breaking changes. Existing indexes work as-is. New features are opt-in via CLI flags.
Example usage:

```bash
# Build with EmbeddingGemma (task-specific templates)
leann build my-index ./docs \
  --embedding-mode ollama \
  --embedding-model embeddinggemma \
  --embedding-prompt-template "title: none | text: " \
  --query-prompt-template "task: search result | query: "

# Search automatically applies the query template
leann search my-index "How does LEANN work?"
```
Related Issues
Fulfills #155
Checklist
- Tests pass (uv run pytest)
- Code is formatted and linted (ruff format and ruff check)
- Pre-commit hooks pass (pre-commit run --all-files)