
M21: Token optimization -- reduce API token consumption #336

@bug-ops

Description


Problem

Usage analysis shows catastrophic token waste: ~1.9M input tokens vs. ~9.8K output tokens (a ~200:1 ratio), with zero cache hits across all requests.

Root Causes

  1. No prompt caching -- ClaudeProvider sends system prompt + skills as plain text every request. Anthropic prompt caching API is not used.
  2. Tool loop amplification -- each iteration of process_response_native_tools resends the full message history. With 10 iterations, that is 10x the system prompt + history.
  3. LLM-based summarization uses primary model -- summarize_tool_output and compact_context make separate Claude API calls.
  4. Bloated system prompt -- rebuild_system_prompt injects skills + catalog + environment + tool catalog + MCP prompt + project configs + repo map.
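Root cause 1 can be addressed with the Messages API's `cache_control` field, which marks the static system prompt as a cacheable prefix. A minimal sketch of the request shape follows; the `cacheable_system_block` helper is illustrative (a real builder in `claude.rs` would use `serde_json` and proper escaping), but the `cache_control`/`ephemeral` field names match the documented API:

```rust
// Sketch: marking the static system prompt as a cacheable content block in
// an Anthropic Messages API request body. The helper is hypothetical;
// string escaping is omitted for brevity.
fn cacheable_system_block(system_prompt: &str) -> String {
    format!(
        r#"{{"type":"text","text":"{}","cache_control":{{"type":"ephemeral"}}}}"#,
        system_prompt
    )
}

fn main() {
    let block = cacheable_system_block("You are Zeph, a coding agent.");
    // The static prefix (system prompt + skills + tool catalog) should carry
    // the cache marker; per-turn messages should not.
    assert!(block.contains(r#""cache_control":{"type":"ephemeral"}"#));
    println!("{block}");
}
```

Because cached prefix reads are billed at a fraction of the normal input rate, this directly attacks both the 200:1 ratio and the tool-loop amplification (the cached prefix is reused on every iteration).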

Estimated Impact

| Optimization | Token Reduction | Effort |
|---|---|---|
| Prompt caching | 80-90% | Medium |
| Local model for summarization | Eliminates extra API calls | Low |
| Aggressive context pruning | 30-50% of history | Low |
| Usage metrics | Observability | Low |
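The context-pruning row could be implemented as a token-budgeted tail of the message history. A sketch under stated assumptions: the `prune_history` name is illustrative (not from `context.rs`), and tokens are estimated at roughly 4 characters each rather than counted exactly:

```rust
// Sketch: aggressive history pruning. Walks the history from newest to
// oldest and keeps messages until an estimated token budget is exhausted,
// so the most recent turns always survive. Names are hypothetical.
fn estimate_tokens(text: &str) -> usize {
    // Rough heuristic: ~4 characters per token, rounded up.
    text.chars().count() / 4 + 1
}

fn prune_history(history: &[String], budget: usize) -> Vec<String> {
    let mut kept = Vec::new();
    let mut used = 0;
    // Iterate backwards so the newest messages are kept first.
    for msg in history.iter().rev() {
        let cost = estimate_tokens(msg);
        if used + cost > budget {
            break;
        }
        used += cost;
        kept.push(msg.clone());
    }
    kept.reverse(); // restore chronological order
    kept
}

fn main() {
    let history: Vec<String> = (0..10).map(|i| format!("message number {i}")).collect();
    let pruned = prune_history(&history, 20);
    assert!(pruned.len() < history.len());
    assert_eq!(pruned.last(), history.last());
    println!("kept {} of {} messages", pruned.len(), history.len());
}
```

A production version would also pin the system message and any tool-call/result pairs that must stay adjacent, rather than truncating purely by recency.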

Phases

Architecture

See `.local/plan/m21-token-optimization.md`

Key Files

  • `crates/zeph-llm/src/claude.rs`
  • `crates/zeph-core/src/agent/streaming.rs`
  • `crates/zeph-core/src/agent/context.rs`
  • `crates/zeph-llm/src/provider.rs`
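For root cause 3, summarization could be routed through a cheap local implementation of a summarizer abstraction instead of the primary Claude provider. The `Summarizer` trait and `TruncatingSummarizer` below are hypothetical placeholders, not the actual trait in `crates/zeph-llm/src/provider.rs`:

```rust
// Sketch: routing tool-output summarization away from the primary model.
// Both names here are illustrative, not from the zeph codebase.
trait Summarizer {
    fn summarize(&self, text: &str, max_chars: usize) -> String;
}

// Trivial zero-cost fallback: head truncation. A real local implementation
// would call an on-device model instead of making a Claude API call.
struct TruncatingSummarizer;

impl Summarizer for TruncatingSummarizer {
    fn summarize(&self, text: &str, max_chars: usize) -> String {
        text.chars().take(max_chars).collect()
    }
}

fn main() {
    let s = TruncatingSummarizer;
    let out = s.summarize("very long tool output from a shell command", 9);
    assert_eq!(out, "very long");
    println!("{out}");
}
```

The design point is that `summarize_tool_output` and `compact_context` depend on the trait, so swapping providers eliminates the extra primary-model API calls without touching the agent loop.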

Sub-issues

Metadata

Labels: enhancement (New feature or request), epic (Milestone-level tracking issue)