Improve token estimation accuracy for multibyte text #742

@bug-ops

Description

Parent: #740 (P0)

Problem

The current estimate_tokens() uses a bytes/3 heuristic, which overestimates token counts on multibyte text (Cyrillic, CJK) by up to 2-3x, since those scripts take 2-3 bytes per character in UTF-8. The inflated estimates cause premature context compaction and inaccurate budget allocation.

Solution

  • Replace text.len() / 3 with text.chars().count() / 4
  • Add configurable safety margin (default 1.0, recommended 1.2 for production)
  • Optionally support tiktoken-rs behind a feature flag for precise counting with cloud providers
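A minimal sketch of the proposed heuristic. The signature and the `safety_margin` parameter are assumptions for illustration; the actual `estimate_tokens` in zeph-memory may differ.

```rust
/// Hypothetical sketch: char-based token estimate with a safety margin.
/// `chars().count()` counts Unicode scalar values, so a Cyrillic or CJK
/// character contributes 1 instead of the 2-3 bytes it occupies in UTF-8.
fn estimate_tokens(text: &str, safety_margin: f32) -> usize {
    let base = text.chars().count() / 4;
    (base as f32 * safety_margin).ceil() as usize
}

fn main() {
    // "привет мир" is 19 bytes but only 10 chars:
    // old heuristic: 19 / 3 = 6 tokens; new heuristic: 10 / 4 = 2 tokens.
    let ru = "привет мир";
    assert_eq!(ru.len(), 19);
    assert_eq!(ru.chars().count(), 10);
    println!("estimate: {}", estimate_tokens(ru, 1.2));
}
```

With the 1.2 production margin, the Cyrillic example above rounds up from 2 to 3 tokens, trading a little headroom for compaction safety.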

Affected crates

  • zeph-memory (estimate_tokens function)
  • zeph-core (all call sites)

Acceptance criteria

  • Estimation accuracy within 20% on mixed ASCII/Cyrillic/CJK text
  • Safety margin configurable via memory.token_safety_margin
  • Existing tests updated
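The config key could look like the following, assuming a TOML config file with a `[memory]` table; only the key name comes from the acceptance criteria, the surrounding structure is assumed.

```toml
[memory]
# Multiplier applied on top of the chars/4 estimate; 1.0 = no margin,
# 1.2 recommended for production.
token_safety_margin = 1.2
```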
