fix(kb): added tiktoken for embedding token estimation #1616
Conversation
Greptile Overview
Summary
Replaces token estimation heuristics with tiktoken for accurate token counting in embeddings and chunking. This ensures compliance with OpenAI's 8,191 token limit per embedding request.
Key changes:
- Integrated tiktoken library for precise token counting matching OpenAI's behavior
- Replaced fixed batch sizes (50 items) with token-aware batching (8,000 tokens/batch)
- Reduced JSON chunk sizes from 2000→1000 tokens (target) and 3000→1500 tokens (max) for safer margins
- Added support for JSON/YAML file uploads
- Added fallback to estimation when tiktoken fails
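For reference, a minimal sketch of what tiktoken-backed counting with an estimation fallback can look like (assuming the WASM `tiktoken` npm package and the `cl100k_base` encoding used by OpenAI's embedding models; the function name mirrors the diagram below, other details are illustrative):

```ts
import { get_encoding } from "tiktoken";

// Sketch only: count tokens the same way OpenAI does, falling back to a
// rough heuristic if the WASM encoder cannot be loaded.
export function getAccurateTokenCount(text: string): number {
  try {
    const encoding = get_encoding("cl100k_base");
    try {
      return encoding.encode(text).length;
    } finally {
      encoding.free(); // release WASM memory for this throwaway instance
    }
  } catch {
    // Fallback estimation: roughly 4 characters per token for English text
    return Math.ceil(text.length / 4);
  }
}
```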
Issues found:
- Memory leak: tiktoken encodings are cached but never freed; the `clearEncodingCache()` function exists but is never called
- Type safety: an `as any` assertion bypasses TypeScript safety
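The cache and cleanup function called out above can be pictured roughly as follows; this is a hedged sketch of the pattern, not the PR's actual `estimators.ts`, and the Map-based cache is an assumption:

```ts
import { get_encoding, type Tiktoken } from "tiktoken";

// Assumed shape of the cache: one encoding instance per encoding name.
const encodingCache = new Map<string, Tiktoken>();

function getCachedEncoding(name: "cl100k_base" = "cl100k_base"): Tiktoken {
  let encoding = encodingCache.get(name);
  if (!encoding) {
    encoding = get_encoding(name);
    encodingCache.set(name, encoding);
  }
  return encoding;
}

// The review notes this exists but is never called. Invoking it (for example
// on shutdown or after a large ingestion job) frees the cached WASM encoders.
export function clearEncodingCache(): void {
  for (const encoding of encodingCache.values()) {
    encoding.free();
  }
  encodingCache.clear();
}
```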
Confidence Score: 3/5
- Safe to merge, with a minor memory leak that should be addressed post-merge
- Core logic is sound and improves accuracy significantly. However, tiktoken encodings are never freed, causing a memory leak in long-running processes. The cache is small (typically 1-3 models), so the impact is limited, but it should be fixed. The type assertion issue is minor.
- apps/sim/lib/tokenization/estimators.ts - needs cleanup mechanism for encoding cache
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| apps/sim/lib/tokenization/estimators.ts | 3/5 | Added tiktoken integration with caching, accurate token counting, and batching utilities; potential memory leak from encodings never freed |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Replaced fixed batch size with token-aware batching using tiktoken, improved logging for better observability |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 4/5 | Switched from estimation to accurate tiktoken counts, reduced chunk sizes for safety, added yaml parsing support |
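To make the `utils.ts` change concrete, here is a hedged sketch of token-aware batching; `batchByTokenLimit` is the helper named in the diagram below, while the injected counter and the exact packing logic are assumptions made so the example stands alone:

```ts
// Sketch: pack texts into batches that stay under a per-request token budget,
// replacing the old fixed batches of 50 items. A counting function is injected
// here to keep the example self-contained; the PR uses the tiktoken-backed counter.
const MAX_TOKENS_PER_BATCH = 8000;

export function batchByTokenLimit(
  texts: string[],
  maxTokens: number = MAX_TOKENS_PER_BATCH,
  countTokens: (text: string) => number = (t) => Math.ceil(t.length / 4)
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const text of texts) {
    const tokens = countTokens(text);
    // Close the current batch if adding this text would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(text);
    currentTokens += tokens;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```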
Sequence Diagram
sequenceDiagram
participant Client
participant API
participant JsonYamlChunker
participant EmbeddingUtils
participant Tokenization
participant Tiktoken
participant OpenAI
Client->>API: Upload JSON or YAML file
API->>API: Validate file extension
Note over API: json, yaml, yml now allowed
API->>JsonYamlChunker: chunk content
JsonYamlChunker->>Tokenization: getAccurateTokenCount
Tokenization->>Tiktoken: encode text
Tiktoken-->>Tokenization: token count
Note over JsonYamlChunker: Reduced chunk sizes<br/>1000 target 1500 max
JsonYamlChunker-->>API: chunks array
API->>EmbeddingUtils: generateEmbeddings
EmbeddingUtils->>Tokenization: batchByTokenLimit with 8000 max
Tokenization->>Tiktoken: count tokens for each text
Tiktoken-->>Tokenization: token counts
Tokenization-->>EmbeddingUtils: batches array
Note over Tokenization: Token-aware batching<br/>replaces fixed batches
loop For each batch
EmbeddingUtils->>OpenAI: Request embeddings
Note over EmbeddingUtils,OpenAI: Max 8000 tokens per batch
OpenAI-->>EmbeddingUtils: embeddings array
end
EmbeddingUtils-->>API: all embeddings
API-->>Client: Success response
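Putting the chunking step in code terms, a rough sketch of the JSON/YAML path (assuming the `yaml` npm package for parsing; the real `json-yaml-chunker.ts` is structure-aware and uses tiktoken counts, so everything below apart from the 1000/1500 sizes is illustrative):

```ts
import { parse as parseYaml } from "yaml";

interface ChunkOptions {
  targetTokens: number; // aim to close chunks around this size (1000 in the PR)
  maxTokens: number; // hard ceiling per chunk (1500 in the PR)
}

const DEFAULT_OPTIONS: ChunkOptions = { targetTokens: 1000, maxTokens: 1500 };

// Stand-in counter; the actual chunker uses the tiktoken-backed count.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export function chunkJsonYaml(
  content: string,
  extension: "json" | "yaml" | "yml",
  options: ChunkOptions = DEFAULT_OPTIONS
): string[] {
  // Parse eagerly so malformed files fail before any embedding calls are made.
  const data = extension === "json" ? JSON.parse(content) : parseYaml(content);

  // Naive split: serialize top-level entries, then pack them into chunks.
  const entries = Array.isArray(data)
    ? data.map((value) => JSON.stringify(value))
    : Object.entries(data ?? {}).map(([key, value]) => JSON.stringify({ [key]: value }));

  const chunks: string[] = [];
  let current = "";
  for (const entry of entries) {
    const candidate = current ? `${current}\n${entry}` : entry;
    if (current && estimateTokens(candidate) > options.targetTokens) {
      chunks.push(current); // close the chunk near the target size
      current = entry;
    } else if (estimateTokens(candidate) > options.maxTokens) {
      chunks.push(candidate); // oversized single entry; the real chunker splits deeper
      current = "";
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```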
7 files reviewed, 2 comments
Summary
Added tiktoken for embedding token estimation.
Type of Change
Testing
Tested manually.
Checklist