Conversation

@waleedlatif1
Collaborator

Summary

Added tiktoken for embedding token estimation.

Type of Change

  • Bug fix

Testing

Tested manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Oct 13, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| ------- | ---------- | ------- | -------- | ------------- |
| docs | Skipped | Skipped | | Oct 13, 2025 6:49pm |

Contributor

greptile-apps bot left a comment


Greptile Overview

Summary

Replaces token estimation heuristics with tiktoken for accurate token counting in embeddings and chunking. This ensures compliance with OpenAI's 8,191 token limit per embedding request.

Key changes:

  • Integrated tiktoken library for precise token counting matching OpenAI's behavior
  • Replaced fixed batch sizes (50 items) with token-aware batching (8,000 tokens/batch)
  • Reduced JSON chunk sizes from 2000→1000 tokens (target) and 3000→1500 tokens (max) for safer margins
  • Added support for JSON/YAML file uploads
  • Added fallback to estimation when tiktoken fails (see the sketch just after this list)
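
To ground the first and last items above, here is a minimal sketch of what the counting path could look like. getAccurateTokenCount and clearEncodingCache are named in this review; the function bodies, the cl100k_base choice, and the ~4 characters/token fallback heuristic are assumptions for illustration, not the PR's actual code.

```typescript
import { get_encoding, Tiktoken } from "tiktoken";

// Cache the encoding: constructing one initializes a WASM tokenizer,
// so it is reused across calls (mirroring the caching this review describes).
// cl100k_base is the encoding OpenAI's embedding models use.
const encodingCache = new Map<string, Tiktoken>();

function getCachedEncoding(): Tiktoken {
  let enc = encodingCache.get("cl100k_base");
  if (!enc) {
    enc = get_encoding("cl100k_base");
    encodingCache.set("cl100k_base", enc);
  }
  return enc;
}

// Accurate count via tiktoken, falling back to a rough heuristic
// (~4 characters per token; an assumed fallback) if encoding fails.
export function getAccurateTokenCount(text: string): number {
  try {
    return getCachedEncoding().encode(text).length;
  } catch {
    return Math.ceil(text.length / 4);
  }
}

// The cleanup function the review notes is never called: tiktoken
// encodings hold WASM-backed memory that free() releases.
export function clearEncodingCache(): void {
  for (const enc of encodingCache.values()) enc.free();
  encodingCache.clear();
}
```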

Issues found:

  • Memory leak: tiktoken encodings are cached but never freed; a clearEncodingCache() function exists but is never called
  • Type safety: an as any assertion bypasses TypeScript's type checking
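
One possible shape for the post-merge fix: call clearEncodingCache() once a long-running job is done with the tokenizer. The import path alias and the job function below are hypothetical; only clearEncodingCache() itself is named in this review.

```typescript
// Hypothetical call site; path alias and function name are assumptions.
import { clearEncodingCache } from "@/lib/tokenization/estimators";

async function processKnowledgeBaseUpload(texts: string[]): Promise<void> {
  try {
    // ... chunk, count tokens, batch, and embed as the PR does ...
  } finally {
    // Free the cached tiktoken encodings so WASM memory is reclaimed.
    clearEncodingCache();
  }
}
```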

Confidence Score: 3/5

  • Safe to merge with minor memory leak that should be addressed post-merge
  • Core logic is sound and improves accuracy significantly. However, tiktoken encodings are never freed, causing a memory leak in long-running processes. The cache is small (typically 1-3 models), so the impact is limited, but it should be fixed. The type assertion issue is minor.
  • apps/sim/lib/tokenization/estimators.ts - needs cleanup mechanism for encoding cache

Important Files Changed

File Analysis

| Filename | Score | Overview |
| -------- | ----- | -------- |
| apps/sim/lib/tokenization/estimators.ts | 3/5 | Added tiktoken integration with caching, accurate token counting, and batching utilities; potential memory leak from encodings never being freed |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Replaced fixed batch size with token-aware batching using tiktoken; improved logging for better observability |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 4/5 | Switched from estimation to accurate tiktoken counts, reduced chunk sizes for safety, added YAML parsing support |
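
The batchByTokenLimit helper named in the sequence diagram below could look like the following greedy packer. This is a sketch assuming getAccurateTokenCount behaves as outlined earlier, not the file's actual contents.

```typescript
// Assumed path alias for the counting helper sketched above.
import { getAccurateTokenCount } from "@/lib/tokenization/estimators";

// Greedy token-aware batching: start a new batch whenever adding the
// next text would push past the per-request budget (8,000 tokens here,
// safely under OpenAI's 8,191 limit).
export function batchByTokenLimit(
  texts: string[],
  maxTokensPerBatch = 8000
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const text of texts) {
    const tokens = getAccurateTokenCount(text);
    if (current.length > 0 && currentTokens + tokens > maxTokensPerBatch) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(text);
    currentTokens += tokens;
  }
  // Note: a single text over the limit still becomes its own batch;
  // the chunker's reduced sizes (1,000/1,500) make that unlikely.
  if (current.length > 0) batches.push(current);
  return batches;
}
```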

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant JsonYamlChunker
    participant EmbeddingUtils
    participant Tokenization
    participant Tiktoken
    participant OpenAI

    Client->>API: Upload JSON or YAML file
    API->>API: Validate file extension
    Note over API: json, yaml, yml now allowed

    API->>JsonYamlChunker: chunk content
    JsonYamlChunker->>Tokenization: getAccurateTokenCount
    Tokenization->>Tiktoken: encode text
    Tiktoken-->>Tokenization: token count
    Note over JsonYamlChunker: Reduced chunk sizes<br/>1000 target, 1500 max
    JsonYamlChunker-->>API: chunks array

    API->>EmbeddingUtils: generateEmbeddings
    EmbeddingUtils->>Tokenization: batchByTokenLimit with 8000 max
    Tokenization->>Tiktoken: count tokens for each text
    Tiktoken-->>Tokenization: token counts
    Tokenization-->>EmbeddingUtils: batches array
    Note over Tokenization: Token-aware batching<br/>replaces fixed batches

    loop For each batch
        EmbeddingUtils->>OpenAI: Request embeddings
        Note over EmbeddingUtils,OpenAI: Max 8000 tokens per batch
        OpenAI-->>EmbeddingUtils: embeddings array
    end

    EmbeddingUtils-->>API: all embeddings
    API-->>Client: Success response
```
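
The loop at the bottom of the diagram maps to something like the following, using the official openai SDK; the model name is an assumption, as the PR does not state which embedding model is used.

```typescript
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const client = new OpenAI();

// For each token-bounded batch, request embeddings in a single API call.
async function embedBatches(batches: string[][]): Promise<number[][]> {
  const all: number[][] = [];
  for (const batch of batches) {
    const res = await client.embeddings.create({
      model: "text-embedding-3-small", // assumed; not stated in the PR
      input: batch,
    });
    all.push(...res.data.map((d) => d.embedding));
  }
  return all;
}
```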

7 files reviewed, 2 comments


blacksmith-sh bot deleted a comment from waleedlatif1 on Oct 13, 2025
waleedlatif1 merged commit 1e81cd6 into staging on Oct 13, 2025
9 checks passed
waleedlatif1 deleted the fix/knowledgeb branch on October 13, 2025 18:53