fix(kb): added tiktoken for embedding token estimation #1616
Conversation
Greptile Overview
Summary
Replaces token estimation heuristics with tiktoken for accurate token counting in embeddings and chunking. This ensures compliance with OpenAI's 8,191 token limit per embedding request.
Key changes:
- Integrated tiktoken library for precise token counting matching OpenAI's behavior
- Replaced fixed batch sizes (50 items) with token-aware batching (8,000 tokens/batch)
- Reduced JSON chunk sizes from 2000→1000 tokens (target) and 3000→1500 tokens (max) for safer margins
- Added support for JSON/YAML file uploads
- Added fallback to estimation when tiktoken fails
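For reference, a minimal sketch of what tiktoken-backed counting with an estimation fallback can look like (assuming the WASM `tiktoken` npm package and the `cl100k_base` encoding used by OpenAI's embedding models; the function name mirrors the diagram below, other details are illustrative):

```ts
import { get_encoding } from "tiktoken";

// Sketch only: count tokens the same way OpenAI does, falling back to a
// rough heuristic if the WASM encoder cannot be loaded.
export function getAccurateTokenCount(text: string): number {
  try {
    const encoding = get_encoding("cl100k_base");
    try {
      return encoding.encode(text).length;
    } finally {
      encoding.free(); // release WASM memory for this throwaway instance
    }
  } catch {
    // Fallback estimation: roughly 4 characters per token for English text
    return Math.ceil(text.length / 4);
  }
}
```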
Issues found:
- Memory leak: tiktoken encodings are cached but never freed; the `clearEncodingCache()` function exists but is never called
- Type safety: an `as any` assertion bypasses TypeScript safety
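The cache and cleanup function called out above can be pictured roughly as follows; this is a hedged sketch of the pattern, not the PR's actual `estimators.ts`, and the Map-based cache is an assumption:

```ts
import { get_encoding, type Tiktoken } from "tiktoken";

// Assumed shape of the cache: one encoding instance per encoding name.
const encodingCache = new Map<string, Tiktoken>();

function getCachedEncoding(name: "cl100k_base" = "cl100k_base"): Tiktoken {
  let encoding = encodingCache.get(name);
  if (!encoding) {
    encoding = get_encoding(name);
    encodingCache.set(name, encoding);
  }
  return encoding;
}

// The review notes this exists but is never called. Invoking it (for example
// on shutdown or after a large ingestion job) frees the cached WASM encoders.
export function clearEncodingCache(): void {
  for (const encoding of encodingCache.values()) {
    encoding.free();
  }
  encodingCache.clear();
}
```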
Confidence Score: 3/5
- Safe to merge, with a minor memory leak that should be addressed post-merge
- Core logic is sound and improves accuracy significantly. However, tiktoken encodings are never freed, causing a memory leak in long-running processes. The cache is small (typically 1-3 models), so the impact is limited, but it should be fixed. The type assertion issue is minor.
- apps/sim/lib/tokenization/estimators.ts - needs cleanup mechanism for encoding cache
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| apps/sim/lib/tokenization/estimators.ts | 3/5 | Added tiktoken integration with caching, accurate token counting, and batching utilities; potential memory leak from encodings never freed |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Replaced fixed batch size with token-aware batching using tiktoken, improved logging for better observability |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 4/5 | Switched from estimation to accurate tiktoken counts, reduced chunk sizes for safety, added yaml parsing support |
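To make the `utils.ts` change concrete, here is a hedged sketch of token-aware batching; `batchByTokenLimit` is the helper named in the diagram below, while the injected counter and the exact packing logic are assumptions made so the example stands alone:

```ts
// Sketch: pack texts into batches that stay under a per-request token budget,
// replacing the old fixed batches of 50 items. A counting function is injected
// here to keep the example self-contained; the PR uses the tiktoken-backed counter.
const MAX_TOKENS_PER_BATCH = 8000;

export function batchByTokenLimit(
  texts: string[],
  maxTokens: number = MAX_TOKENS_PER_BATCH,
  countTokens: (text: string) => number = (t) => Math.ceil(t.length / 4)
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const text of texts) {
    const tokens = countTokens(text);
    // Close the current batch if adding this text would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(text);
    currentTokens += tokens;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```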
Sequence Diagram
sequenceDiagram
participant Client
participant API
participant JsonYamlChunker
participant EmbeddingUtils
participant Tokenization
participant Tiktoken
participant OpenAI
Client->>API: Upload JSON or YAML file
API->>API: Validate file extension
Note over API: json, yaml, yml now allowed
API->>JsonYamlChunker: chunk content
JsonYamlChunker->>Tokenization: getAccurateTokenCount
Tokenization->>Tiktoken: encode text
Tiktoken-->>Tokenization: token count
Note over JsonYamlChunker: Reduced chunk sizes<br/>1000 target 1500 max
JsonYamlChunker-->>API: chunks array
API->>EmbeddingUtils: generateEmbeddings
EmbeddingUtils->>Tokenization: batchByTokenLimit with 8000 max
Tokenization->>Tiktoken: count tokens for each text
Tiktoken-->>Tokenization: token counts
Tokenization-->>EmbeddingUtils: batches array
Note over Tokenization: Token-aware batching<br/>replaces fixed batches
loop For each batch
EmbeddingUtils->>OpenAI: Request embeddings
Note over EmbeddingUtils,OpenAI: Max 8000 tokens per batch
OpenAI-->>EmbeddingUtils: embeddings array
end
EmbeddingUtils-->>API: all embeddings
API-->>Client: Success response
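Putting the chunking step in code terms, a rough sketch of the JSON/YAML path (assuming the `yaml` npm package for parsing; the real `json-yaml-chunker.ts` is structure-aware and uses tiktoken counts, so everything below apart from the 1000/1500 sizes is illustrative):

```ts
import { parse as parseYaml } from "yaml";

interface ChunkOptions {
  targetTokens: number; // aim to close chunks around this size (1000 in the PR)
  maxTokens: number; // hard ceiling per chunk (1500 in the PR)
}

const DEFAULT_OPTIONS: ChunkOptions = { targetTokens: 1000, maxTokens: 1500 };

// Stand-in counter; the actual chunker uses the tiktoken-backed count.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export function chunkJsonYaml(
  content: string,
  extension: "json" | "yaml" | "yml",
  options: ChunkOptions = DEFAULT_OPTIONS
): string[] {
  // Parse eagerly so malformed files fail before any embedding calls are made.
  const data = extension === "json" ? JSON.parse(content) : parseYaml(content);

  // Naive split: serialize top-level entries, then pack them into chunks.
  const entries = Array.isArray(data)
    ? data.map((value) => JSON.stringify(value))
    : Object.entries(data ?? {}).map(([key, value]) => JSON.stringify({ [key]: value }));

  const chunks: string[] = [];
  let current = "";
  for (const entry of entries) {
    const candidate = current ? `${current}\n${entry}` : entry;
    if (current && estimateTokens(candidate) > options.targetTokens) {
      chunks.push(current); // close the chunk near the target size
      current = entry;
    } else if (estimateTokens(candidate) > options.maxTokens) {
      chunks.push(candidate); // oversized single entry; the real chunker splits deeper
      current = "";
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```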
7 files reviewed, 2 comments
Summary
Added tiktoken for embedding token estimation.
Type of Change
Testing
Tested manually.
Checklist