fix(kb): added tiktoken for embedding token estimation #1616
Merged
Greptile Overview
Summary
Replaces token estimation heuristics with tiktoken for accurate token counting in embeddings and chunking. This ensures compliance with OpenAI's 8,191 token limit per embedding request.
Key changes:
- Integrated tiktoken library for precise token counting matching OpenAI's behavior
- Replaced fixed batch sizes (50 items) with token-aware batching (8,000 tokens/batch)
- Reduced JSON chunk sizes from 2000→1000 tokens (target) and 3000→1500 tokens (max) for safer margins
- Added support for JSON/YAML file uploads
- Added fallback to estimation when tiktoken fails
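The token-aware batching described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the signature of `batchByTokenLimit`, the `TokenCounter` type, and the 4-characters-per-token default are assumptions made so the example runs without the tiktoken dependency. In the PR, the counter would presumably be backed by tiktoken.

```typescript
// Hypothetical counter type: maps a text to its token count.
type TokenCounter = (text: string) => number

// Assumed stand-in heuristic (~4 characters per token), used only so the
// sketch is self-contained. A tiktoken-backed counter would replace it.
const estimateTokens: TokenCounter = (text) => Math.ceil(text.length / 4)

// Greedily groups texts so that each batch stays within maxTokensPerBatch.
// A single text that exceeds the budget on its own still gets a batch to
// itself; splitting oversized texts is left to the chunker.
function batchByTokenLimit(
  texts: string[],
  maxTokensPerBatch: number,
  countTokens: TokenCounter = estimateTokens
): string[][] {
  const batches: string[][] = []
  let current: string[] = []
  let currentTokens = 0

  for (const text of texts) {
    const tokens = countTokens(text)
    // Close the current batch if adding this text would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokensPerBatch) {
      batches.push(current)
      current = []
      currentTokens = 0
    }
    current.push(text)
    currentTokens += tokens
  }

  if (current.length > 0) batches.push(current)
  return batches
}
```

Called as `batchByTokenLimit(chunks, 8000)`, this yields batches whose estimated token totals stay under the 8,000-token budget named in the review, in contrast to a fixed 50-item batch whose total size depends entirely on how long each item is.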
Issues found:
- Memory leak: tiktoken encodings are cached but never freed; the `clearEncodingCache()` function exists but is never called
- Type safety: an `as any` assertion bypasses TypeScript safety
Confidence Score: 3/5
- Safe to merge with minor memory leak that should be addressed post-merge
- Core logic is sound and improves accuracy significantly. However, tiktoken encodings are never freed, causing a memory leak in long-running processes. The cache is small (typically 1-3 models), so the impact is limited, but it should be fixed. The type assertion issue is minor.
- apps/sim/lib/tokenization/estimators.ts - needs cleanup mechanism for encoding cache
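One possible shape for the cleanup mechanism the review asks for is sketched below. It assumes the cached encoding objects expose a `free()` method (as the WASM-based `tiktoken` npm package does); the `FreeableEncoding` interface, the `getCachedEncoding` helper, and its `create` factory parameter are hypothetical names introduced for illustration. Only the name `clearEncodingCache` comes from the review, and its body here is a guess at what it might do.

```typescript
// Hypothetical minimal view of an encoding that holds native/WASM memory.
interface FreeableEncoding {
  encode(text: string): ArrayLike<number>
  free(): void
}

// Module-level cache keyed by model name (assumed structure).
const encodingCache = new Map<string, FreeableEncoding>()

// Returns the cached encoding for a model, creating it on first use.
// `create` is an injected factory so the sketch runs without tiktoken.
function getCachedEncoding(
  model: string,
  create: (model: string) => FreeableEncoding
): FreeableEncoding {
  let encoding = encodingCache.get(model)
  if (!encoding) {
    encoding = create(model)
    encodingCache.set(model, encoding)
  }
  return encoding
}

// Frees every cached encoding, then empties the cache so that later
// lookups create fresh encodings instead of reusing freed ones.
function clearEncodingCache(): void {
  encodingCache.forEach((encoding) => encoding.free())
  encodingCache.clear()
}
```

Where to call `clearEncodingCache()` is a design choice the review leaves open; a shutdown hook or the end of a batch job are plausible candidates, since freeing on every request would defeat the purpose of caching.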
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| apps/sim/lib/tokenization/estimators.ts | 3/5 | Added tiktoken integration with caching, accurate token counting, and batching utilities; potential memory leak from encodings never freed |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Replaced fixed batch size with token-aware batching using tiktoken, improved logging for better observability |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 4/5 | Switched from estimation to accurate tiktoken counts, reduced chunk sizes for safety, added yaml parsing support |
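The "accurate count with fallback to estimation" behavior attributed to `estimators.ts` might look roughly like the sketch below. The name `getAccurateTokenCount` appears in the review's sequence diagram, but its signature here is an assumption: the encoder is passed in as a parameter so the example does not depend on tiktoken, and the 4-characters-per-token estimate is a placeholder heuristic, not necessarily the one the PR uses.

```typescript
// Hypothetical encoder type: returns one entry per token.
type Encoder = (text: string) => ArrayLike<number>

// Assumed placeholder heuristic: roughly 4 characters per token.
function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / 4)
}

// Counts tokens with the supplied encoder and falls back to the
// heuristic if the encoder throws (for example, if the tokenizer
// failed to load or rejected the input).
function getAccurateTokenCount(text: string, encode: Encoder): number {
  if (text.length === 0) return 0
  try {
    return encode(text).length
  } catch (error) {
    return estimateTokenCount(text)
  }
}
```

Because the fallback is only an approximation, a batch built from estimated counts can overshoot the real limit, which is one reason to keep the batch budget (8,000) below the 8,191-token limit the review cites.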
Sequence Diagram
sequenceDiagram
participant Client
participant API
participant EmbeddingUtils
participant Tokenization
participant Tiktoken
participant OpenAI
Client->>API: Upload JSON or YAML file
API->>API: Validate file extension
Note over API: json, yaml, yml now allowed
API->>JsonYamlChunker: chunk content
JsonYamlChunker->>Tokenization: getAccurateTokenCount
Tokenization->>Tiktoken: encode text
Tiktoken-->>Tokenization: token count
Note over JsonYamlChunker: Reduced chunk sizes<br/>1000 target 1500 max
JsonYamlChunker-->>API: chunks array
API->>EmbeddingUtils: generateEmbeddings
EmbeddingUtils->>Tokenization: batchByTokenLimit with 8000 max
Tokenization->>Tiktoken: count tokens for each text
Tiktoken-->>Tokenization: token counts
Tokenization-->>EmbeddingUtils: batches array
Note over Tokenization: Token-aware batching<br/>replaces fixed batches
loop For each batch
EmbeddingUtils->>OpenAI: Request embeddings
Note over EmbeddingUtils,OpenAI: Max 8000 tokens per batch
OpenAI-->>EmbeddingUtils: embeddings array
end
EmbeddingUtils-->>API: all embeddings
API-->>Client: Success response
7 files reviewed, 2 comments
Summary
added tiktoken for embedding token estimation
Type of Change
Testing
Tested manually.
Checklist