feat(knowledge): add token, sentence, recursive, and regex chunkers #4102
waleedlatif1 wants to merge 15 commits into staging
PR Summary
Medium Risk
Overview: Wires … Refactors/extends the chunking implementation: extracts shared helpers into …
Reviewed by Cursor Bugbot for commit ec6fa58.
Greptile Summary
This PR adds four new chunking strategies (token, sentence, recursive, regex) to the knowledge base feature, extracts shared utilities into …
Confidence Score: 4/5
Mergeable after addressing the bypassable ReDoS check in RegexChunker; the rest of the stack is solid.
One P1 security finding requires attention before merge: the ReDoS guard can be bypassed by crafted patterns that are safe against the test string but catastrophic against monotone user content. The remaining findings are P2.
apps/sim/lib/chunkers/regex-chunker.ts: the ReDoS timing guard needs to be replaced with a static analysis approach.
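One way to picture the static-analysis direction suggested in this review is a conservative pre-check that rejects any pattern containing a quantified group that itself contains a quantifier (the classic `(a+)+` shape) before user content is ever matched. This is only a heuristic sketch: `looksCatastrophic`, `NESTED_QUANTIFIER`, and `MAX_PATTERN_LENGTH` are hypothetical names rather than code from this PR, and the check can both miss dangerous patterns and flag safe ones.

```typescript
// Hypothetical names throughout; a heuristic sketch, not the PR's implementation.
const MAX_PATTERN_LENGTH = 500 // mirrors the 500-character pattern limit mentioned in this PR

// Matches a parenthesised group (without nested groups) that contains `+` or `*`
// and is itself followed by `+`, `*`, or `{`, for example `(a+)+` or `(\d*)*`.
const NESTED_QUANTIFIER = /\((?:[^()\\]|\\.)*[+*](?:[^()\\]|\\.)*\)[+*{]/

function looksCatastrophic(pattern: string): boolean {
  // Overlong patterns are rejected outright instead of being analysed.
  if (pattern.length > MAX_PATTERN_LENGTH) return true
  return NESTED_QUANTIFIER.test(pattern)
}
```

Unlike a timing probe, the result depends only on the pattern text, so it cannot be steered by the choice of test string. A production guard would likely need a real regex parser or a linear-time matching engine.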
| Filename | Overview |
|---|---|
| apps/sim/lib/chunkers/regex-chunker.ts | New regex-based chunker with a timing-based ReDoS guard that is bypassable by patterns safe against the fixed test string but catastrophic against monotone user content. |
| apps/sim/lib/chunkers/utils.ts | New shared utilities; buildChunks index tracking drifts from true document positions when overlap is applied. |
| apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx | Strategy selector UI added; 'text' option labeled 'hierarchical splitting' which is misleading — that description fits the recursive chunker instead. |
| apps/sim/app/api/knowledge/route.ts | Zod schema extended with strategy/strategyOptions fields including a cross-field refine requiring regex pattern when strategy is 'regex'; correct and complete. |
| apps/sim/lib/knowledge/documents/document-processor.ts | Strategy dispatch logic added; explicit strategies bypass auto-detection cleanly, fallback to auto-detect when strategy is 'auto' or undefined is preserved. |
| apps/sim/lib/chunkers/types.ts | New types ChunkingStrategy, StrategyOptions, and per-chunker option interfaces added; old conflicting strategy union in ExtendedChunkingConfig correctly removed. |
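The cross-field rules described for the route (a regex pattern is required when the strategy is 'regex') and in the later commits (pattern length capped at 500, overlap smaller than the maximum size) can be sketched without any schema library. The field names `maxSize`, `overlap`, and `strategyOptions.pattern` and the function `validateChunkingConfig` are assumptions for illustration; the PR itself expresses these rules in a Zod schema.

```typescript
// Hand-rolled sketch of the validation rules; names and shapes are assumed.
type ChunkingStrategy = 'auto' | 'text' | 'token' | 'sentence' | 'recursive' | 'regex'

interface ChunkingConfigInput {
  maxSize: number
  overlap: number
  strategy?: ChunkingStrategy
  strategyOptions?: { pattern?: string }
}

function validateChunkingConfig(cfg: ChunkingConfigInput): string[] {
  const errors: string[] = []
  // Overlap must leave room for new content in every chunk.
  if (cfg.overlap >= cfg.maxSize) errors.push('overlap must be smaller than maxSize')

  const pattern = cfg.strategyOptions?.pattern
  // Cross-field rule: the regex strategy is meaningless without a pattern.
  if (cfg.strategy === 'regex' && !pattern) {
    errors.push('pattern is required when strategy is "regex"')
  }
  if (pattern !== undefined && pattern.length > 500) {
    errors.push('pattern must be at most 500 characters')
  }
  return errors
}
```

Returning a list of messages instead of throwing on the first failure lets a form show every problem at once.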
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
UI["CreateBaseModal\nstrategy selector"] -->|POST chunkingConfig| API["POST /api/knowledge\nZod validation"]
API -->|createKnowledgeBase| KB_DB[("knowledge_base.chunkingConfig\nstrategy + strategyOptions")]
KB_DB -->|processDocumentAsync| SVC["documents/service.ts\nreads rawConfig.strategy"]
SVC -->|processDocument| DP["document-processor.ts"]
DP -->|strategy not auto| DISPATCH["applyStrategy()"]
DP -->|strategy auto or undefined| AUTO["Auto-detect by MIME/content"]
DISPATCH -->|token| TC["TokenChunker"]
DISPATCH -->|sentence| SC["SentenceChunker"]
DISPATCH -->|recursive| RC["RecursiveChunker"]
DISPATCH -->|regex| RGX["RegexChunker"]
DISPATCH -->|text / default| TX["TextChunker"]
AUTO -->|JSON/YAML| JY["JsonYamlChunker"]
AUTO -->|CSV/spreadsheet| SD["StructuredDataChunker"]
AUTO -->|other| TX
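The dispatch in the flowchart above can be sketched as a small pure function. `selectChunker` and the MIME substrings it tests are hypothetical; the actual logic in document-processor.ts may detect content differently and returns chunker instances rather than names.

```typescript
// Sketch of the strategy dispatch; function name and MIME checks are assumptions.
type ChunkingStrategy = 'auto' | 'text' | 'token' | 'sentence' | 'recursive' | 'regex'
type ChunkerName =
  | 'TokenChunker'
  | 'SentenceChunker'
  | 'RecursiveChunker'
  | 'RegexChunker'
  | 'TextChunker'
  | 'JsonYamlChunker'
  | 'StructuredDataChunker'

function selectChunker(strategy: ChunkingStrategy | undefined, mimeType: string): ChunkerName {
  // An explicit strategy bypasses auto-detection entirely.
  if (strategy && strategy !== 'auto') {
    switch (strategy) {
      case 'token':
        return 'TokenChunker'
      case 'sentence':
        return 'SentenceChunker'
      case 'recursive':
        return 'RecursiveChunker'
      case 'regex':
        return 'RegexChunker'
      default:
        return 'TextChunker'
    }
  }
  // 'auto' or undefined: fall back to detection by content type.
  if (/json|yaml/.test(mimeType)) return 'JsonYamlChunker'
  if (/csv|spreadsheet|excel/.test(mimeType)) return 'StructuredDataChunker'
  return 'TextChunker'
}
```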
Comments Outside Diff (1)
apps/sim/lib/chunkers/utils.ts, line 942-969: `buildChunks` index tracking is inaccurate when overlap is active

When `overlapTokens > 0`, `buildChunks` estimates the overlap length as `Math.min(overlapChars, prevChunk.length, text.length)`, but the text passed in has already been modified by `addOverlap`, which trims to a word boundary, so the actual prepended overlap may be shorter than `overlapChars`. As a result `startIndex` is over-subtracted and `endIndex` undershoots, causing the `metadata.startIndex` / `endIndex` offsets to drift from their true positions in the original document as the chunk index increases.

If these offsets are used for document-level highlighting or retrieval provenance, they will be wrong for all but the first chunk. Consider computing the offsets before the overlap step, or tracking the real overlap length returned by `addOverlap`.
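The second suggestion (track the real overlap length) can be illustrated with a minimal sketch. It assumes the segments are contiguous slices of the original document and that the overlap is prepended without a joiner; `tailAtWordBoundary` and `buildChunksWithOffsets` are hypothetical names, not the helpers in utils.ts.

```typescript
interface ChunkWithOffsets {
  text: string
  startIndex: number
  endIndex: number
}

// Hypothetical helper: the last `maxChars` characters of `prev`, with any
// leading partial word dropped so the overlap starts after a space.
function tailAtWordBoundary(prev: string, maxChars: number): string {
  if (maxChars <= 0) return '' // also avoids slice(-0) returning the whole string
  const tail = prev.slice(-maxChars)
  if (tail.length === prev.length) return tail
  const firstSpace = tail.indexOf(' ')
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1)
}

// Assumes `segments` concatenate exactly to the original document.
function buildChunksWithOffsets(segments: string[], overlapChars: number): ChunkWithOffsets[] {
  const chunks: ChunkWithOffsets[] = []
  let cursor = 0 // start of the current segment in the original document
  for (let i = 0; i < segments.length; i++) {
    const overlap = i > 0 ? tailAtWordBoundary(segments[i - 1], overlapChars) : ''
    // Use the length actually prepended, not the requested budget.
    const startIndex = cursor - overlap.length
    const endIndex = cursor + segments[i].length
    chunks.push({ text: overlap + segments[i], startIndex, endIndex })
    cursor = endIndex
  }
  return chunks
}
```

Because the offsets are derived from the overlap that was really applied, `document.slice(startIndex, endIndex)` reproduces each chunk's text under the stated assumptions, even when the word-boundary trim shortens or removes the overlap.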
Reviews (1): Last reviewed commit: "feat(knowledge): add token, sentence, re..."
- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
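The recursive strategy with a word-boundary fallback, as described in this commit, can be sketched as follows. `recursiveSplit` is a hypothetical, simplified version: it does not merge small neighbouring pieces back together, and a single word longer than `maxChars` is emitted as is.

```typescript
// Simplified sketch; the real RecursiveChunker uses richer separator recipes.
function recursiveSplit(text: string, maxChars: number, separators: string[]): string[] {
  if (text.length <= maxChars) {
    const trimmed = text.trim()
    return trimmed ? [trimmed] : []
  }
  const [sep, ...rest] = separators
  if (sep === undefined) {
    // No separators left: pack whole words instead of slicing characters.
    const out: string[] = []
    let current = ''
    for (const word of text.split(/\s+/)) {
      if (!word) continue
      if (current && current.length + 1 + word.length > maxChars) {
        out.push(current)
        current = word
      } else {
        current = current ? `${current} ${word}` : word
      }
    }
    if (current) out.push(current)
    return out
  }
  // Split on the coarsest separator first, then recurse with finer ones.
  return text.split(sep).flatMap((part) => recursiveSplit(part, maxChars, rest))
}
```

For example, `recursiveSplit(doc, 20, ['\n\n', '\n'])` tries paragraph breaks first, then line breaks, and only then falls back to word packing.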
- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space
- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
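To make the abbreviation, initial, and decimal handling concrete, here is a minimal sketch that reaches a similar result with a split-then-merge pass rather than the lookbehind regex this commit describes. `splitSentences`, `endsWithAbbreviation`, and the abbreviation list are illustrative assumptions, not the SentenceChunker's actual code.

```typescript
// Illustrative abbreviation list; the PR's list is longer.
const ABBREVIATIONS = new Set(['Mr', 'Mrs', 'Ms', 'Dr', 'St', 'Rev', 'Gen', 'No', 'Fig', 'Vol', 'Jan', 'Jun', 'Jul'])

function endsWithAbbreviation(s: string): boolean {
  // Last alphabetic word before the final period, preceded by start, space, or a dot.
  const m = s.match(/(?:^|[\s.])([A-Za-z]+)\.$/)
  if (!m) return false
  // Known abbreviation, or a single capital letter such as the K in J.K.
  return ABBREVIATIONS.has(m[1]) || /^[A-Z]$/.test(m[1])
}

function splitSentences(text: string): string[] {
  // Step 1: cut after terminal punctuation followed by whitespace.
  // Decimals such as 3.5 have no whitespace after the dot, so they are never cut.
  const pieces: string[] = []
  const boundary = /[.!?]+\s+/g
  let start = 0
  let m: RegExpExecArray | null
  while ((m = boundary.exec(text)) !== null) {
    const end = m.index + m[0].length
    pieces.push(text.slice(start, end).trim())
    start = end
  }
  if (start < text.length) pieces.push(text.slice(start).trim())

  // Step 2: glue a piece to its successor when it ends in an abbreviation or initial.
  const sentences: string[] = []
  let buffer = ''
  for (const piece of pieces) {
    buffer = buffer ? `${buffer} ${piece}` : piece
    if (!endsWithAbbreviation(buffer)) {
      sentences.push(buffer)
      buffer = ''
    }
  }
  if (buffer) sentences.push(buffer)
  return sentences
}
```

The merge pass over-joins when a sentence genuinely ends in a listed word (for example "No."), which is the usual trade-off of abbreviation lists.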
Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.
- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"
Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.
@greptile
@cursor review
Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.
When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).
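The two advancement modes described in this commit can be sketched as follows. This is a hypothetical reimplementation for illustration: the real `splitAtWordBoundaries` in utils.ts may differ in signature and trimming, and the early exit at the end of the text is an assumption.

```typescript
// Sketch only; parameter names follow the commit messages, behaviour is assumed.
function splitAtWordBoundaries(text: string, chunkChars: number, stepChars?: number): string[] {
  const size = Math.max(1, chunkChars)
  const parts: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + size, text.length)
    if (end < text.length) {
      // Snap the cut back to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const piece = text.slice(pos, end).trim()
    if (piece) parts.push(piece)
    if (end >= text.length) break
    if (stepChars !== undefined) {
      // Sliding-window mode: fixed step, guarded so a step of 0 cannot loop forever.
      pos += Math.max(1, stepChars)
    } else {
      // Non-overlapping mode: resume exactly where the snapped chunk ended,
      // so no text between `end` and `pos + size` is skipped.
      pos = end
    }
  }
  return parts
}
```

In non-overlapping mode the pieces partition the text, so joining them with single spaces reproduces a single-spaced input. In sliding-window mode consecutive pieces share text whenever `stepChars` is smaller than `chunkChars`.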
@cursor review
- Restore /3 token estimation for StructuredDataChunker (structured data is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker behavior
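The effect of the two ratios is easy to see with a small character-based estimator. `estimateTokens` here is a hypothetical helper written for illustration; the ratios of roughly 4 characters per token for prose and 3 for structured data come from the commit message above.

```typescript
// Hypothetical helper: rough token count from character length.
function estimateTokens(text: string, charsPerToken = 4): number {
  if (charsPerToken <= 0) throw new RangeError('charsPerToken must be positive')
  return Math.ceil(text.length / charsPerToken)
}

const row = 'id,name,price\n1,widget,9.99'
const asProse = estimateTokens(row) // default prose ratio
const asStructured = estimateTokens(row, 3) // denser structured-data ratio
```

The structured estimate is never smaller than the prose estimate for the same text, so chunks sized with the /3 ratio hold fewer characters and are less likely to overshoot a token budget.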
@greptile
@cursor review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a53f760.
When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.
@greptile
@cursor review
- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

Summary
Type of Change
Testing
Tested manually. All 53 existing chunker tests pass.
Checklist