feat(knowledge): add token, sentence, recursive, and regex chunkers by waleedlatif1 · Pull Request #4102 · simstudioai/sim

waleedlatif1 · 2026-04-11T00:37:11Z

Summary

Add 4 new chunking strategies: token, sentence, recursive, regex
Extract shared chunking utilities (estimateTokens, cleanText, addOverlap, splitAtWordBoundaries) into utils.ts
Wire strategy selection through full stack: UI → API → service → document processor
Default to auto-detection (existing behavior), users can optionally select a strategy
Add ReDoS protection for user-supplied regex patterns
Add Zod validation for strategy options at API layer

Type of Change

New feature

Testing

Tested manually. All 53 existing chunker tests pass.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel · 2026-04-11T00:37:16Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Apr 11, 2026 2:37am

cursor · 2026-04-11T00:37:20Z

PR Summary

Medium Risk
Expands knowledge-base ingestion with multiple new chunking strategies and wires new config through API/UI into document processing, which can change chunk boundaries/token counts and affect embedding generation. Regex-based splitting adds user-supplied patterns (with safeguards) that could still impact performance or chunk quality if misconfigured.

Overview
Adds support for selectable document chunking strategies (token, sentence, recursive, regex) while keeping auto as the default behavior.

Wires chunkingConfig.strategy and chunkingConfig.strategyOptions end-to-end (UI modal → /api/knowledge validation → KB create params/types → document processing), including additional validation (e.g., overlap < maxSize and requiring a regex pattern when using regex).

Refactors/extends the chunking implementation: extracts shared helpers into lib/chunkers/utils.ts, adds new chunker implementations and tests, updates TextChunker to use the shared utilities, tightens JSON/YAML structured-data detection, and improves DocsChunker/StructuredDataChunker token estimation and chunk formatting to better align chunk metadata and size limits.

^{Reviewed by Cursor Bugbot for commit ec6fa58. Configure here.}

greptile-apps · 2026-04-11T00:41:37Z

Greptile Summary

This PR adds four new chunking strategies (token, sentence, recursive, regex) to the knowledge base feature, extracts shared utilities into utils.ts, threads strategy selection from the UI through Zod validation at the API layer down to the document processor, and defaults to the existing auto-detection path.

Security — bypassable ReDoS guard (regex-chunker.ts): the timing check runs against a fixed mixed-character string that does not trigger catastrophic backtracking for all vulnerable patterns (e.g. (\\s+)+ or (\\w+\\s*)+), so the protection is unreliable. A static analysis approach is strongly recommended instead.
UX — mislabeled strategy option (create-base-modal.tsx): the 'text' option is labeled "Text (hierarchical splitting)" but the TextChunker does word-boundary splitting; it is the RecursiveChunker that is hierarchical.
Metadata drift (utils.ts): buildChunks index tracking becomes inaccurate under overlap because addOverlap may trim to a word boundary, leaving startIndex/endIndex offsets misaligned with true document positions.

Confidence Score: 4/5

Mergeable after addressing the bypassable ReDoS check in RegexChunker; the rest of the stack is solid.

One P1 security finding (ReDoS guard can be bypassed by crafted patterns safe against the test string but catastrophic against monotone user content) requires attention before merge. The remaining findings are P2.

apps/sim/lib/chunkers/regex-chunker.ts — ReDoS timing guard needs replacement with a static analysis approach

Security Review

ReDoS — regex-chunker.ts: user-supplied regex patterns are accepted and tested with a timing heuristic against a single fixed test string. The heuristic can be bypassed by patterns that are O(n) on the test string but exponential on real content, allowing a user to submit a pattern that hangs the document-processing worker when applied to certain uploaded files.
No secrets are exposed and no SQL/command injection vectors are introduced.

Important Files Changed

Filename	Overview
apps/sim/lib/chunkers/regex-chunker.ts	New regex-based chunker with a timing-based ReDoS guard that is bypassable by patterns safe against the fixed test string but catastrophic against monotone user content.
apps/sim/lib/chunkers/utils.ts	New shared utilities; buildChunks index tracking drifts from true document positions when overlap is applied.
apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx	Strategy selector UI added; 'text' option labeled 'hierarchical splitting' which is misleading — that description fits the recursive chunker instead.
apps/sim/app/api/knowledge/route.ts	Zod schema extended with strategy/strategyOptions fields including a cross-field refine requiring regex pattern when strategy is 'regex'; correct and complete.
apps/sim/lib/knowledge/documents/document-processor.ts	Strategy dispatch logic added; explicit strategies bypass auto-detection cleanly, fallback to auto-detect when strategy is 'auto' or undefined is preserved.
apps/sim/lib/chunkers/types.ts	New types ChunkingStrategy, StrategyOptions, and per-chunker option interfaces added; old conflicting strategy union in ExtendedChunkingConfig correctly removed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    UI["CreateBaseModal\nstrategy selector"] -->|POST chunkingConfig| API["POST /api/knowledge\nZod validation"]
    API -->|createKnowledgeBase| KB_DB[("knowledge_base.chunkingConfig\nstrategy + strategyOptions")]
    KB_DB -->|processDocumentAsync| SVC["documents/service.ts\nreads rawConfig.strategy"]
    SVC -->|processDocument| DP["document-processor.ts"]
    DP -->|strategy not auto| DISPATCH["applyStrategy()"]
    DP -->|strategy auto or undefined| AUTO["Auto-detect by MIME/content"]
    DISPATCH -->|token| TC["TokenChunker"]
    DISPATCH -->|sentence| SC["SentenceChunker"]
    DISPATCH -->|recursive| RC["RecursiveChunker"]
    DISPATCH -->|regex| RGX["RegexChunker"]
    DISPATCH -->|text / default| TX["TextChunker"]
    AUTO -->|JSON/YAML| JY["JsonYamlChunker"]
    AUTO -->|CSV/spreadsheet| SD["StructuredDataChunker"]
    AUTO -->|other| TX

Comments Outside Diff (1)

apps/sim/lib/chunkers/utils.ts, line 942-969 (link)

buildChunks index tracking is inaccurate when overlap is active

When overlapTokens > 0, buildChunks estimates the overlap length as Math.min(overlapChars, prevChunk.length, text.length), but the text passed in has already been modified by addOverlap — which trims to a word boundary, so the actual prepended overlap may be shorter than overlapChars. As a result startIndex is over-subtracted and endIndex undershoots, causing the metadata.startIndex / endIndex offsets to drift from their true positions in the original document as chunk index increases.

If these offsets are used for document-level highlighting or retrieval provenance, they will be wrong for all but the first chunk. Consider computing the offsets before the overlap step, or tracking the real overlap length returned by addOverlap.

_{Reviews (1): Last reviewed commit: "feat(knowledge): add token, sentence, re..." | Re-trigger Greptile}

apps/sim/lib/chunkers/regex-chunker.ts

...sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx

apps/sim/lib/chunkers/sentence-chunker.ts

apps/sim/lib/chunkers/regex-chunker.ts

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils - Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio) - Fix DocsChunker operator precedence bug and hard-coded 300-token limit - Fix JsonYamlChunker isStructuredData false positive on plain strings - Add MAX_DEPTH recursion guard to JsonYamlChunker - Replace @/components/ui/select with emcn DropdownMenu in strategy selector

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return - RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing - RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces) - SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind - Add overlap < maxSize validation in Zod schema and UI form - Add pattern max length (500) validation in Zod schema - Fix StructuredDataChunker footer grammar

- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions - DocsChunker: strip export statements and JSX expressions in cleanContent - DocsChunker: fix table merge dedup using equality instead of includes - JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior - StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2 - TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it - utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals - Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings - Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters - Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk - Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently - Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0 - Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures) - Fix existing test expectations for updated footer format and isStructuredData behavior

Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.

- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text - Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.

waleedlatif1 · 2026-04-11T01:51:29Z

@greptile

waleedlatif1 · 2026-04-11T01:51:32Z

@cursor review

greptile-apps · 2026-04-11T01:57:08Z

Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

apps/sim/lib/chunkers/utils.ts

apps/sim/lib/chunkers/text-chunker.ts

When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).

waleedlatif1 · 2026-04-11T02:03:09Z

@cursor review

apps/sim/lib/chunkers/structured-data-chunker.ts

apps/sim/lib/chunkers/utils.ts

apps/sim/lib/chunkers/regex-chunker.ts

- Restore /3 token estimation for StructuredDataChunker (structured data is denser than prose, ~3 chars/token vs ~4) - Change addOverlap joiner from \n to space to match original TextChunker behavior

waleedlatif1 · 2026-04-11T02:17:15Z

@greptile

waleedlatif1 · 2026-04-11T02:17:18Z

@cursor review

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit a53f760. Configure here.}

apps/sim/lib/chunkers/sentence-chunker.ts

When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.

waleedlatif1 · 2026-04-11T02:30:41Z

@greptile

waleedlatif1 · 2026-04-11T02:30:45Z

@cursor review

- Fix regex fallback log: "character splitting" → "word-boundary splitting" - Add Jun and Jul to sentence chunker abbreviation list

feat(knowledge): add token, sentence, recursive, and regex chunkers

9f83f87

greptile-apps bot reviewed Apr 11, 2026

View reviewed changes

apps/sim/lib/chunkers/regex-chunker.ts Show resolved Hide resolved

...sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx Outdated Show resolved Hide resolved

cursor bot reviewed Apr 11, 2026

View reviewed changes

apps/sim/lib/chunkers/sentence-chunker.ts Outdated Show resolved Hide resolved

apps/sim/lib/chunkers/regex-chunker.ts Show resolved Hide resolved

vercel bot temporarily deployed to Preview April 11, 2026 00:50 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 00:55 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 01:01 Inactive

chore(chunkers): lint formatting

4872e75

vercel bot temporarily deployed to Preview April 11, 2026 01:13 Inactive

updated styling

fc006ee

vercel bot temporarily deployed to Preview April 11, 2026 01:17 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 01:32 Inactive

chore(chunkers): remove unnecessary comments and dead code

cb814ff

Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.

vercel bot temporarily deployed to Preview April 11, 2026 01:46 Inactive

fix(chunkers): address PR review comments

899fc68

- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text - Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"

vercel bot temporarily deployed to Preview April 11, 2026 01:47 Inactive

fix(chunkers): use consistent overlap pattern in regex fallback

4c3508b

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.

vercel bot temporarily deployed to Preview April 11, 2026 01:51 Inactive

cursor bot reviewed Apr 11, 2026

View reviewed changes

apps/sim/lib/chunkers/utils.ts Show resolved Hide resolved

apps/sim/lib/chunkers/text-chunker.ts Show resolved Hide resolved

fix(chunkers): prevent content loss in word boundary splitting

3a26dad

When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).

vercel bot temporarily deployed to Preview April 11, 2026 02:01 Inactive

cursor bot reviewed Apr 11, 2026

View reviewed changes

apps/sim/lib/chunkers/structured-data-chunker.ts Outdated Show resolved Hide resolved

apps/sim/lib/chunkers/utils.ts Outdated Show resolved Hide resolved

apps/sim/lib/chunkers/regex-chunker.ts Show resolved Hide resolved

fix(chunkers): restore structured data token ratio and overlap joiner

5e8b051

- Restore /3 token estimation for StructuredDataChunker (structured data is denser than prose, ~3 chars/token vs ~4) - Change addOverlap joiner from \n to space to match original TextChunker behavior

vercel bot temporarily deployed to Preview April 11, 2026 02:13 Inactive

lint

a53f760

vercel bot temporarily deployed to Preview April 11, 2026 02:17 Inactive

cursor bot reviewed Apr 11, 2026

View reviewed changes

apps/sim/lib/chunkers/sentence-chunker.ts Show resolved Hide resolved

fix(chunkers): fall back to character-level overlap in sentence chunker

ec6fa58

When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.

vercel bot temporarily deployed to Preview April 11, 2026 02:29 Inactive

fix(chunkers): fix log message and add missing month abbreviations

e391efa

- Fix regex fallback log: "character splitting" → "word-boundary splitting" - Add Jun and Jul to sentence chunker abbreviation list

vercel bot temporarily deployed to Preview April 11, 2026 02:37 Inactive

Conversation

waleedlatif1 commented Apr 11, 2026

Summary

Type of Change

Testing

Checklist

Uh oh!

vercel bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

greptile-apps bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Security Review

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

greptile-apps bot commented Apr 11, 2026

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

waleedlatif1 commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel bot commented Apr 11, 2026 •

edited

Loading

cursor bot commented Apr 11, 2026 •

edited

Loading

greptile-apps bot commented Apr 11, 2026 •

edited

Loading