Skip to content

feat(knowledge): add token, sentence, recursive, and regex chunkers#4102

Open
waleedlatif1 wants to merge 15 commits intostagingfrom
waleedlatif1/add-chunkers
Open

feat(knowledge): add token, sentence, recursive, and regex chunkers#4102
waleedlatif1 wants to merge 15 commits intostagingfrom
waleedlatif1/add-chunkers

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • Add 4 new chunking strategies: token, sentence, recursive, regex
  • Extract shared chunking utilities (estimateTokens, cleanText, addOverlap, splitAtWordBoundaries) into utils.ts
  • Wire strategy selection through full stack: UI → API → service → document processor
  • Default to auto-detection (existing behavior), users can optionally select a strategy
  • Add ReDoS protection for user-supplied regex patterns
  • Add Zod validation for strategy options at API layer

Type of Change

  • New feature

Testing

Tested manually. All 53 existing chunker tests pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 11, 2026 2:37am

Request Review

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 11, 2026

PR Summary

Medium Risk
Expands knowledge-base ingestion with multiple new chunking strategies and wires new config through API/UI into document processing, which can change chunk boundaries/token counts and affect embedding generation. Regex-based splitting adds user-supplied patterns (with safeguards) that could still impact performance or chunk quality if misconfigured.

Overview
Adds support for selectable document chunking strategies (token, sentence, recursive, regex) while keeping auto as the default behavior.

Wires chunkingConfig.strategy and chunkingConfig.strategyOptions end-to-end (UI modal → /api/knowledge validation → KB create params/types → document processing), including additional validation (e.g., overlap < maxSize and requiring a regex pattern when using regex).

Refactors/extends the chunking implementation: extracts shared helpers into lib/chunkers/utils.ts, adds new chunker implementations and tests, updates TextChunker to use the shared utilities, tightens JSON/YAML structured-data detection, and improves DocsChunker/StructuredDataChunker token estimation and chunk formatting to better align chunk metadata and size limits.

Reviewed by Cursor Bugbot for commit ec6fa58. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 11, 2026

Greptile Summary

This PR adds four new chunking strategies (token, sentence, recursive, regex) to the knowledge base feature, extracts shared utilities into utils.ts, threads strategy selection from the UI through Zod validation at the API layer down to the document processor, and defaults to the existing auto-detection path.

  • Security — bypassable ReDoS guard (regex-chunker.ts): the timing check runs against a fixed mixed-character string that does not trigger catastrophic backtracking for all vulnerable patterns (e.g. (\\s+)+ or (\\w+\\s*)+), so the protection is unreliable. A static analysis approach is strongly recommended instead.
  • UX — mislabeled strategy option (create-base-modal.tsx): the 'text' option is labeled "Text (hierarchical splitting)" but the TextChunker does word-boundary splitting; it is the RecursiveChunker that is hierarchical.
  • Metadata drift (utils.ts): buildChunks index tracking becomes inaccurate under overlap because addOverlap may trim to a word boundary, leaving startIndex/endIndex offsets misaligned with true document positions.

Confidence Score: 4/5

Mergeable after addressing the bypassable ReDoS check in RegexChunker; the rest of the stack is solid.

One P1 security finding (ReDoS guard can be bypassed by crafted patterns safe against the test string but catastrophic against monotone user content) requires attention before merge. The remaining findings are P2.

apps/sim/lib/chunkers/regex-chunker.ts — ReDoS timing guard needs replacement with a static analysis approach

Security Review

  • ReDoS — regex-chunker.ts: user-supplied regex patterns are accepted and tested with a timing heuristic against a single fixed test string. The heuristic can be bypassed by patterns that are O(n) on the test string but exponential on real content, allowing a user to submit a pattern that hangs the document-processing worker when applied to certain uploaded files.
  • No secrets are exposed and no SQL/command injection vectors are introduced.

Important Files Changed

Filename Overview
apps/sim/lib/chunkers/regex-chunker.ts New regex-based chunker with a timing-based ReDoS guard that is bypassable by patterns safe against the fixed test string but catastrophic against monotone user content.
apps/sim/lib/chunkers/utils.ts New shared utilities; buildChunks index tracking drifts from true document positions when overlap is applied.
apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx Strategy selector UI added; 'text' option labeled 'hierarchical splitting' which is misleading — that description fits the recursive chunker instead.
apps/sim/app/api/knowledge/route.ts Zod schema extended with strategy/strategyOptions fields including a cross-field refine requiring regex pattern when strategy is 'regex'; correct and complete.
apps/sim/lib/knowledge/documents/document-processor.ts Strategy dispatch logic added; explicit strategies bypass auto-detection cleanly, fallback to auto-detect when strategy is 'auto' or undefined is preserved.
apps/sim/lib/chunkers/types.ts New types ChunkingStrategy, StrategyOptions, and per-chunker option interfaces added; old conflicting strategy union in ExtendedChunkingConfig correctly removed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    UI["CreateBaseModal\nstrategy selector"] -->|POST chunkingConfig| API["POST /api/knowledge\nZod validation"]
    API -->|createKnowledgeBase| KB_DB[("knowledge_base.chunkingConfig\nstrategy + strategyOptions")]
    KB_DB -->|processDocumentAsync| SVC["documents/service.ts\nreads rawConfig.strategy"]
    SVC -->|processDocument| DP["document-processor.ts"]
    DP -->|strategy not auto| DISPATCH["applyStrategy()"]
    DP -->|strategy auto or undefined| AUTO["Auto-detect by MIME/content"]
    DISPATCH -->|token| TC["TokenChunker"]
    DISPATCH -->|sentence| SC["SentenceChunker"]
    DISPATCH -->|recursive| RC["RecursiveChunker"]
    DISPATCH -->|regex| RGX["RegexChunker"]
    DISPATCH -->|text / default| TX["TextChunker"]
    AUTO -->|JSON/YAML| JY["JsonYamlChunker"]
    AUTO -->|CSV/spreadsheet| SD["StructuredDataChunker"]
    AUTO -->|other| TX
Loading

Comments Outside Diff (1)

  1. apps/sim/lib/chunkers/utils.ts, line 942-969 (link)

    P2 buildChunks index tracking is inaccurate when overlap is active

    When overlapTokens > 0, buildChunks estimates the overlap length as Math.min(overlapChars, prevChunk.length, text.length), but the text passed in has already been modified by addOverlap — which trims to a word boundary, so the actual prepended overlap may be shorter than overlapChars. As a result startIndex is over-subtracted and endIndex undershoots, causing the metadata.startIndex / endIndex offsets to drift from their true positions in the original document as chunk index increases.

    If these offsets are used for document-level highlighting or retrieval provenance, they will be wrong for all but the first chunk. Consider computing the offsets before the overlap step, or tracking the real overlap length returned by addOverlap.

Reviews (1): Last reviewed commit: "feat(knowledge): add token, sentence, re..." | Re-trigger Greptile

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
- Expand RecursiveChunker recipes: markdown adds horizontal rules, code
  fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months
  and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
- DocsChunker: extract headers from cleaned content (not raw markdown)
  to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in
  one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2
  to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie)
  where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding
  window overlap; addOverlap uses newline join instead of space
- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
Strip 445 lines of redundant TSDoc, math calculation comments,
implementation rationale notes, and assertion-restating comments
across all chunker source and test files.
- Fix regex fallback path: use sliding window for overlap instead of
  passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" →
  "Text (word boundary splitting)"
Use addOverlap + buildChunks(chunks, overlap) in the regex fallback
path to match the main path and all other chunkers (TextChunker,
RecursiveChunker). The sliding window approach was inconsistent.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 11, 2026

Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

When splitAtWordBoundaries snaps end back to a word boundary, advance
pos from end (not pos + step) in non-overlapping mode. The step-based
advancement is preserved for the sliding window case (TokenChunker).
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

- Restore /3 token estimation for StructuredDataChunker (structured data
  is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker
  behavior
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a53f760. Configure here.

When no complete sentence fits within the overlap budget,
fall back to character-level word-boundary overlap from the
previous group's text. This ensures buildChunks metadata is
always correct.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant