Merged
feat(knowledge): add token, sentence, recursive, and regex chunkers (#4102)

* feat(knowledge): add token, sentence, recursive, and regex chunkers

* fix(chunkers): standardize token estimation and use emcn dropdown

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
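
A recursion guard of the kind described for JsonYamlChunker could look roughly like the sketch below. The value `MAX_DEPTH = 5`, the helper name `collectLeaves`, and the choice to serialize the remaining subtree once the limit is reached are all assumptions made for illustration; the PR text does not show the actual implementation.

```typescript
// Assumed limit; the real constant's value is not shown in the PR.
const MAX_DEPTH = 5

// Hypothetical helper: walks a parsed JSON/YAML value and returns one
// "path: value" string per leaf. Once MAX_DEPTH is reached, the remaining
// subtree is serialized as a single leaf instead of recursing further.
function collectLeaves(value: unknown, path: string[] = [], depth = 0): string[] {
  const isContainer = typeof value === 'object' && value !== null
  if (!isContainer || depth >= MAX_DEPTH) {
    const label = path.length > 0 ? path.join('.') : '(root)'
    return [`${label}: ${JSON.stringify(value)}`]
  }

  const entries: Array<[string, unknown]> = Array.isArray(value)
    ? value.map((item, index): [string, unknown] => [String(index), item])
    : Object.entries(value as Record<string, unknown>)

  const leaves: string[] = []
  for (const [key, child] of entries) {
    leaves.push(...collectLeaves(child, [...path, key], depth + 1))
  }
  return leaves
}
```

Keeping the path alongside each leaf is also one simple way to carry breadcrumbs into the emitted text.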

* fix(chunkers): address research audit findings

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code
  fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months
  and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
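
The cross-field rules added to the schema can be restated as a plain function, which may make them easier to reason about. This is a sketch only: `validateChunkingConfig` and the `ChunkingConfig` interface are hypothetical stand-ins for the Zod refinements, and the error strings are modeled on the ones in the diff.

```typescript
// Hypothetical plain-TypeScript mirror of the schema's cross-field checks.
interface ChunkingConfig {
  maxSize: number // tokens
  minSize: number // characters
  overlap: number // tokens
  strategy?: 'auto' | 'text' | 'regex' | 'recursive' | 'sentence' | 'token'
  strategyOptions?: { pattern?: string }
}

const CHARS_PER_TOKEN = 4
const MAX_PATTERN_LENGTH = 500

function validateChunkingConfig(config: ChunkingConfig): string[] {
  const errors: string[] = []

  // minSize is in characters, maxSize in tokens, so convert before comparing.
  if (config.minSize >= config.maxSize * CHARS_PER_TOKEN) {
    errors.push('Min chunk size (characters) must be less than max chunk size (tokens × 4)')
  }
  // Overlap and maxSize are both in tokens.
  if (config.overlap >= config.maxSize) {
    errors.push('Overlap must be less than max chunk size')
  }

  const pattern = config.strategyOptions?.pattern
  if (pattern !== undefined && pattern.length > MAX_PATTERN_LENGTH) {
    errors.push(`Regex pattern must be at most ${MAX_PATTERN_LENGTH} characters`)
  }
  if (config.strategy === 'regex' && !pattern) {
    errors.push('Regex pattern is required when using the regex chunking strategy')
  }
  return errors
}
```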

* fix(chunkers): fix remaining audit issues across all chunkers

- DocsChunker: extract headers from cleaned content (not raw markdown)
  to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in
  one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2
  to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie)
  where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding
  window overlap; addOverlap uses newline join instead of space
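
The sliding-window idea can be illustrated independently of the tokenizer. The sketch below is not the TokenChunker code; `slidingWindows` is a hypothetical generic helper that shows why each window stays within `size`: overlap is produced by advancing the start by `size - overlap` rather than by prepending text to an already full chunk.

```typescript
// Hypothetical helper: fixed-size windows over any token-like array.
// Each window has at most `size` items; consecutive windows share `overlap` items.
function slidingWindows<T>(items: T[], size: number, overlap: number): T[][] {
  if (size <= 0) throw new Error('size must be positive')
  // Clamp the stride to at least 1 so the loop always advances.
  const step = Math.max(1, size - overlap)
  const windows: T[][] = []
  for (let start = 0; start < items.length; start += step) {
    windows.push(items.slice(start, start + size))
    // Stop once a window has reached the end, to avoid redundant tail windows.
    if (start + size >= items.length) break
  }
  return windows
}
```

With ten items, `size = 4`, and `overlap = 1`, this yields three windows starting at indices 0, 3, and 6, none longer than four items.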

* chore(chunkers): lint formatting

* updated styling

* fix(chunkers): audit fixes and comprehensive tests

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
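
To make the lookbehind fix concrete, here is a minimal sketch of a sentence boundary pattern whose negative lookbehinds include the period. It is an illustration under assumptions, not the SentenceChunker regex: the abbreviation list is shortened, and `splitSentences` is a hypothetical name.

```typescript
// Shortened, assumed abbreviation list; the commit mentions more entries.
const ABBREVIATIONS = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'St']

// Split on whitespace that follows ., ! or ?, unless the text right before the
// whitespace is a known abbreviation plus its period ("Dr.") or a single
// capital-letter initial plus its period ("K.").
const SENTENCE_BOUNDARY = new RegExp(
  `(?<!\\b(?:${ABBREVIATIONS.join('|')})\\.)(?<!\\b[A-Z]\\.)(?<=[.!?])\\s+`
)

function splitSentences(text: string): string[] {
  return text.split(SENTENCE_BOUNDARY).filter((sentence) => sentence.length > 0)
}
```

Decimals such as `3.5` are left alone because the pattern only matches whitespace. One known limitation of this sketch is that a sentence ending in a single capital letter ("Plan B. Next ...") is not split.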

* chore(chunkers): remove unnecessary comments and dead code

Strip 445 lines of redundant TSDoc, math calculation comments,
implementation rationale notes, and assertion-restating comments
across all chunker source and test files.

* fix(chunkers): address PR review comments

- Fix regex fallback path: use sliding window for overlap instead of
  passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" →
  "Text (word boundary splitting)"

* fix(chunkers): use consistent overlap pattern in regex fallback

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback
path to match the main path and all other chunkers (TextChunker,
RecursiveChunker). The sliding window approach was inconsistent.

* fix(chunkers): prevent content loss in word boundary splitting

When splitAtWordBoundaries snaps end back to a word boundary, advance
pos from end (not pos + step) in non-overlapping mode. The step-based
advancement is preserved for the sliding window case (TokenChunker).
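
The content-loss fix is easier to see in code. The following is a reconstruction under assumptions, not the actual utility: the function name and the optional `stepChars` parameter come from the commit message, while the exact signature, the space-only boundary rule, and the trimming are guesses.

```typescript
// Sketch of word-boundary splitting with the two advancement modes described above.
function splitAtWordBoundaries(text: string, chunkSizeChars: number, stepChars?: number): string[] {
  if (chunkSizeChars <= 0) throw new Error('chunkSizeChars must be positive')
  const chunks: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + chunkSizeChars, text.length)
    if (end < text.length) {
      // Snap back to the last space at or before the window end, if any.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const chunk = text.slice(pos, end).trim()
    if (chunk.length > 0) chunks.push(chunk)
    if (end >= text.length) break

    if (stepChars !== undefined) {
      // Sliding-window mode: fixed stride, clamped so a step of 0 cannot loop forever.
      pos += Math.max(1, stepChars)
    } else {
      // Non-overlapping mode: resume from where the chunk actually ended.
      // Using pos + chunkSizeChars here would skip the text between the
      // snapped `end` and the original window end.
      pos = end
    }
  }
  return chunks
}
```

In non-overlapping mode, joining the chunks with single spaces reproduces a single-spaced input exactly, which is a convenient property to assert in tests.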

* fix(chunkers): restore structured data token ratio and overlap joiner

- Restore /3 token estimation for StructuredDataChunker (structured data
  is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker
  behavior
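
The two ratios amount to a one-line heuristic. The sketch below assumes a character-count estimate rounded up; `estimateTokens` is a hypothetical name and the rounding choice is an assumption, not something the PR states.

```typescript
// Heuristic token estimate: roughly 4 characters per token for prose,
// roughly 3 for denser structured data such as CSV rows.
function estimateTokens(text: string, charsPerToken = 4): number {
  if (charsPerToken <= 0) throw new Error('charsPerToken must be positive')
  return Math.ceil(text.length / charsPerToken)
}

const row = 'id,name,price' // 13 characters
const asProse = estimateTokens(row) // ceil(13 / 4) = 4
const asStructured = estimateTokens(row, 3) // ceil(13 / 3) = 5
```

The lower ratio makes the same text count as more tokens, so structured chunks are cut earlier and are less likely to exceed the real token budget.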

* lint

* fix(chunkers): fall back to character-level overlap in sentence chunker

When no complete sentence fits within the overlap budget,
fall back to character-level word-boundary overlap from the
previous group's text. This ensures buildChunks metadata is
always correct.
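
One way to picture the fallback: prefer whole trailing sentences, and only when none fits take a word-aligned tail of the previous text. The helper below is a hypothetical sketch with a character budget; the real chunker's units and function names are not shown in this commit message.

```typescript
// Hypothetical helper: choose overlap text from the previous sentence group.
function overlapFromPrevious(prevSentences: string[], overlapChars: number): string {
  if (overlapChars <= 0 || prevSentences.length === 0) return ''

  // First choice: as many complete trailing sentences as fit in the budget.
  const picked: string[] = []
  let used = 0
  for (let i = prevSentences.length - 1; i >= 0; i--) {
    const sentence = prevSentences[i]
    const cost = sentence.length + (picked.length > 0 ? 1 : 0) // +1 for the joining space
    if (used + cost > overlapChars) break
    picked.unshift(sentence)
    used += cost
  }
  if (picked.length > 0) return picked.join(' ')

  // Fallback: take the last `overlapChars` characters and drop the leading
  // fragment so the overlap starts at a word boundary. If the tail contains
  // no space at all, it is returned as-is.
  const tail = prevSentences.join(' ').slice(-overlapChars)
  const firstSpace = tail.indexOf(' ')
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1)
}
```

Because some overlap text is produced in both branches whenever the budget is positive, metadata that assumes an overlap was applied stays consistent with the chunk content.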

* fix(chunkers): fix log message and add missing month abbreviations

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

* lint

* fix(chunkers): restore structured data detection threshold to > 2

avgCount >= 1 was too permissive — prose with consistent comma usage
would be misclassified as CSV. Restore original > 2 threshold while
keeping the improved proportional tolerance.
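
The interaction between the `> 2` threshold and the 20% tolerance can be sketched as follows. This is an assumption-heavy illustration: `looksDelimited` is a hypothetical name, and it assumes the detector counts delimiters per line and compares each line to the average, which the commit message implies but does not spell out.

```typescript
// Hypothetical detector: treat text as delimited data only if lines carry
// more than 2 delimiters on average and every line is within 20% of that average.
function looksDelimited(text: string, delimiter = ','): boolean {
  const lines = text.split('\n').filter((line) => line.trim().length > 0)
  if (lines.length < 2) return false

  const counts = lines.map((line) => line.split(delimiter).length - 1)
  const avgCount = counts.reduce((sum, count) => sum + count, 0) / counts.length

  // Threshold from the commit message: prose with one or two commas per line
  // should not be classified as CSV.
  if (avgCount <= 2) return false

  // Relative tolerance scales with row width, unlike a fixed +/-2.
  const tolerance = avgCount * 0.2
  return counts.every((count) => Math.abs(count - avgCount) <= tolerance)
}
```

Under this sketch, a four-column table passes while two lines of prose with one comma each do not.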

* fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker

* fix(chunkers): restore separator-as-joiner pattern in splitRecursively

Separator was unconditionally prepended to parts after the first,
leaving leading punctuation on chunks after a boundary reset.
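
The joiner pattern can be shown with a small merge loop. This is a sketch rather than the `splitRecursively` code: `mergeParts` is a hypothetical helper, sizes are in characters, and oversized single parts are kept whole here where a recursive splitter would presumably descend to the next separator.

```typescript
// Hypothetical helper: merge split parts back into chunks of at most `maxChars`.
// The separator is only placed BETWEEN parts inside one chunk, so a chunk that
// starts after a boundary reset never begins with the separator.
function mergeParts(parts: string[], separator: string, maxChars: number): string[] {
  const chunks: string[] = []
  let current = ''
  for (const part of parts) {
    if (part.length === 0) continue
    const candidate = current === '' ? part : current + separator + part
    if (current === '' || candidate.length <= maxChars) {
      current = candidate
    } else {
      chunks.push(current)
      current = part // boundary reset: start clean, no leading separator
    }
  }
  if (current !== '') chunks.push(current)
  return chunks
}
```

Prepending the separator to every part after the first would instead make the second chunk in the example `['alpha', 'beta', 'gamma']` start with `', '`.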

* feat(knowledge): add JSONL file support for knowledge base uploads

Parses JSON Lines files by splitting on newlines and converting to a
JSON array, which then flows through the existing JsonYamlChunker.
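
A minimal sketch of that conversion, assuming blank lines are skipped and a malformed record should fail loudly. `parseJsonl` is a hypothetical name and the error wording is illustrative; the PR's actual parser is not shown here.

```typescript
// Hypothetical JSONL parser: one JSON value per non-empty line, returned as an array.
function parseJsonl(content: string): unknown[] {
  const records: unknown[] = []
  const lines = content.split(/\r?\n/)
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim()
    if (line.length === 0) continue // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(line))
    } catch (error) {
      throw new Error(`Invalid JSON on line ${i + 1}`)
    }
  }
  return records
}

// The resulting array can be re-serialized and handed to a JSON-aware chunker.
const asJsonArray = JSON.stringify(parseJsonl('{"a":1}\n{"b":2}\n'))
```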

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
waleedlatif1 and claude authored Apr 11, 2026
commit 1acafe87635491818648161e2a33fc6000aa9538
42 changes: 30 additions & 12 deletions apps/sim/app/api/knowledge/route.ts
@@ -15,14 +15,6 @@ import { captureServerEvent } from '@/lib/posthog/server'

const logger = createLogger('KnowledgeBaseAPI')

-/**
-* Schema for creating a knowledge base
-*
-* Chunking config units:
-* - maxSize: tokens (1 token ≈ 4 characters)
-* - minSize: characters
-* - overlap: tokens (1 token ≈ 4 characters)
-*/
const CreateKnowledgeBaseSchema = z.object({
name: z.string().min(1, 'Name is required'),
description: z.string().optional(),
@@ -31,12 +23,20 @@ const CreateKnowledgeBaseSchema = z.object({
embeddingDimension: z.literal(1536).default(1536),
chunkingConfig: z
.object({
-/** Maximum chunk size in tokens (1 token ≈ 4 characters) */
maxSize: z.number().min(100).max(4000).default(1024),
-/** Minimum chunk size in characters */
minSize: z.number().min(1).max(2000).default(100),
-/** Overlap between chunks in tokens (1 token ≈ 4 characters) */
overlap: z.number().min(0).max(500).default(200),
+strategy: z
+.enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token'])
+.default('auto')
+.optional(),
+strategyOptions: z
+.object({
+pattern: z.string().max(500).optional(),
+separators: z.array(z.string()).optional(),
+recipe: z.enum(['plain', 'markdown', 'code']).optional(),
+})
+.optional(),
})
.default({
maxSize: 1024,
@@ -45,13 +45,31 @@ const CreateKnowledgeBaseSchema = z.object({
})
.refine(
(data) => {
-// Convert maxSize from tokens to characters for comparison (1 token ≈ 4 chars)
const maxSizeInChars = data.maxSize * 4
return data.minSize < maxSizeInChars
},
{
message: 'Min chunk size (characters) must be less than max chunk size (tokens × 4)',
}
+)
+.refine(
+(data) => {
+return data.overlap < data.maxSize
+},
+{
+message: 'Overlap must be less than max chunk size',
+}
+)
+.refine(
+(data) => {
+if (data.strategy === 'regex' && !data.strategyOptions?.pattern) {
+return false
+}
+return true
+},
+{
+message: 'Regex pattern is required when using the regex chunking strategy',
+}
),
})

@@ -263,7 +263,8 @@ export function AddDocumentsModal({
{isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
</span>
<span className='text-[var(--text-tertiary)] text-xs'>
-PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+each)
</span>
</div>
</Button>
@@ -9,6 +9,8 @@ import { useForm } from 'react-hook-form'
import { z } from 'zod'
import {
Button,
+Combobox,
+type ComboboxOption,
Input,
Label,
Modal,
@@ -18,6 +20,7 @@ import {
ModalHeader,
Textarea,
} from '@/components/emcn'
+import type { StrategyOptions } from '@/lib/chunkers/types'
import { cn } from '@/lib/core/utils/cn'
import { formatFileSize, validateKnowledgeBaseFile } from '@/lib/uploads/utils/file-utils'
import { ACCEPT_ATTRIBUTE } from '@/lib/uploads/utils/validation'
@@ -35,6 +38,20 @@ interface CreateBaseModalProps {
onOpenChange: (open: boolean) => void
}

+const STRATEGY_OPTIONS = [
+{ value: 'auto', label: 'Auto (detect from content)' },
+{ value: 'text', label: 'Text (word boundary splitting)' },
+{ value: 'recursive', label: 'Recursive (configurable separators)' },
+{ value: 'sentence', label: 'Sentence' },
+{ value: 'token', label: 'Token (fixed-size)' },
+{ value: 'regex', label: 'Regex (custom pattern)' },
+] as const
+
+const STRATEGY_COMBOBOX_OPTIONS: ComboboxOption[] = STRATEGY_OPTIONS.map((o) => ({
+label: o.label,
+value: o.value,
+}))
+
const FormSchema = z
.object({
name: z
@@ -43,25 +60,24 @@ const FormSchema = z
.max(100, 'Name must be less than 100 characters')
.refine((value) => value.trim().length > 0, 'Name cannot be empty'),
description: z.string().max(500, 'Description must be less than 500 characters').optional(),
-/** Minimum chunk size in characters */
minChunkSize: z
.number()
.min(1, 'Min chunk size must be at least 1 character')
.max(2000, 'Min chunk size must be less than 2000 characters'),
-/** Maximum chunk size in tokens (1 token ≈ 4 characters) */
maxChunkSize: z
.number()
.min(100, 'Max chunk size must be at least 100 tokens')
.max(4000, 'Max chunk size must be less than 4000 tokens'),
-/** Overlap between chunks in tokens */
overlapSize: z
.number()
.min(0, 'Overlap must be non-negative')
.max(500, 'Overlap must be less than 500 tokens'),
+strategy: z.enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token']).default('auto'),
+regexPattern: z.string().optional(),
+customSeparators: z.string().optional(),
})
.refine(
(data) => {
-// Convert maxChunkSize from tokens to characters for comparison (1 token ≈ 4 chars)
const maxChunkSizeInChars = data.maxChunkSize * 4
return data.minChunkSize < maxChunkSizeInChars
},
@@ -70,6 +86,27 @@ const FormSchema = z
path: ['minChunkSize'],
}
)
+.refine(
+(data) => {
+return data.overlapSize < data.maxChunkSize
+},
+{
+message: 'Overlap must be less than max chunk size',
+path: ['overlapSize'],
+}
+)
+.refine(
+(data) => {
+if (data.strategy === 'regex' && !data.regexPattern?.trim()) {
+return false
+}
+return true
+},
+{
+message: 'Regex pattern is required when using the regex strategy',
+path: ['regexPattern'],
+}
+)

type FormValues = z.infer<typeof FormSchema>

@@ -124,6 +161,7 @@ export const CreateBaseModal = memo(function CreateBaseModal({
handleSubmit,
reset,
watch,
+setValue,
formState: { errors },
} = useForm<FormValues>({
resolver: zodResolver(FormSchema),
@@ -133,11 +171,15 @@ export const CreateBaseModal = memo(function CreateBaseModal({
minChunkSize: 100,
maxChunkSize: 1024,
overlapSize: 200,
+strategy: 'auto',
+regexPattern: '',
+customSeparators: '',
},
mode: 'onSubmit',
})

const nameValue = watch('name')
+const strategyValue = watch('strategy')

useEffect(() => {
if (open) {
@@ -153,6 +195,9 @@ export const CreateBaseModal = memo(function CreateBaseModal({
minChunkSize: 100,
maxChunkSize: 1024,
overlapSize: 200,
+strategy: 'auto',
+regexPattern: '',
+customSeparators: '',
})
}
}, [open, reset])
@@ -255,6 +300,17 @@ export const CreateBaseModal = memo(function CreateBaseModal({
setSubmitStatus(null)

try {
+const strategyOptions: StrategyOptions | undefined =
+data.strategy === 'regex' && data.regexPattern
+? { pattern: data.regexPattern }
+: data.strategy === 'recursive' && data.customSeparators?.trim()
+? {
+separators: data.customSeparators
+.split(',')
+.map((s) => s.trim().replace(/\\n/g, '\n').replace(/\\t/g, '\t')),
+}
+: undefined
+
const newKnowledgeBase = await createKnowledgeBaseMutation.mutateAsync({
name: data.name,
description: data.description || undefined,
@@ -263,6 +319,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
maxSize: data.maxChunkSize,
minSize: data.minChunkSize,
overlap: data.overlapSize,
+...(data.strategy !== 'auto' && { strategy: data.strategy }),
+...(strategyOptions && { strategyOptions }),
},
})

@@ -312,7 +370,6 @@ export const CreateBaseModal = memo(function CreateBaseModal({
<div className='space-y-3'>
<div className='flex flex-col gap-2'>
<Label htmlFor='kb-name'>Name</Label>
-{/* Hidden decoy fields to prevent browser autofill */}
<input
type='text'
name='fakeusernameremembered'
@@ -403,6 +460,59 @@ export const CreateBaseModal = memo(function CreateBaseModal({
</p>
</div>

+<div className='flex flex-col gap-2'>
+<Label>Chunking Strategy</Label>
+<Combobox
+options={STRATEGY_COMBOBOX_OPTIONS}
+value={strategyValue}
+onChange={(value) => setValue('strategy', value as FormValues['strategy'])}
+dropdownWidth='trigger'
+align='start'
+/>
+<p className='text-[var(--text-muted)] text-xs'>
+Auto detects the best strategy based on file content type.
+</p>
+</div>
+
+{strategyValue === 'regex' && (
+<div className='flex flex-col gap-2'>
+<Label htmlFor='regexPattern'>Regex Pattern</Label>
+<Input
+id='regexPattern'
+placeholder='e.g. \\n\\n or (?<=\\})\\s*(?=\\{)'
+{...register('regexPattern')}
+className={cn(errors.regexPattern && 'border-[var(--text-error)]')}
+autoComplete='off'
+data-form-type='other'
+/>
+{errors.regexPattern && (
+<p className='text-[var(--text-error)] text-xs'>
+{errors.regexPattern.message}
+</p>
+)}
+<p className='text-[var(--text-muted)] text-xs'>
+Text will be split at each match of this regex pattern.
+</p>
+</div>
+)}
+
+{strategyValue === 'recursive' && (
+<div className='flex flex-col gap-2'>
+<Label htmlFor='customSeparators'>Custom Separators (optional)</Label>
+<Input
+id='customSeparators'
+placeholder='e.g. \n\n, \n, . , '
+{...register('customSeparators')}
+autoComplete='off'
+data-form-type='other'
+/>
+<p className='text-[var(--text-muted)] text-xs'>
+Comma-separated list of delimiters in priority order. Leave empty for default
+separators.
+</p>
+</div>
+)}
+
<div className='flex flex-col gap-2'>
<Label>Upload Documents</Label>
<Button
@@ -431,7 +541,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
{isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
</span>
<span className='text-[var(--text-tertiary)] text-xs'>
-PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+each)
</span>
</div>
</Button>
12 changes: 4 additions & 8 deletions apps/sim/hooks/queries/kb/knowledge.ts
@@ -1,6 +1,7 @@
import { createLogger } from '@sim/logger'
import { keepPreviousData, useMutation, useQuery, useQueryClient } from '@tanstack/react-query'
import { toast } from '@/components/emcn'
+import type { ChunkingStrategy, StrategyOptions } from '@/lib/chunkers/types'
import type {
ChunkData,
ChunksPagination,
@@ -338,10 +339,7 @@ export interface DocumentChunkSearchParams {
search: string
}

-/**
-* Fetches all chunks matching a search query by paginating through results.
-* This is used for search functionality where we need all matching chunks.
-*/
+/** Paginates through all matching chunks rather than returning a single page. */
export async function fetchAllDocumentChunks(
{ knowledgeBaseId, documentId, search }: DocumentChunkSearchParams,
signal?: AbortSignal
@@ -376,10 +374,6 @@ export const serializeSearchParams = (params: DocumentChunkSearchParams) =>
search: params.search,
})

-/**
-* Hook to search for chunks in a document.
-* Fetches all matching chunks and returns them for client-side pagination.
-*/
export function useDocumentChunkSearchQuery(
params: DocumentChunkSearchParams,
options?: {
@@ -707,6 +701,8 @@ export interface CreateKnowledgeBaseParams {
maxSize: number
minSize: number
overlap: number
+strategy?: ChunkingStrategy
+strategyOptions?: StrategyOptions
}
}
