feat(knowledge): add token, sentence, recursive, and regex chunkers (#…

…4102)

* feat(knowledge): add token, sentence, recursive, and regex chunkers

* fix(chunkers): standardize token estimation and use emcn dropdown

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector

* fix(chunkers): address research audit findings

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar

* fix(chunkers): fix remaining audit issues across all chunkers

- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space

* chore(chunkers): lint formatting

* updated styling

* fix(chunkers): audit fixes and comprehensive tests

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior

* chore(chunkers): remove unnecessary comments and dead code

Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.

* fix(chunkers): address PR review comments

- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"

* fix(chunkers): use consistent overlap pattern in regex fallback

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.

* fix(chunkers): prevent content loss in word boundary splitting

When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).

* fix(chunkers): restore structured data token ratio and overlap joiner

- Restore /3 token estimation for StructuredDataChunker (structured data is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker behavior

* lint

* fix(chunkers): fall back to character-level overlap in sentence chunker

When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.

* fix(chunkers): fix log message and add missing month abbreviations

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

* lint

* fix(chunkers): restore structured data detection threshold to > 2

avgCount >= 1 was too permissive: prose with consistent comma usage would be misclassified as CSV. Restore original > 2 threshold while keeping the improved proportional tolerance.

* fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker

* fix(chunkers): restore separator-as-joiner pattern in splitRecursively

Separator was unconditionally prepended to parts after the first, leaving leading punctuation on chunks after a boundary reset.

* feat(knowledge): add JSONL file support for knowledge base uploads

Parses JSON Lines files by splitting on newlines and converting to a JSON array, which then flows through the existing JsonYamlChunker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
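
The TokenChunker and splitAtWordBoundaries entries in the commit message describe a sliding window whose chunks stay within the chunk size, with a guard against a zero step and no content skipped when the end snaps back to a word boundary. The chunker sources are not part of this diff, so the following is only a minimal sketch of that idea: the function name `slidingWindowChunks`, its parameters, and the 4 characters-per-token estimate are assumptions for illustration, not the project's implementation.

```typescript
// Hypothetical sketch of sliding-window chunking; not the project's TokenChunker.
// Sizes are given in tokens and converted with an assumed chars-per-token ratio.
function slidingWindowChunks(
  text: string,
  chunkSizeTokens: number,
  overlapTokens: number,
  charsPerToken = 4
): string[] {
  const windowChars = Math.max(1, chunkSizeTokens * charsPerToken)
  // Guard: a step of 0 (overlap >= chunk size) would never advance.
  const stepChars = Math.max(1, (chunkSizeTokens - overlapTokens) * charsPerToken)

  const chunks: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + windowChars, text.length)
    if (end < text.length) {
      // Snap the end back to the last space so a word is not cut in half.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const piece = text.slice(pos, end).trim()
    if (piece.length > 0) chunks.push(piece)
    if (end >= text.length) break
    // Advance by the step, but never past the snapped end, so no text is skipped.
    pos = Math.min(pos + stepChars, end)
  }
  return chunks
}

// Window of 4 characters, step of 2 characters (2 tokens, 1 token overlap, 2 chars/token).
const slidingDemo = slidingWindowChunks('abcdefghij', 2, 1, 2)
console.log(slidingDemo) // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Because the window advances by a fixed step, a chunk can start mid-word in this sketch; only the end of each window is snapped to a word boundary.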
simstudioai · waleedlatif1 · Apr 11, 2026 · Apr 10, 2026
commit 1acafe87635491818648161e2a33fc6000aa9538
diff --git a/apps/sim/app/api/knowledge/route.ts b/apps/sim/app/api/knowledge/route.ts
@@ -15,14 +15,6 @@ import { captureServerEvent } from '@/lib/posthog/server'
 
 const logger = createLogger('KnowledgeBaseAPI')
 
-/**
- * Schema for creating a knowledge base
- *
- * Chunking config units:
- * - maxSize: tokens (1 token ≈ 4 characters)
- * - minSize: characters
- * - overlap: tokens (1 token ≈ 4 characters)
- */
 const CreateKnowledgeBaseSchema = z.object({
   name: z.string().min(1, 'Name is required'),
   description: z.string().optional(),
@@ -31,12 +23,20 @@ const CreateKnowledgeBaseSchema = z.object({
   embeddingDimension: z.literal(1536).default(1536),
   chunkingConfig: z
     .object({
-      /** Maximum chunk size in tokens (1 token ≈ 4 characters) */
       maxSize: z.number().min(100).max(4000).default(1024),
-      /** Minimum chunk size in characters */
       minSize: z.number().min(1).max(2000).default(100),
-      /** Overlap between chunks in tokens (1 token ≈ 4 characters) */
       overlap: z.number().min(0).max(500).default(200),
+      strategy: z
+        .enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token'])
+        .default('auto')
+        .optional(),
+      strategyOptions: z
+        .object({
+          pattern: z.string().max(500).optional(),
+          separators: z.array(z.string()).optional(),
+          recipe: z.enum(['plain', 'markdown', 'code']).optional(),
+        })
+        .optional(),
     })
     .default({
       maxSize: 1024,
@@ -45,13 +45,31 @@ const CreateKnowledgeBaseSchema = z.object({
     })
     .refine(
       (data) => {
-        // Convert maxSize from tokens to characters for comparison (1 token ≈ 4 chars)
         const maxSizeInChars = data.maxSize * 4
         return data.minSize < maxSizeInChars
       },
       {
         message: 'Min chunk size (characters) must be less than max chunk size (tokens × 4)',
       }
+    )
+    .refine(
+      (data) => {
+        return data.overlap < data.maxSize
+      },
+      {
+        message: 'Overlap must be less than max chunk size',
+      }
+    )
+    .refine(
+      (data) => {
+        if (data.strategy === 'regex' && !data.strategyOptions?.pattern) {
+          return false
+        }
+        return true
+      },
+      {
+        message: 'Regex pattern is required when using the regex chunking strategy',
+      }
     ),
 })
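
The three `.refine` checks in the hunk above can be read as one cross-field validation pass. The sketch below restates them as a plain function without Zod so the rules are easy to test in isolation; `ChunkingConfigInput` and `validateChunkingConfig` are hypothetical names introduced here, and the error messages are copied from the schema.

```typescript
// Hypothetical restatement of the schema's cross-field rules; not the route's actual code.
interface ChunkingConfigInput {
  maxSize: number // tokens
  minSize: number // characters
  overlap: number // tokens
  strategy?: 'auto' | 'text' | 'regex' | 'recursive' | 'sentence' | 'token'
  strategyOptions?: { pattern?: string; separators?: string[] }
}

function validateChunkingConfig(config: ChunkingConfigInput): string[] {
  const errors: string[] = []
  // minSize is in characters, maxSize in tokens (the schema multiplies by 4 to compare).
  if (!(config.minSize < config.maxSize * 4)) {
    errors.push('Min chunk size (characters) must be less than max chunk size (tokens × 4)')
  }
  // Overlap and maxSize are both in tokens, so they compare directly.
  if (!(config.overlap < config.maxSize)) {
    errors.push('Overlap must be less than max chunk size')
  }
  // The regex strategy is unusable without a pattern.
  if (config.strategy === 'regex' && !config.strategyOptions?.pattern) {
    errors.push('Regex pattern is required when using the regex chunking strategy')
  }
  return errors
}

// The schema defaults pass all three checks.
console.log(validateChunkingConfig({ maxSize: 1024, minSize: 100, overlap: 200 })) // []
```

Unlike chained refinements, this version collects every failing rule in one pass.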
 

diff --git a/...space/[workspaceId]/knowledge/[id]/components/add-documents-modal/add-documents-modal.tsx b/...space/[workspaceId]/knowledge/[id]/components/add-documents-modal/add-documents-modal.tsx
@@ -263,7 +263,8 @@ export function AddDocumentsModal({
                       {isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
                     </span>
                     <span className='text-[var(--text-tertiary)] text-xs'>
-                      PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+                      PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+                      each)
                     </span>
                   </div>
                 </Button>
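
The JSONL entry added to the upload hint above corresponds to the commit message's description: the file is split on newlines and converted to a JSON array before it reaches JsonYamlChunker. The parser itself is not shown in this diff, so the following is a rough sketch of that conversion under stated assumptions: the helper name `jsonlToArray`, the skipping of blank lines, and the error message are all illustrative choices.

```typescript
// Hypothetical JSONL-to-array conversion; the project's parser is not part of this diff.
function jsonlToArray(content: string): unknown[] {
  const records: unknown[] = []
  content.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.trim()
    if (line.length === 0) return // assumption: blank lines are skipped
    try {
      records.push(JSON.parse(line))
    } catch {
      // Report the 1-based line number of the offending record.
      throw new Error(`Invalid JSON on line ${index + 1}`)
    }
  })
  return records
}

const jsonlSample = '{"id":1,"text":"alpha"}\n\n{"id":2,"text":"beta"}\n'
const jsonArrayText = JSON.stringify(jsonlToArray(jsonlSample))
console.log(jsonArrayText) // [{"id":1,"text":"alpha"},{"id":2,"text":"beta"}]
```

The resulting array can then be serialized and handed to whatever already handles ordinary JSON input.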

diff --git a/.../app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx b/.../app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx
@@ -9,6 +9,8 @@ import { useForm } from 'react-hook-form'
 import { z } from 'zod'
 import {
   Button,
+  Combobox,
+  type ComboboxOption,
   Input,
   Label,
   Modal,
@@ -18,6 +20,7 @@ import {
   ModalHeader,
   Textarea,
 } from '@/components/emcn'
+import type { StrategyOptions } from '@/lib/chunkers/types'
 import { cn } from '@/lib/core/utils/cn'
 import { formatFileSize, validateKnowledgeBaseFile } from '@/lib/uploads/utils/file-utils'
 import { ACCEPT_ATTRIBUTE } from '@/lib/uploads/utils/validation'
@@ -35,6 +38,20 @@ interface CreateBaseModalProps {
   onOpenChange: (open: boolean) => void
 }
 
+const STRATEGY_OPTIONS = [
+  { value: 'auto', label: 'Auto (detect from content)' },
+  { value: 'text', label: 'Text (word boundary splitting)' },
+  { value: 'recursive', label: 'Recursive (configurable separators)' },
+  { value: 'sentence', label: 'Sentence' },
+  { value: 'token', label: 'Token (fixed-size)' },
+  { value: 'regex', label: 'Regex (custom pattern)' },
+] as const
+
+const STRATEGY_COMBOBOX_OPTIONS: ComboboxOption[] = STRATEGY_OPTIONS.map((o) => ({
+  label: o.label,
+  value: o.value,
+}))
+
 const FormSchema = z
   .object({
     name: z
@@ -43,25 +60,24 @@ const FormSchema = z
       .max(100, 'Name must be less than 100 characters')
       .refine((value) => value.trim().length > 0, 'Name cannot be empty'),
     description: z.string().max(500, 'Description must be less than 500 characters').optional(),
-    /** Minimum chunk size in characters */
     minChunkSize: z
       .number()
       .min(1, 'Min chunk size must be at least 1 character')
       .max(2000, 'Min chunk size must be less than 2000 characters'),
-    /** Maximum chunk size in tokens (1 token ≈ 4 characters) */
     maxChunkSize: z
       .number()
       .min(100, 'Max chunk size must be at least 100 tokens')
       .max(4000, 'Max chunk size must be less than 4000 tokens'),
-    /** Overlap between chunks in tokens */
     overlapSize: z
       .number()
       .min(0, 'Overlap must be non-negative')
       .max(500, 'Overlap must be less than 500 tokens'),
+    strategy: z.enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token']).default('auto'),
+    regexPattern: z.string().optional(),
+    customSeparators: z.string().optional(),
   })
   .refine(
     (data) => {
-      // Convert maxChunkSize from tokens to characters for comparison (1 token ≈ 4 chars)
       const maxChunkSizeInChars = data.maxChunkSize * 4
       return data.minChunkSize < maxChunkSizeInChars
     },
@@ -70,6 +86,27 @@ const FormSchema = z
       path: ['minChunkSize'],
     }
   )
+  .refine(
+    (data) => {
+      return data.overlapSize < data.maxChunkSize
+    },
+    {
+      message: 'Overlap must be less than max chunk size',
+      path: ['overlapSize'],
+    }
+  )
+  .refine(
+    (data) => {
+      if (data.strategy === 'regex' && !data.regexPattern?.trim()) {
+        return false
+      }
+      return true
+    },
+    {
+      message: 'Regex pattern is required when using the regex strategy',
+      path: ['regexPattern'],
+    }
+  )
 
 type FormValues = z.infer<typeof FormSchema>
 
@@ -124,6 +161,7 @@ export const CreateBaseModal = memo(function CreateBaseModal({
     handleSubmit,
     reset,
     watch,
+    setValue,
     formState: { errors },
   } = useForm<FormValues>({
     resolver: zodResolver(FormSchema),
@@ -133,11 +171,15 @@ export const CreateBaseModal = memo(function CreateBaseModal({
       minChunkSize: 100,
       maxChunkSize: 1024,
       overlapSize: 200,
+      strategy: 'auto',
+      regexPattern: '',
+      customSeparators: '',
     },
     mode: 'onSubmit',
   })
 
   const nameValue = watch('name')
+  const strategyValue = watch('strategy')
 
   useEffect(() => {
     if (open) {
@@ -153,6 +195,9 @@ export const CreateBaseModal = memo(function CreateBaseModal({
         minChunkSize: 100,
         maxChunkSize: 1024,
         overlapSize: 200,
+        strategy: 'auto',
+        regexPattern: '',
+        customSeparators: '',
       })
     }
   }, [open, reset])
@@ -255,6 +300,17 @@ export const CreateBaseModal = memo(function CreateBaseModal({
     setSubmitStatus(null)
 
     try {
+      const strategyOptions: StrategyOptions | undefined =
+        data.strategy === 'regex' && data.regexPattern
+          ? { pattern: data.regexPattern }
+          : data.strategy === 'recursive' && data.customSeparators?.trim()
+            ? {
+                separators: data.customSeparators
+                  .split(',')
+                  .map((s) => s.trim().replace(/\\n/g, '\n').replace(/\\t/g, '\t')),
+              }
+            : undefined
+
       const newKnowledgeBase = await createKnowledgeBaseMutation.mutateAsync({
         name: data.name,
         description: data.description || undefined,
@@ -263,6 +319,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
           maxSize: data.maxChunkSize,
           minSize: data.minChunkSize,
           overlap: data.overlapSize,
+          ...(data.strategy !== 'auto' && { strategy: data.strategy }),
+          ...(strategyOptions && { strategyOptions }),
         },
       })
 
@@ -312,7 +370,6 @@ export const CreateBaseModal = memo(function CreateBaseModal({
               <div className='space-y-3'>
                 <div className='flex flex-col gap-2'>
                   <Label htmlFor='kb-name'>Name</Label>
-                  {/* Hidden decoy fields to prevent browser autofill */}
                   <input
                     type='text'
                     name='fakeusernameremembered'
@@ -403,6 +460,59 @@ export const CreateBaseModal = memo(function CreateBaseModal({
                   </p>
                 </div>
 
+                <div className='flex flex-col gap-2'>
+                  <Label>Chunking Strategy</Label>
+                  <Combobox
+                    options={STRATEGY_COMBOBOX_OPTIONS}
+                    value={strategyValue}
+                    onChange={(value) => setValue('strategy', value as FormValues['strategy'])}
+                    dropdownWidth='trigger'
+                    align='start'
+                  />
+                  <p className='text-[var(--text-muted)] text-xs'>
+                    Auto detects the best strategy based on file content type.
+                  </p>
+                </div>
+
+                {strategyValue === 'regex' && (
+                  <div className='flex flex-col gap-2'>
+                    <Label htmlFor='regexPattern'>Regex Pattern</Label>
+                    <Input
+                      id='regexPattern'
+                      placeholder='e.g. \\n\\n or (?<=\\})\\s*(?=\\{)'
+                      {...register('regexPattern')}
+                      className={cn(errors.regexPattern && 'border-[var(--text-error)]')}
+                      autoComplete='off'
+                      data-form-type='other'
+                    />
+                    {errors.regexPattern && (
+                      <p className='text-[var(--text-error)] text-xs'>
+                        {errors.regexPattern.message}
+                      </p>
+                    )}
+                    <p className='text-[var(--text-muted)] text-xs'>
+                      Text will be split at each match of this regex pattern.
+                    </p>
+                  </div>
+                )}
+
+                {strategyValue === 'recursive' && (
+                  <div className='flex flex-col gap-2'>
+                    <Label htmlFor='customSeparators'>Custom Separators (optional)</Label>
+                    <Input
+                      id='customSeparators'
+                      placeholder='e.g. \n\n, \n, . ,  '
+                      {...register('customSeparators')}
+                      autoComplete='off'
+                      data-form-type='other'
+                    />
+                    <p className='text-[var(--text-muted)] text-xs'>
+                      Comma-separated list of delimiters in priority order. Leave empty for default
+                      separators.
+                    </p>
+                  </div>
+                )}
+
                 <div className='flex flex-col gap-2'>
                   <Label>Upload Documents</Label>
                   <Button
@@ -431,7 +541,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
                         {isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
                       </span>
                       <span className='text-[var(--text-tertiary)] text-xs'>
-                        PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+                        PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+                        each)
                       </span>
                     </div>
                   </Button>
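
The submit handler above turns the Custom Separators field into `strategyOptions.separators` by splitting on commas, trimming each entry, and unescaping `\n` and `\t`. The standalone sketch below copies that expression into a hypothetical helper, `parseSeparators`, to show what the placeholder text would produce.

```typescript
// Hypothetical helper wrapping the same expression used in the submit handler.
function parseSeparators(input: string): string[] {
  return input
    .split(',')
    .map((s) => s.trim().replace(/\\n/g, '\n').replace(/\\t/g, '\t'))
}

// What the user types: backslash-n sequences, a period with spaces, and a bare space.
const typedSeparators = '\\n\\n, \\n, . ,  '
const parsedSeparators = parseSeparators(typedSeparators)
console.log(JSON.stringify(parsedSeparators)) // ["\n\n","\n",".",""]
```

Because each entry is trimmed before it is unescaped, a whitespace-only separator such as a bare space comes out as an empty string, and a separator that itself contains a comma cannot be expressed in this field.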

diff --git a/apps/sim/hooks/queries/kb/knowledge.ts b/apps/sim/hooks/queries/kb/knowledge.ts
@@ -1,6 +1,7 @@
 import { createLogger } from '@sim/logger'
 import { keepPreviousData, useMutation, useQuery, useQueryClient } from '@tanstack/react-query'
 import { toast } from '@/components/emcn'
+import type { ChunkingStrategy, StrategyOptions } from '@/lib/chunkers/types'
 import type {
   ChunkData,
   ChunksPagination,
@@ -338,10 +339,7 @@ export interface DocumentChunkSearchParams {
   search: string
 }
 
-/**
- * Fetches all chunks matching a search query by paginating through results.
- * This is used for search functionality where we need all matching chunks.
- */
+/** Paginates through all matching chunks rather than returning a single page. */
 export async function fetchAllDocumentChunks(
   { knowledgeBaseId, documentId, search }: DocumentChunkSearchParams,
   signal?: AbortSignal
@@ -376,10 +374,6 @@ export const serializeSearchParams = (params: DocumentChunkSearchParams) =>
     search: params.search,
   })
 
-/**
- * Hook to search for chunks in a document.
- * Fetches all matching chunks and returns them for client-side pagination.
- */
 export function useDocumentChunkSearchQuery(
   params: DocumentChunkSearchParams,
   options?: {
@@ -707,6 +701,8 @@ export interface CreateKnowledgeBaseParams {
     maxSize: number
     minSize: number
     overlap: number
+    strategy?: ChunkingStrategy
+    strategyOptions?: StrategyOptions
   }
 }