Commits
73 commits
255640f
feat(knowledge): add Ollama embedding types
teedonk Mar 22, 2026
b043bc2
feat(knowledge): add per-KB dynamic pgvector tables
teedonk Mar 22, 2026
61f05a7
feat(knowledge): add Ollama embedding generation with retry and smart…
teedonk Mar 22, 2026
546dd7c
feat(knowledge): store ollamaBaseUrl in KB config
teedonk Mar 22, 2026
616761d
feat(chunkers): add embeddingModel to ChunkerOptions
teedonk Mar 22, 2026
133f326
feat(chunkers): add model-aware token estimation ratio
teedonk Mar 22, 2026
2693251
feat(knowledge): pass embeddingModel to all chunkers
teedonk Mar 22, 2026
18e7ac2
feat(knowledge): add Ollama chunk size and overlap capping
teedonk Mar 22, 2026
983efc3
feat(knowledge): add Ollama model validation and auto-detect dimension
teedonk Mar 22, 2026
53a1423
feat(knowledge): update KB detail API for Ollama support
teedonk Mar 22, 2026
0b5d218
feat(knowledge): add provider routing and cross-provider score normal…
teedonk Mar 22, 2026
606b70b
feat(knowledge): add Ollama provider selection UI
teedonk Mar 22, 2026
da36fcd
feat(knowledge): add Ollama params to create KB hook
teedonk Mar 22, 2026
b9e6ab7
test(knowledge): update KB detail tests for Ollama support
teedonk Mar 22, 2026
b1e92b8
test(knowledge): update search tests for provider routing
teedonk Mar 22, 2026
3698a04
fix(knowledge): separate validation from runtime model info to preven…
teedonk Mar 22, 2026
988158e
fix(knowledge): parameterize query vector and accept transaction handle
teedonk Mar 22, 2026
166a7f3
fix(knowledge): wrap Ollama delete+insert in transaction with status …
teedonk Mar 22, 2026
2f30934
fix(knowledge): clean up orphaned KB on table creation failure
teedonk Mar 22, 2026
f88e9f9
fix(knowledge): replace native select with project Select component
teedonk Mar 23, 2026
863e497
fix(knowledge): sort and trim Ollama results to topK
teedonk Mar 23, 2026
546061e
fix(knowledge): restrict Ollama base URL to localhost and private net…
teedonk Mar 23, 2026
00b3c7d
fix(knowledge): filter deleted documents from Ollama search and dedup…
teedonk Mar 23, 2026
075b005
fix(knowledge): use OLLAMA_URL env var and allow Docker hostnames in …
teedonk Mar 23, 2026
ea59193
fix(knowledge): align dynamic table SQL types with shared schema
teedonk Mar 23, 2026
ee3cc30
fix(knowledge): remove hardcoded OpenAI defaults from updateKnowledge…
teedonk Mar 23, 2026
e6d0a60
fix(knowledge): add enabled field and fix token ratio for Ollama embe…
teedonk Mar 23, 2026
0812f3b
fix(knowledge): remove immutable fields from update schema
teedonk Mar 23, 2026
fd8d2b3
fix(knowledge): strengthen SSRF validation for Ollama base URL
teedonk Mar 23, 2026
5c872c4
fix(knowledge): remove dead code and fix Record type in search route
teedonk Mar 23, 2026
4571299
fix(knowledge): add missing dynamic-tables mock in test
teedonk Mar 23, 2026
322dc4e
fix(knowledge): block IPv6-mapped IPv4 SSRF bypass and fix ::1 hostna…
teedonk Mar 23, 2026
ef84871
fix(knowledge): use KB embedding model for search and fix single-resu…
teedonk Mar 23, 2026
d308fe0
fix(knowledge): preserve ollamaBaseUrl when updating chunkingConfig
teedonk Mar 23, 2026
aa452f4
fix(knowledge): validate Ollama auto-detected dimension against bounds
teedonk Mar 23, 2026
8445d7e
merge: resolve conflicts with upstream staging
teedonk Mar 24, 2026
185007a
fix(knowledge): prevent SSRF bypass via hostname prefix matching on d…
teedonk Mar 24, 2026
456eaa4
resolve merge conflict in create-base-modal
teedonk Mar 27, 2026
1570b02
fix(knowledge): validate dimension before sql.raw interpolation
teedonk Mar 29, 2026
0e1dcf7
fix(knowledge): remove any casts in search route
teedonk Mar 29, 2026
e2b8189
Merge remote-tracking branch 'origin/staging' into feat/ollama-embedd…
teedonk Mar 29, 2026
ea3dd08
fix(knowledge): add missing document filters to Ollama search queries
teedonk Mar 29, 2026
24779a7
fix(knowledge): preserve Ollama embedding table on soft delete
teedonk Mar 29, 2026
547de40
fix(knowledge): wrap BETWEEN compound conditions in parentheses
teedonk Mar 29, 2026
7afb708
fix(knowledge): wrap BETWEEN compound conditions in parentheses
teedonk Mar 29, 2026
2cdb519
fix(knowledge): add retry to Ollama search embedding generation
teedonk Mar 29, 2026
507cc36
docs(knowledge): clarify soft-delete table retention rationale
teedonk Mar 29, 2026
50858d4
fix(knowledge): validate UUID format in kbTableName
teedonk Mar 29, 2026
5bdfe15
chore: merge staging into feat/ollama-embedding-support
teedonk Apr 2, 2026
f6d121e
fix(knowledge): hard-delete KB row on creation rollback
teedonk Apr 2, 2026
d210669
fix(knowledge): use hardDeleteKnowledgeBase in cleanup path
teedonk Apr 2, 2026
ff08fb0
fix(knowledge): align drizzle schema id type to uuid
teedonk Apr 2, 2026
5cebdea
fix(knowledge): clamp single-result distance instead of forcing zero
teedonk Apr 2, 2026
2552edc
fix: use global score normalization across all providers
teedonk Apr 2, 2026
c6fde92
fix: validate embedding count matches chunk count before insert
teedonk Apr 2, 2026
71b1769
fix: prevent NaN on Ollama dimension input
teedonk Apr 2, 2026
61d7936
fix: correct overlap chunk size unit in JSDoc comment
teedonk Apr 2, 2026
1991604
fix: only normalize scores when mixing OpenAI and Ollama providers
teedonk Apr 2, 2026
dbabedd
fix: batch Ollama embeddings by item count, not cumulative chars
teedonk Apr 2, 2026
0f42820
fix: resolve merge conflicts with staging
teedonk Apr 6, 2026
9e3d8ce
fix: rename kbModelName to avoid duplicate variable declaration
teedonk Apr 6, 2026
dec517d
fix: restrict .internal SSRF allowlist to host.docker.internal only
teedonk Apr 6, 2026
551dff9
fix: remove dead generateSearchEmbedding re-export from utils.ts
teedonk Apr 6, 2026
952de73
fix: model name prefix, SSRF re-validation, and chunk KB config lookup
teedonk Apr 6, 2026
97c0c71
fix: route chunk ops to per-KB table for Ollama and tighten SSRF allo…
teedonk Apr 6, 2026
578171c
fix: normalize cross-provider scores per-provider instead of globally
teedonk Apr 6, 2026
e89e3c2
fix: null-coalesce embeddingDimension and update stale error message
teedonk Apr 6, 2026
c527867
fix: validate resolved Ollama URL including env fallback against SSRF…
teedonk Apr 6, 2026
7bcee72
fix: guard Ollama SSRF check by provider and skip normalization for s…
teedonk Apr 6, 2026
5638d3a
fix: normalize IPv6 hostname brackets and validate resolved Ollama UR…
teedonk Apr 6, 2026
2c8cbb4
fix: use provider-specific token estimation for manual chunks
teedonk Apr 6, 2026
08e2b24
fix: add SSRF guard inside generateEmbeddings and generateSearchEmbed…
teedonk Apr 6, 2026
045824a
fix: null guard on KB lookup and Ollama-aware token estimation in Jso…
teedonk Apr 6, 2026
feat(knowledge): add Ollama chunk size and overlap capping
teedonk committed Mar 22, 2026
commit 18e7ac2ca9fe3d0cf6d5775da6993222c3f4e54f
121 changes: 89 additions & 32 deletions apps/sim/lib/knowledge/documents/service.ts
@@ -9,7 +9,12 @@ import { getStorageMethod, isRedisStorage } from '@/lib/core/storage'
import { processDocument } from '@/lib/knowledge/documents/document-processor'
import { DocumentProcessingQueue } from '@/lib/knowledge/documents/queue'
import type { DocumentSortField, SortOrder } from '@/lib/knowledge/documents/types'
import { generateEmbeddings } from '@/lib/knowledge/embeddings'
import {
deleteKBDocumentEmbeddings,
insertKBEmbeddings,
parseEmbeddingModel,
} from '@/lib/knowledge/dynamic-tables'
import { generateEmbeddings, getOllamaModelContextLength } from '@/lib/knowledge/embeddings'
import {
buildUndefinedTagsError,
parseBooleanValue,
@@ -410,6 +415,8 @@ export async function processDocumentAsync(
userId: knowledgeBase.userId,
workspaceId: knowledgeBase.workspaceId,
chunkingConfig: knowledgeBase.chunkingConfig,
embeddingModel: knowledgeBase.embeddingModel,
embeddingDimension: knowledgeBase.embeddingDimension,
})
.from(knowledgeBase)
.where(eq(knowledgeBase.id, knowledgeBaseId))
@@ -430,19 +437,60 @@ export async function processDocumentAsync(

logger.info(`[${documentId}] Status updated to 'processing', starting document processor`)

const kbConfig = kb[0].chunkingConfig as { maxSize: number; minSize: number; overlap: number }
const kbConfig = kb[0].chunkingConfig as {
maxSize: number
minSize: number
overlap: number
ollamaBaseUrl?: string
}
const { provider: embeddingProvider, modelName: embeddingModelName } = parseEmbeddingModel(
kb[0].embeddingModel
)

// For Ollama models, query the model's context length and cap chunk size accordingly.
// TextChunker uses ratio 3 for Ollama (1 estimated token = 3 chars), but the actual
// Ollama tokenizer may produce ~1 token per 1-2 chars (especially for PDF text with
// special characters). We use 30% of context length as the safe estimated-token limit
// so the resulting character count stays well within the model's actual token limit.
let effectiveChunkSize = processingOptions.chunkSize ?? kbConfig.maxSize
let effectiveOverlap = processingOptions.chunkOverlap ?? kbConfig.overlap
let ollamaContextLength: number | undefined
if (embeddingProvider === 'ollama') {
ollamaContextLength = await getOllamaModelContextLength(
embeddingModelName,
kbConfig.ollamaBaseUrl
)
const safeChunkSize = Math.floor(ollamaContextLength * 0.3)
if (effectiveChunkSize > safeChunkSize) {
logger.info(
`[${documentId}] Capping chunk size from ${effectiveChunkSize} to ${safeChunkSize} tokens ` +
`(Ollama model ${embeddingModelName} context length: ${ollamaContextLength})`
)
effectiveChunkSize = safeChunkSize
}
// Cap overlap to 20% of effective chunk size so overlap doesn't push chunks over context limit
const maxOverlap = Math.max(0, Math.floor(effectiveChunkSize * 0.2))
if (effectiveOverlap > maxOverlap) {
logger.info(
`[${documentId}] Capping chunk overlap from ${effectiveOverlap} to ${maxOverlap} tokens ` +
`(20% of effective chunk size ${effectiveChunkSize})`
)
effectiveOverlap = maxOverlap
}
}

await withTimeout(
(async () => {
const processed = await processDocument(
docData.fileUrl,
docData.filename,
docData.mimeType,
processingOptions.chunkSize ?? kbConfig.maxSize,
processingOptions.chunkOverlap ?? kbConfig.overlap,
effectiveChunkSize,
effectiveOverlap,
processingOptions.minCharactersPerChunk ?? kbConfig.minSize,
kb[0].userId,
kb[0].workspaceId
kb[0].workspaceId,
kb[0].embeddingModel
)

if (processed.chunks.length > LARGE_DOC_CONFIG.MAX_CHUNKS_PER_DOCUMENT) {
@@ -472,7 +520,13 @@ export async function processDocumentAsync(
const batchNum = Math.floor(i / batchSize) + 1

logger.info(`[${documentId}] Processing embedding batch ${batchNum}/${totalBatches}`)
const batchEmbeddings = await generateEmbeddings(batch, undefined, kb[0].workspaceId)
const batchEmbeddings = await generateEmbeddings(
batch,
kb[0].embeddingModel,
kb[0].workspaceId,
kbConfig.ollamaBaseUrl,
ollamaContextLength
)
for (const emb of batchEmbeddings) {
embeddings.push(emb)
}
@@ -523,7 +577,7 @@ export async function processDocumentAsync(
contentLength: chunk.text.length,
tokenCount: Math.ceil(chunk.text.length / 4),
embedding: embeddings[chunkIndex] || null,
embeddingModel: 'text-embedding-3-small',
embeddingModel: embeddingModelName,
cursor[bot] marked this conversation as resolved.
startOffset: chunk.metadata.startIndex,
endOffset: chunk.metadata.endIndex,
// Copy text tags from document (7 slots)
@@ -551,34 +605,37 @@ export async function processDocumentAsync(
updatedAt: now,
}))

await db.transaction(async (tx) => {
if (embeddingRecords.length > 0) {
await tx.delete(embedding).where(eq(embedding.documentId, documentId))

const insertBatchSize = LARGE_DOC_CONFIG.MAX_CHUNKS_PER_BATCH
const batches: (typeof embeddingRecords)[] = []
for (let i = 0; i < embeddingRecords.length; i += insertBatchSize) {
batches.push(embeddingRecords.slice(i, i + insertBatchSize))
}
if (embeddingRecords.length > 0) {
logger.info(`[${documentId}] Inserting ${embeddingRecords.length} embeddings`)

logger.info(`[${documentId}] Inserting ${embeddingRecords.length} embeddings`)
for (const batch of batches) {
await tx.insert(embedding).values(batch)
}
if (embeddingProvider === 'ollama') {
// Per-KB table: delete old chunks then bulk-insert new ones
await deleteKBDocumentEmbeddings(knowledgeBaseId, documentId)
await insertKBEmbeddings(knowledgeBaseId, embeddingRecords, kb[0].embeddingDimension)
P1 Non-atomic delete + insert risks data loss

The Ollama path deletes all existing embeddings for the document and then inserts new ones without a wrapping transaction. If insertKBEmbeddings fails mid-way (e.g., after a partial batch insert), the old embeddings have already been deleted but the new ones are only partially written — leaving the document with fewer embeddings or none at all, with no way to recover automatically.

The OpenAI path below correctly wraps both operations in a db.transaction. Consider wrapping the Ollama path similarly. Since deleteKBDocumentEmbeddings and insertKBEmbeddings both use db.execute / db.insert, they should participate in a transaction too.
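The atomicity concern can be illustrated with a toy in-memory table; everything here (the `Table` class, the snapshot-based `transaction` helper, the row shape) is a hypothetical stand-in for the project's drizzle setup, not its actual API:

```typescript
// Toy illustration of why delete + insert must be atomic: if the insert
// throws after the delete, a transactional wrapper restores the prior
// state, while a non-transactional path would lose the old rows.
type Row = { documentId: string; chunk: string }

class Table {
  rows: Row[] = []
  deleteByDoc(docId: string) {
    this.rows = this.rows.filter((r) => r.documentId !== docId)
  }
  insert(newRows: Row[]) {
    this.rows.push(...newRows)
  }
}

// A minimal "transaction": snapshot the state, roll back on error.
function transaction(table: Table, fn: (t: Table) => void): void {
  const snapshot = [...table.rows]
  try {
    fn(table)
  } catch (err) {
    table.rows = snapshot // rollback: old embeddings survive
    throw err
  }
}

const table = new Table()
table.insert([{ documentId: 'doc1', chunk: 'old chunk' }])

try {
  transaction(table, (t) => {
    t.deleteByDoc('doc1')
    throw new Error('insert failed mid-batch') // simulate partial failure
  })
} catch {
  // rolled back: the document still has its original embedding row
}
```

Without the `transaction` wrapper, the same failure would leave `table.rows` empty — exactly the data-loss window the comment describes.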

} else {
// Shared embedding table: delete + insert inside a transaction
await db.transaction(async (tx) => {
await tx.delete(embedding).where(eq(embedding.documentId, documentId))

const insertBatchSize = LARGE_DOC_CONFIG.MAX_CHUNKS_PER_BATCH
for (let i = 0; i < embeddingRecords.length; i += insertBatchSize) {
await tx.insert(embedding).values(embeddingRecords.slice(i, i + insertBatchSize))
}
})
}
}

await tx
.update(document)
.set({
chunkCount: processed.metadata.chunkCount,
tokenCount: processed.metadata.tokenCount,
characterCount: processed.metadata.characterCount,
processingStatus: 'completed',
processingCompletedAt: now,
processingError: null,
})
.where(eq(document.id, documentId))
})
await db
.update(document)
.set({
chunkCount: processed.metadata.chunkCount,
tokenCount: processed.metadata.tokenCount,
characterCount: processed.metadata.characterCount,
processingStatus: 'completed',
processingCompletedAt: now,
processingError: null,
})
.where(eq(document.id, documentId))
P1 Document status update moved outside the transaction (regression)

In the original code, both the embedding inserts and the processingStatus: 'completed' update were inside a single db.transaction, so they were atomic. This PR extracts the status update to a separate await db.update(document)... call that runs after the transaction.

For the OpenAI path, if the transaction (embedding inserts) succeeds but the subsequent status update fails, the document stays permanently in 'processing' state — even though its embeddings are fully in place. There's no retry that would recover this (the catch block at line 656 sets status to 'failed', which is also incorrect since the embeddings are already there).

This is a regression for the OpenAI path that was introduced by the refactor to support the Ollama code path. The processingStatus update should remain inside the db.transaction for the OpenAI path.
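The failure mode being flagged can be reproduced with a small toy state machine (the `Doc` shape and function here are illustrative, not the service's real types):

```typescript
// Toy reproduction of the regression: when the status write happens
// outside the transaction that commits the embeddings, a crash between
// the two leaves embeddings committed but the status never completed.
type Doc = { status: 'processing' | 'completed' | 'failed'; embeddings: number }

function processOutsideTx(doc: Doc, statusWriteFails: boolean): void {
  // The transaction commits the embeddings and succeeds...
  doc.embeddings = 42
  // ...then a separate, non-transactional write updates the status.
  if (statusWriteFails) throw new Error('status update failed')
  doc.status = 'completed'
}

const doc: Doc = { status: 'processing', embeddings: 0 }
try {
  processOutsideTx(doc, true)
} catch {
  // the caller's catch block marks the document failed,
  // even though its embeddings are fully in place
  doc.status = 'failed'
}
// inconsistent end state: status 'failed' with 42 committed embeddings
```

Keeping the status update inside the same transaction makes the two writes succeed or fail together, which is what the pre-refactor code did.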

})(),
TIMEOUTS.OVERALL_PROCESSING,
'Document processing'
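The capping arithmetic this commit introduces (30% of the model's context length for chunk size, then 20% of the effective chunk size for overlap) can be sketched standalone; the function name and the numbers below are illustrative, not from the diff:

```typescript
// Standalone sketch of the chunk-size/overlap capping from this commit.
// Ratios mirror the diff: 0.3 of context length, 0.2 of effective chunk size.
function capChunkParams(
  chunkSize: number,
  overlap: number,
  contextLength: number
): { chunkSize: number; overlap: number } {
  // Cap chunk size to 30% of the model's context length.
  const safeChunkSize = Math.floor(contextLength * 0.3)
  const effectiveChunkSize = Math.min(chunkSize, safeChunkSize)
  // Cap overlap to 20% of the effective chunk size so overlap
  // cannot push a chunk past the context limit.
  const maxOverlap = Math.max(0, Math.floor(effectiveChunkSize * 0.2))
  const effectiveOverlap = Math.min(overlap, maxOverlap)
  return { chunkSize: effectiveChunkSize, overlap: effectiveOverlap }
}

// e.g. an 8192-token context caps a requested 4096-token chunk to 2457
// estimated tokens, and a 1024-token overlap down to 491.
const capped = capChunkParams(4096, 1024, 8192)
```

Note the ordering matters: the overlap cap is computed from the already-capped chunk size, so shrinking the chunk also shrinks the maximum overlap.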