improvement(execution, connectors): offload large function inputs, increase connector limits + better error propagation #5089
…onnector size limits

Addresses a class of 10 MB limit failures:

- executor/variables: offload over-budget function block-output context values to durable large-value refs (lazy `sim.values.read`) so JS function blocks can merge medium files without exceeding the 10 MB inter-block request-body cap.
- connectors: stream downloads via `readBodyWithLimit` (memory-safe), and surface oversized files as visible `failed` KB documents instead of silently dropping them: at listing time for github/s3/dropbox/onedrive/sharepoint, at fetch time for gitlab/azure/google-drive via a shared `ConnectorFileTooLargeError`. Raise the per-file cap from a hardcoded 10 MB to the canonical 100 MB KB document limit (`CONNECTOR_MAX_FILE_BYTES`), except Google Drive's export path (Google's hard 10 MB export-API limit).
- sync-engine: `classifyExternalDoc` + bulk `skipDocuments` (failed rows with a reason, excluded from retry), byte-bounded batch concurrency to cap peak worker memory at the raised cap, and a `metadata.fileSize ?? size` fallback.
# Conflicts: # apps/sim/connectors/utils.test.ts
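The memory-safe download described in the commit message can be illustrated with a minimal sketch. This is not the PR's `readBodyWithLimit`; `LimitedBuffer`, `BodyTooLargeError`, and `readWithLimit` are hypothetical names, and the sketch assumes a runtime with the standard `Response` and `ReadableStream` APIs. The idea is to count bytes as chunks arrive and abort the stream as soon as the cap is passed, rather than buffering the whole body first.

```typescript
// Hypothetical sketch; names are illustrative, not the PR's implementation.
class BodyTooLargeError extends Error {
  limitBytes: number
  constructor(limitBytes: number) {
    super(`Response body exceeds ${limitBytes} bytes`)
    this.name = 'BodyTooLargeError'
    this.limitBytes = limitBytes
  }
}

/** Accumulates chunks and throws before the running total would pass the cap. */
class LimitedBuffer {
  private chunks: Uint8Array[] = []
  private total = 0
  private limitBytes: number

  constructor(limitBytes: number) {
    this.limitBytes = limitBytes
  }

  push(chunk: Uint8Array): void {
    if (this.total + chunk.byteLength > this.limitBytes) {
      throw new BodyTooLargeError(this.limitBytes)
    }
    this.total += chunk.byteLength
    this.chunks.push(chunk)
  }

  concat(): Uint8Array {
    const out = new Uint8Array(this.total)
    let offset = 0
    for (const chunk of this.chunks) {
      out.set(chunk, offset)
      offset += chunk.byteLength
    }
    return out
  }
}

/** Reads a fetch Response chunk by chunk, cancelling the stream past the cap. */
async function readWithLimit(response: Response, limitBytes: number): Promise<Uint8Array> {
  if (!response.body) return new Uint8Array(0)
  const reader = response.body.getReader()
  const buffer = new LimitedBuffer(limitBytes)
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) break
      buffer.push(value)
    }
  } catch (error) {
    // Stop pulling more bytes; ignore errors from the cancel itself.
    await reader.cancel().catch(() => {})
    throw error
  }
  return buffer.concat()
}
```

With this shape, peak memory for a single download is bounded by the cap plus one chunk, regardless of what the server sends or what `Content-Length` claims.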
PR Summary
Medium Risk
Overview
Function blocks: When resolved block-output values would exceed a ~6 MB combined inline budget (data + display in the function request), the resolver offloads them to durable large-value refs and rewrites JS code to load via […]
KB connectors: Introduces shared […]
Sync engine: Adds […]
Documentation in the memory-load-check skill is extended with KB connector size-handling guidance; tests cover utils, sync classification/chunking, and function-context offload.
Reviewed by Cursor Bugbot for commit 26cf668.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
- let documents = supportedFiles.map(fileToStub)
+ let documents = candidateFiles.map((entry) =>
+   stubOrSkipBySize(fileToStub(entry), entry.size, MAX_FILE_SIZE)
+ )
maxFiles counts oversized skips
Medium Severity
Oversized files are now kept in connector listings as skipped stubs, but the same maxFiles / maxObjects counters still treat them like normal listed documents. A cap can be exhausted by failed skip rows before indexable files are ever listed, which regresses sync coverage compared to when oversized paths were filtered out before counting.
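One possible way to address this, sketched under assumptions: count only indexable stubs against the cap while still passing skipped stubs through so they remain visible. `ListedStub` and `takeWithinCap` are hypothetical names, and the real connectors paginate rather than filter a complete array, so this only illustrates the counting rule.

```typescript
// Hypothetical sketch; not the connectors' actual listing code.
interface ListedStub {
  id: string
  /** Set when the file was too large to index; such stubs become failed rows. */
  skippedReason?: string
}

/**
 * Keeps stubs in listing order until `maxFiles` indexable stubs have been kept.
 * Skipped stubs seen before that point are kept but never consume the cap.
 */
function takeWithinCap(stubs: ListedStub[], maxFiles: number): ListedStub[] {
  const kept: ListedStub[] = []
  let indexable = 0
  for (const stub of stubs) {
    if (stub.skippedReason) {
      kept.push(stub)
      continue
    }
    if (indexable >= maxFiles) break
    kept.push(stub)
    indexable++
  }
  return kept
}
```

A listing made up mostly of oversized files could then emit many skipped rows, so a separate bound on skipped stubs may also be worth considering.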
logger.warn('Failed to offload oversized function context value; keeping inline', {
  error: toError(error).message,
})
return null
Offload failure keeps oversized inline
Medium Severity
When a function block context value exceeds the inline budget, maybeOffloadInlineFunctionContextValue stores it via storeLargeValue. If that store fails, the code logs and returns null, so resolution falls back to inlining the full value anyway—recreating the ~10 MB request-body overflow this change is meant to prevent.
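A minimal sketch of one alternative policy: fail fast when the offload store is unavailable instead of falling back to inlining. Everything here is an assumption for illustration. `InlineBudget`, `resolveContextValue`, and the synchronous `store` callback are hypothetical and do not reflect the real `storeLargeValue` signature, which is presumably asynchronous.

```typescript
// Hypothetical sketch; names and the sync `store` callback are assumptions.
type Resolved =
  | { kind: 'inline'; value: string }
  | { kind: 'ref'; key: string }

class InlineBudget {
  remaining: number

  constructor(totalBytes: number) {
    this.remaining = totalBytes
  }

  /** Charges `bytes` against the budget; returns false if it does not fit. */
  tryCharge(bytes: number): boolean {
    if (bytes > this.remaining) return false
    this.remaining -= bytes
    return true
  }
}

function utf8ByteLength(value: string): number {
  return new TextEncoder().encode(value).byteLength
}

/**
 * Inlines the value if it fits the remaining budget; otherwise offloads it.
 * A store failure is surfaced as an error rather than silently inlining.
 */
function resolveContextValue(
  value: string,
  budget: InlineBudget,
  store: (value: string) => string
): Resolved {
  if (budget.tryCharge(utf8ByteLength(value))) {
    return { kind: 'inline', value }
  }
  try {
    return { kind: 'ref', key: store(value) }
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error)
    throw new Error(`Cannot offload oversized context value: ${reason}`)
  }
}
```

Failing fast trades availability for a clear error at the block that caused it. Whether that is preferable to a best-effort inline fallback depends on how often the store is expected to be unavailable.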
Greptile Summary
This PR fixes two classes of 10 MB limit failures: function blocks with large block-output context values now offload oversized refs to durable storage and read them lazily in the sandbox, while KB connector downloads are byte-capped with
Confidence Score: 4/5
Safe to merge with awareness of two edge-case gaps that do not affect the happy path.
The connector streaming and skip-visibility changes are well-tested and correct across all nine connectors. The sync engine isNotNull(storageKey) retry guard is safe because addDocument always sets a storage key before writing the DB row. Two gaps exist on error paths: the offload catch block in the resolver does not charge the failed value's footprint to the budget, so a storage outage leaves all large values inline; and a deferred update op that becomes too large at hydration time silently skips without incrementing any result counter.
apps/sim/executor/variables/resolver.ts (offload catch path) and apps/sim/lib/knowledge/connectors/sync-engine.ts (update-to-skip counter omission).
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Connector listDocuments] --> B{size > CONNECTOR_MAX_FILE_BYTES?}
B -- yes --> C[stubOrSkipBySize - skippedReason set]
B -- no --> D[Normal stub - contentDeferred=true]
C --> E[classifyExternalDoc]
D --> E
E -- skip + new --> F[skipDocuments bulk insert - failed row storageKey=null]
E -- skip + existing --> G[unchanged - keep last-known-good]
E -- add/update --> H[chunkOpsByByteBudget 64MB + SYNC_BATCH_SIZE=5]
H --> I[deferredOps - getDocument]
I --> J{fullDoc.skippedReason?}
J -- yes + add op --> K[push to skipExtDocs - skipDocuments called after]
J -- yes + update op --> L[return null - no counter increment]
J -- no --> M[addDocument / updateDocument - storageKey always set]
M --> N[Retry sweep: isNotNull storageKey - excludes failed rows]
subgraph Function Block Offload
P[resolveTemplateCode] --> Q{inline footprint <= budget?}
Q -- yes --> R[store inline - reduce budget]
Q -- no --> S[storeLargeValue]
S -- success --> T[LargeValueRef - sim.values.read in sandbox]
S -- failure --> U[fallback inline - budget NOT updated]
end
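The `chunkOpsByByteBudget 64MB + SYNC_BATCH_SIZE=5` step in the flowchart can be read as greedy batching under two limits. The sketch below is an assumption about the general technique, not the sync engine's code: `SizedOp` and `chunkByByteBudget` are hypothetical names, and the real function may order or size operations differently.

```typescript
// Hypothetical sketch of greedy batching under a byte budget and a count limit.
interface SizedOp {
  id: string
  sizeBytes: number
}

/**
 * Splits ops into consecutive batches so that each batch holds at most
 * `maxCount` ops and at most `maxBytes` total. An op larger than `maxBytes`
 * on its own still gets a batch to itself.
 */
function chunkByByteBudget<T extends SizedOp>(
  ops: T[],
  maxBytes: number,
  maxCount: number
): T[][] {
  const batches: T[][] = []
  let current: T[] = []
  let currentBytes = 0

  for (const op of ops) {
    const full =
      current.length > 0 &&
      (current.length >= maxCount || currentBytes + op.sizeBytes > maxBytes)
    if (full) {
      batches.push(current)
      current = []
      currentBytes = 0
    }
    current.push(op)
    currentBytes += op.sizeBytes
  }

  if (current.length > 0) batches.push(current)
  return batches
}
```

If each batch is processed concurrently and batches run one after another, peak in-flight bytes stay near `maxBytes`, or the size of the single largest op when that op exceeds the budget.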
if (fullDoc?.skippedReason) {
  if (op.type === 'add') {
    skipExtDocs.push({
      ...op.extDoc,
      skippedReason: fullDoc.skippedReason,
      contentHash: fullDoc.contentHash ?? op.extDoc.contentHash,
      metadata: { ...op.extDoc.metadata, ...fullDoc.metadata },
    })
  }
  return null
}
update op skipped at fetch time produces no result counter increment
When a deferred op.type === 'update' is hydrated and the freshly fetched document carries skippedReason (the file grew past the cap and is only discovered to be oversized at download time), the code correctly preserves the previously-indexed content (last-known-good) and returns null. However, this path increments neither result.docsUnchanged, result.docsFailed, nor any other counter. Every sync that exercises this branch will emit a total (docsAdded + docsUpdated + docsUnchanged + docsFailed) that is smaller than the number of documents seen, making the sync log stats non-auditable. Adding result.docsUnchanged++ before return null here would keep the counters accurate without changing behaviour.
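The auditing invariant behind this comment can be sketched as a small tally, under stated assumptions. `FetchOutcome` and `tallyOutcomes` are hypothetical, and counting a skipped `add` as `docsFailed` is an assumption made only so that every document lands in exactly one counter; the source does not say which counter the bulk `skipDocuments` path uses.

```typescript
// Hypothetical sketch; counter names follow the review comment, the rest is assumed.
interface SyncCounters {
  docsAdded: number
  docsUpdated: number
  docsUnchanged: number
  docsFailed: number
}

interface FetchOutcome {
  op: 'add' | 'update'
  /** True when the fetched document came back with a skippedReason. */
  skipped: boolean
}

function tallyOutcomes(outcomes: FetchOutcome[]): SyncCounters {
  const counters: SyncCounters = { docsAdded: 0, docsUpdated: 0, docsUnchanged: 0, docsFailed: 0 }
  for (const outcome of outcomes) {
    if (!outcome.skipped) {
      if (outcome.op === 'add') counters.docsAdded++
      else counters.docsUpdated++
    } else if (outcome.op === 'add') {
      // Assumption: a skipped new document is recorded as a failed row.
      counters.docsFailed++
    } else {
      // Reviewer's suggestion: last-known-good content is kept, so count as unchanged.
      counters.docsUnchanged++
    }
  }
  return counters
}
```

With every branch incrementing exactly one counter, the sum of the four counters equals the number of documents processed, which is the property the comment asks for.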


Summary
Fixes a class of 10 MB limit failures across workflow execution and KB connectors.
- Function blocks: over-budget block-output context values are offloaded to durable large-value refs (read lazily via `sim.values.read`), so a JS function can merge medium files without busting the 10 MB inter-block request-body cap (the original "Seedance" merge failure).
- Oversized connector files surface as visible `failed` KB documents (with a reason) instead of being silently dropped, at listing time (GitHub/S3/Dropbox/OneDrive/SharePoint) and fetch time (GitLab/Azure/Google Drive via a shared `ConnectorFileTooLargeError`).
- `response.text()` downloads replaced with streaming `readBodyWithLimit` (cancels past the cap; closes a Dropbox OOM/DoS gap).
- Per-file cap raised from a hardcoded 10 MB to the canonical 100 MB KB document limit (`CONNECTOR_MAX_FILE_BYTES`), except Google Drive's export path (Google's hard 10 MB export-API limit).
- Sync engine: `classifyExternalDoc` classification, bulk `skipDocuments` (failed rows, excluded from the stuck-doc retry sweep), byte-bounded batch concurrency so the raised cap can't OOM the worker, and a `metadata.fileSize ?? size` fallback so skipped rows show the real size.

Type of Change
Testing
WiP
Checklist