Skip to content

improvement(knowledge): eliminate N+1 on tag definitions in bulk upload#4681

Merged
waleedlatif1 merged 4 commits into
stagingfrom
waleedlatif1/dublin-v3
May 20, 2026
Merged

improvement(knowledge): eliminate N+1 on tag definitions in bulk upload#4681
waleedlatif1 merged 4 commits into
stagingfrom
waleedlatif1/dublin-v3

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • Found an N+1 in createDocumentRecords: every doc with tags ran its own SELECT against knowledge_base_tag_definitions, all returning the same knowledgeBaseId-scoped rows.
  • Worse, those reads used the global db pool while the surrounding transaction held a FOR UPDATE lock on the KB row — risking pool contention during large bulk uploads.
  • Split the helper into loadTagDefinitions (single query, accepts the tx as executor) and resolveDocumentTags (pure function, takes the pre-loaded Map).
  • Bulk upload now loads tag definitions once per request using the transaction's connection.
  • createSingleDocument keeps the same call pattern (load once outside its tx).
  • Same throw-on-validation-error semantics preserved.

Impact

For a 500-doc bulk upload where every doc has tags: 499 fewer SELECTs on knowledge_base_tag_definitions, plus 499 fewer pool checkouts taken while holding the KB FOR UPDATE lock.

Type of Change

  • Performance improvement

Testing

  • All 208 KB unit tests pass (lib/knowledge/**, app/api/knowledge/**).
  • No behavioral change: validation throws on the same inputs as before.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

createDocumentRecords previously called processDocumentTags per-doc, each
running a SELECT against knowledge_base_tag_definitions — N queries that
all returned the same kbId-scoped rows. Worse, those reads used the
global db pool while the tx held a FOR UPDATE lock on the KB row, risking
pool contention on large bulk uploads.

Split the helper into loadTagDefinitions (single query, accepts the tx as
executor) and resolveDocumentTags (pure, takes the pre-loaded Map). The
bulk path loads once inside the transaction; createSingleDocument loads
once outside its tx. Same throw-on-validation-error semantics preserved.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 20, 2026 9:05pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 20, 2026

PR Summary

Medium Risk
Touches core document processing and bulk upload paths by changing DB query shapes (joins and cached tag definition lookups), which could affect processing/billing behavior if the new joins or cached definitions are incorrect. Main intent is performance, but it changes how missing/deleted records are detected and how billing user IDs are sourced.

Overview
Reduces DB round-trips in knowledge document flows. Bulk document creation now loads knowledge_base_tag_definitions once per batch (and via the transaction executor) and resolves per-document tags via a new pure resolveDocumentTags helper.

processDocumentAsync now prefetches knowledge base config, workspace billing user, and document tag fields in a single documentknowledge_base JOIN (removing separate reads), reuses the prefetched tag values for embedding rows, and updates the failure message/where-clause when the document/KB no longer exists.

Tests update the DB mock to support the new join-based prefetch used by processDocumentAsync.

Reviewed by Cursor Bugbot for commit 4ce7939. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR eliminates an N+1 query pattern in createDocumentRecords (bulk upload) by splitting the former processDocumentTags async function into loadTagDefinitions (single query, runs on the transaction connection) and resolveDocumentTags (pure synchronous function). Additionally, processDocumentAsync consolidates three separate SELECTs (KB config, workspace billing, document tag columns) into one JOIN, removing the secondary document read that happened deep inside the processing pipeline.

  • createDocumentRecords: tag definitions are now loaded once per batch using the active transaction's executor, and only when at least one document carries documentTagsData; tag-free batches skip the query entirely.
  • createSingleDocument: unchanged call-pattern — loadTagDefinitions runs outside the transaction with the global db pool, matching the pre-refactor behavior.
  • processDocumentAsync: three queries collapsed into one JOIN across document, knowledge_base, and workspace; removes the import of getWorkspaceBilledAccountUserId (which already filtered on isNull(archivedAt), consistent with the new LEFT JOIN condition).

Confidence Score: 5/5

Safe to merge — a focused query consolidation with no behavioral regressions; error-handling and validation semantics are identical to the pre-refactor path.

The refactor correctly loads tag definitions once per batch on the transaction connection, eliminating the N+1 pattern without changing what happens when tags are invalid or missing. The workspace archivedAt filter in the new LEFT JOIN matches what getWorkspaceBilledAccountUserId already used, so billing lookups behave identically. The tag-free batch short-circuit is correct. No logic paths were broken.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/knowledge/documents/service.ts Core refactor: splits processDocumentTags into loadTagDefinitions (single tx-aware query) and resolveDocumentTags (pure function); collapses three SELECTs in processDocumentAsync into one JOIN. Logic and error semantics preserved.
apps/sim/app/api/knowledge/utils.test.ts Test mock extended with innerJoin/leftJoin/where chain to cover the new combined query in processDocumentAsync; existing test coverage unaffected.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant CDR as createDocumentRecords (bulk)
    participant LTD as loadTagDefinitions
    participant RDT as resolveDocumentTags
    participant DB as Database (tx)

    C->>CDR: documents[], knowledgeBaseId
    CDR->>DB: SELECT 1 FROM knowledge_base FOR UPDATE
    CDR->>DB: "SELECT id FROM knowledge_base WHERE id=? (tx)"
    note over CDR: hasTaggedDocs check
    alt at least one doc has documentTagsData
        CDR->>LTD: knowledgeBaseId, tx
        LTD->>DB: "SELECT * FROM knowledge_base_tag_definitions WHERE kb_id=? (once, on tx)"
        DB-->>LTD: tag definitions[]
        LTD-->>CDR: "Map<name, TagDefinition>"
    else no tagged docs
        note over CDR: skip query entirely
    end
    loop for each document
        CDR->>RDT: tagData[], tagDefinitions (Map), requestId
        note over RDT: pure, synchronous
        RDT-->>CDR: ProcessedDocumentTags
    end
    CDR->>DB: INSERT documents[]
    DB-->>C: DocumentData[]
Loading

Reviews (2): Last reviewed commit: "lint" | Re-trigger Greptile

Comment thread apps/sim/lib/knowledge/documents/service.ts Outdated
… JOIN

processDocumentAsync was issuing three separate SELECTs per processed
document: knowledge_base (config), workspace (billing settings), and
document (tag values). For a typical Trigger.dev fleet processing
thousands of docs, that's thousands of redundant pool checkouts.

Collapsed into a single JOIN at the top of processDocumentAsync that
fetches kb config + billed account user + document tag values in one
roundtrip. The post-embedding tag SELECT (which previously held tags
through the full embedding-generation wait) is gone; tags from the
initial prefetch are reused.

Behavior:
- Missing/archived/deleted document or KB → same 'failed' status outcome
  as before, single consolidated error message.
- Missing billed account → preserves existing error.
- All 208 KB tests pass (test mock extended for innerJoin/leftJoin).
… tags

Trim verbose comments in the same pass.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 4ce7939. Configure here.

@waleedlatif1 waleedlatif1 merged commit 4ca7651 into staging May 20, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/dublin-v3 branch May 21, 2026 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant