improvement(knowledge): eliminate N+1 on tag definitions in bulk upload#4681
Conversation
createDocumentRecords previously called processDocumentTags per-doc, each running a SELECT against knowledge_base_tag_definitions — N queries that all returned the same kbId-scoped rows. Worse, those reads used the global db pool while the tx held a FOR UPDATE lock on the KB row, risking pool contention on large bulk uploads. Split the helper into loadTagDefinitions (single query, accepts the tx as executor) and resolveDocumentTags (pure, takes the pre-loaded Map). The bulk path loads once inside the transaction; createSingleDocument loads once outside its tx. Same throw-on-validation-error semantics preserved.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview
Tests update the DB mock to support the new join-based prefetch used by Reviewed by Cursor Bugbot for commit 4ce7939. Configure here. |
Greptile SummaryThis PR eliminates an N+1 query pattern in
Confidence Score: 5/5Safe to merge — a focused query consolidation with no behavioral regressions; error-handling and validation semantics are identical to the pre-refactor path. The refactor correctly loads tag definitions once per batch on the transaction connection, eliminating the N+1 pattern without changing what happens when tags are invalid or missing. The workspace archivedAt filter in the new LEFT JOIN matches what getWorkspaceBilledAccountUserId already used, so billing lookups behave identically. The tag-free batch short-circuit is correct. No logic paths were broken. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant C as Caller
participant CDR as createDocumentRecords (bulk)
participant LTD as loadTagDefinitions
participant RDT as resolveDocumentTags
participant DB as Database (tx)
C->>CDR: documents[], knowledgeBaseId
CDR->>DB: SELECT 1 FROM knowledge_base FOR UPDATE
CDR->>DB: "SELECT id FROM knowledge_base WHERE id=? (tx)"
note over CDR: hasTaggedDocs check
alt at least one doc has documentTagsData
CDR->>LTD: knowledgeBaseId, tx
LTD->>DB: "SELECT * FROM knowledge_base_tag_definitions WHERE kb_id=? (once, on tx)"
DB-->>LTD: tag definitions[]
LTD-->>CDR: "Map<name, TagDefinition>"
else no tagged docs
note over CDR: skip query entirely
end
loop for each document
CDR->>RDT: tagData[], tagDefinitions (Map), requestId
note over RDT: pure, synchronous
RDT-->>CDR: ProcessedDocumentTags
end
CDR->>DB: INSERT documents[]
DB-->>C: DocumentData[]
Reviews (2): Last reviewed commit: "lint" | Re-trigger Greptile |
… JOIN processDocumentAsync was issuing three separate SELECTs per processed document: knowledge_base (config), workspace (billing settings), and document (tag values). For a typical Trigger.dev fleet processing thousands of docs, that's thousands of redundant pool checkouts. Collapsed into a single JOIN at the top of processDocumentAsync that fetches kb config + billed account user + document tag values in one roundtrip. The post-embedding tag SELECT (which previously held tags through the full embedding-generation wait) is gone; tags from the initial prefetch are reused. Behavior: - Missing/archived/deleted document or KB → same 'failed' status outcome as before, single consolidated error message. - Missing billed account → preserves existing error. - All 208 KB tests pass (test mock extended for innerJoin/leftJoin).
… tags Trim verbose comments in the same pass.
|
@greptile |
|
@cursor review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 4ce7939. Configure here.
Summary
createDocumentRecords: every doc with tags ran its ownSELECTagainstknowledge_base_tag_definitions, all returning the sameknowledgeBaseId-scoped rows.dbpool while the surrounding transaction held aFOR UPDATElock on the KB row — risking pool contention during large bulk uploads.loadTagDefinitions(single query, accepts the tx as executor) andresolveDocumentTags(pure function, takes the pre-loadedMap).createSingleDocumentkeeps the same call pattern (load once outside its tx).Impact
For a 500-doc bulk upload where every doc has tags: 499 fewer
SELECTs onknowledge_base_tag_definitions, plus 499 fewer pool checkouts taken while holding the KB FOR UPDATE lock.Type of Change
Testing
lib/knowledge/**,app/api/knowledge/**).Checklist