fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556
fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556waleedlatif1 wants to merge 2 commits into
Conversation
…o clear stuck-yellow task tiles copilot_chats.conversation_id has no TTL/heartbeat, so when a stream process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream) the column is orphaned and the task tile renders yellow forever. The Redis lock at copilot:chat-stream-lock:<chatId> is the canonical liveness signal and self-heals via 60s TTL + 20s heartbeat, but the mothership APIs weren't consulting it. Adds read-time reconciliation: a batched MGET helper checks whether each persisted conversation_id still has a live Redis lock, and both GET /api/mothership/chats and GET /api/mothership/chats/[chatId] rewrite the marker to null when the lock has expired. No DB writes; stuck rows self-heal on next fetch.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview Adds Reviewed by Cursor Bugbot for commit fca291b. Configure here. |
Greptile SummaryThis PR adds read-time reconciliation at the mothership API boundary to fix stuck-yellow task tiles caused by orphaned
Confidence Score: 5/5Safe to merge. The change is purely additive read-side logic with no DB writes; the worst-case failure mode (Redis down) degrades gracefully to the pre-existing behaviour. The reconciliation logic is narrow in scope, internally consistent with the existing No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant ListAPI as GET /mothership/chats
participant ChatAPI as GET /mothership/chats/[chatId]
participant DB as Postgres
participant Redis as Redis
Note over Client,Redis: List route
Client->>ListAPI: "GET ?workspaceId=xxx"
ListAPI->>DB: SELECT id, activeStreamId
DB-->>ListAPI: chats with some stale conversationId
ListAPI->>Redis: MGET stream-lock keys batched
Redis-->>ListAPI: null or owner per key
ListAPI->>ListAPI: null out entries with no lock
ListAPI-->>Client: chats with stale markers cleared
Note over Client,Redis: Single-chat route
Client->>ChatAPI: GET /chats/chatId
ChatAPI->>DB: getAccessibleCopilotChat
DB-->>ChatAPI: chat with conversationId set
ChatAPI->>Redis: MGET stream-lock for chatId
Redis-->>ChatAPI: null lock expired
ChatAPI->>ChatAPI: "liveConversationId = null"
ChatAPI-->>Client: chat with conversationId null
Reviews (2): Last reviewed commit: "test(mothership): clarify test name to r..." | Re-trigger Greptile |
…amIds is called with empty candidateIds
|
@greptile |
|
@cursor review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit fca291b. Configure here.
Summary
The bug
Mothership task tiles render yellow/amber when the chat has an "active stream" (
copilot_chats.conversation_id IS NOT NULL). Several tasks in production were stuck yellow with no stream actually running.Root cause
copilot_chats.conversation_idis the persistent active-stream marker. It is written when a stream starts and cleared on the finalize/stop callbacks (finalizeAssistantTurn,/chat/stop). It has no TTL and no heartbeat, so if the process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream), the column is orphaned forever and the task renders yellow indefinitely.The Redis lock at
copilot:chat-stream-lock:<chatId>is the canonical liveness signal — it self-heals via 60s TTL + 20s heartbeat fromstartAbortPoller. The API routes that powered the UI were reading the DB column directly without consulting Redis.The fix
Read-time reconciliation at the API boundary: check whether the persisted
conversation_idhas a live Redis lock; if not, treat the marker asnull.getActiveChatStreamIds(chatIds[])inlib/copilot/request/session/abort.ts— batched MGET againstcopilot:chat-stream-lock:<chatId>keys. Mirrors the existinggetPendingChatStreamIdpattern (local pending map + Redis fallback, graceful degradation on Redis-null or Redis-throws).GET /api/mothership/chatsfilters chats withactiveStreamId != null, MGETs their locks, rewritesactiveStreamId = nullfor any whose lock has expired.GET /api/mothership/chats/[chatId]does the same reconciliation per-chat before reading the stream snapshot — prevents the client from trying to reconnect to a dead stream viause-chat.ts.No DB writes; the DB stays the persistent record, Redis remains the source of truth for liveness. Already-stuck rows self-heal on the next fetch.
Type of Change
Testing
lib/copilot/request/session/abort.test.ts— 5 new cases forgetActiveChatStreamIds(empty input, mixed live/expired locks, all expired, Redis unavailable, Redis throws)app/api/mothership/chats/route.test.ts(new) — 4 cases: stuck-yellow reconciled, no Redis lookup when no candidates, live locks pass through, 401 unauthenticatedapp/api/mothership/chats/[chatId]/route.test.ts(new) — 5 cases: stuck conversationId nulled and no snapshot read, live lock returns snapshot, null conversationId short-circuits Redis call, 404, 40120 tests passing.
bun run check:api-validationpasses.Checklist