Skip to content

fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556

Open
waleedlatif1 wants to merge 2 commits into
stagingfrom
waleedlatif1/task-yellow-bug
Open

fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556
waleedlatif1 wants to merge 2 commits into
stagingfrom
waleedlatif1/task-yellow-bug

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

The bug

Mothership task tiles render yellow/amber when the chat has an "active stream" (copilot_chats.conversation_id IS NOT NULL). Several tasks in production were stuck yellow with no stream actually running.

Root cause

copilot_chats.conversation_id is the persistent active-stream marker. It is written when a stream starts and cleared on the finalize/stop callbacks (finalizeAssistantTurn, /chat/stop). It has no TTL and no heartbeat, so if the process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream), the column is orphaned forever and the task renders yellow indefinitely.

The Redis lock at copilot:chat-stream-lock:<chatId> is the canonical liveness signal — it self-heals via 60s TTL + 20s heartbeat from startAbortPoller. The API routes that powered the UI were reading the DB column directly without consulting Redis.

The fix

Read-time reconciliation at the API boundary: check whether the persisted conversation_id has a live Redis lock; if not, treat the marker as null.

  • New helper getActiveChatStreamIds(chatIds[]) in lib/copilot/request/session/abort.ts — batched MGET against copilot:chat-stream-lock:<chatId> keys. Mirrors the existing getPendingChatStreamId pattern (local pending map + Redis fallback, graceful degradation on Redis-null or Redis-throws).
  • GET /api/mothership/chats filters chats with activeStreamId != null, MGETs their locks, rewrites activeStreamId = null for any whose lock has expired.
  • GET /api/mothership/chats/[chatId] does the same reconciliation per-chat before reading the stream snapshot — prevents the client from trying to reconnect to a dead stream via use-chat.ts.

No DB writes; the DB stays the persistent record, Redis remains the source of truth for liveness. Already-stuck rows self-heal on the next fetch.

Type of Change

  • Bug fix

Testing

  • lib/copilot/request/session/abort.test.ts — 5 new cases for getActiveChatStreamIds (empty input, mixed live/expired locks, all expired, Redis unavailable, Redis throws)
  • app/api/mothership/chats/route.test.ts (new) — 4 cases: stuck-yellow reconciled, no Redis lookup when no candidates, live locks pass through, 401 unauthenticated
  • app/api/mothership/chats/[chatId]/route.test.ts (new) — 5 cases: stuck conversationId nulled and no snapshot read, live lock returns snapshot, null conversationId short-circuits Redis call, 404, 401

20 tests passing. bun run check:api-validation passes.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

…o clear stuck-yellow task tiles

copilot_chats.conversation_id has no TTL/heartbeat, so when a stream
process dies before the clear path runs (pod OOM, SIGKILL, uncaught
throw, deploy mid-stream) the column is orphaned and the task tile
renders yellow forever. The Redis lock at copilot:chat-stream-lock:<chatId>
is the canonical liveness signal and self-heals via 60s TTL + 20s
heartbeat, but the mothership APIs weren't consulting it.

Adds read-time reconciliation: a batched MGET helper checks whether
each persisted conversation_id still has a live Redis lock, and both
GET /api/mothership/chats and GET /api/mothership/chats/[chatId]
rewrite the marker to null when the lock has expired. No DB writes;
stuck rows self-heal on next fetch.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 11, 2026 7:44pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 11, 2026

PR Summary

Medium Risk
Changes API responses for mothership chat listing and chat detail by masking stale conversationId/activeStreamId based on Redis lock state; incorrect reconciliation or Redis edge cases could hide a legitimately active stream, though the logic degrades gracefully and is covered by tests.

Overview
Fixes the stuck-yellow mothership task/chat state by reconciling the persisted stream marker (copilot_chats.conversation_id) with the canonical Redis chat-stream lock at read time.

Adds getActiveChatStreamIds() (batched Redis MGET + local pending fallback) and uses it in GET /api/mothership/chats and GET /api/mothership/chats/[chatId] to null out stale activeStreamId/conversationId and to avoid reading stream snapshots for dead streams. New Vitest coverage validates reconciliation, short-circuit behavior, and auth/404 handling.

Reviewed by Cursor Bugbot for commit fca291b. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds read-time reconciliation at the mothership API boundary to fix stuck-yellow task tiles caused by orphaned copilot_chats.conversation_id rows. When a stream-hosting pod dies before reaching its finalize/stop path, the DB column is never cleared; the fix consults the Redis lock (self-healing via 60s TTL) and rewrites conversationId/activeStreamId to null on the way out — no DB writes required.

  • New getActiveChatStreamIds(chatIds[]) in abort.ts performs a batched Redis MGET against copilot:chat-stream-lock:<chatId> keys, mirroring the existing getPendingChatStreamId pattern with graceful degradation when Redis is unavailable.
  • GET /api/mothership/chats reconciles all chats with a non-null activeStreamId before returning, nulling out any whose Redis lock has expired.
  • GET /api/mothership/chats/[chatId] applies the same single-chat reconciliation before reading the stream snapshot, preventing the client from attempting to reconnect to a dead stream.

Confidence Score: 5/5

Safe to merge. The change is purely additive read-side logic with no DB writes; the worst-case failure mode (Redis down) degrades gracefully to the pre-existing behaviour.

The reconciliation logic is narrow in scope, internally consistent with the existing getPendingChatStreamId pattern, and guarded by comprehensive tests covering all meaningful fallback paths. No writes to the DB are introduced, so regressions cannot corrupt persistent state.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/copilot/request/session/abort.ts Adds getActiveChatStreamIds — a batched Redis MGET helper that mirrors the existing getPendingChatStreamId contract; error handling and graceful degradation are consistent with the established pattern.
apps/sim/app/api/mothership/chats/route.ts Adds reconciliation block after the DB query: candidate IDs (non-null activeStreamId) are batched into a single MGET; chats whose lock has expired are rewritten to activeStreamId: null before the response.
apps/sim/app/api/mothership/chats/[chatId]/route.ts Per-chat reconciliation mirrors the list route; liveConversationId is computed once and used consistently throughout snapshot loading, transcript building, and the response payload.
apps/sim/lib/copilot/request/session/abort.test.ts Five new cases for getActiveChatStreamIds cover empty input, mixed live/expired locks, all-expired, Redis unavailable, and Redis throws — good coverage of all fallback paths.
apps/sim/app/api/mothership/chats/route.test.ts Four cases covering stuck-yellow reconciliation, no-candidate short-circuit, all-live pass-through, and 401 unauthenticated — comprehensive route-level coverage.
apps/sim/app/api/mothership/chats/[chatId]/route.test.ts Five cases covering stuck-conversationId null-out (with readEvents guard), live lock pass-through, null-conversationId short-circuit, 404, and 401 — all key branches exercised.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ListAPI as GET /mothership/chats
    participant ChatAPI as GET /mothership/chats/[chatId]
    participant DB as Postgres
    participant Redis as Redis

    Note over Client,Redis: List route
    Client->>ListAPI: "GET ?workspaceId=xxx"
    ListAPI->>DB: SELECT id, activeStreamId
    DB-->>ListAPI: chats with some stale conversationId
    ListAPI->>Redis: MGET stream-lock keys batched
    Redis-->>ListAPI: null or owner per key
    ListAPI->>ListAPI: null out entries with no lock
    ListAPI-->>Client: chats with stale markers cleared

    Note over Client,Redis: Single-chat route
    Client->>ChatAPI: GET /chats/chatId
    ChatAPI->>DB: getAccessibleCopilotChat
    DB-->>ChatAPI: chat with conversationId set
    ChatAPI->>Redis: MGET stream-lock for chatId
    Redis-->>ChatAPI: null lock expired
    ChatAPI->>ChatAPI: "liveConversationId = null"
    ChatAPI-->>Client: chat with conversationId null
Loading

Reviews (2): Last reviewed commit: "test(mothership): clarify test name to r..." | Re-trigger Greptile

Comment thread apps/sim/app/api/mothership/chats/route.test.ts Outdated
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit fca291b. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants