improvement(sockets): make offline mode recoverable and stop transient races tripping it#4980
Conversation
…t races tripping it
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview The operation queue now treats socket emits that return Socket emits ( Realtime server: several Collaboration: remote block/variable removals cancel queued ops for those targets. New operation-queue tests cover skip-emit, recovery, and offline triggers. Reviewed by Cursor Bugbot for commit 8f4aa4d. Configure here. |
|
@greptile |
Greptile SummaryThis PR improves the real-time collaborative editing layer by making offline mode recoverable and preventing transient socket races from permanently tripping it. The changes span the server (retryable classifications, DB-error rethrow in permissions), the socket provider (join-blocked state, boolean-returning emit functions), the operation queue (pending reversion on skipped emits, missing-target drop at retry exhaustion), and collaborative-workflow hooks (pre-emptive cancellation on remote block/variable deletions).
Confidence Score: 5/5Safe to merge — the operation queue mechanics, recovery paths, and server-side retryable classifications are well-reasoned and backed by new tests. The core changes (boolean emit return, pending reversion, dropOperationForMissingTarget at retry exhaustion, clearError on workspace switch) are correct and tested. The one non-blocking finding is that the action-bar tooltip shows 'Read-only mode' instead of a connection-error message when the workflow join is blocked, because isOfflineMode:false is passed through in that state — users still see the correct context via the separate toast, so no data or permission boundary is affected. workspace-permissions-provider.tsx — the isOfflineMode field passed in the join-blocked branch may mislead action-bar tooltip copy. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[addToQueue] --> B[processNextOperation]
B --> C{emitFn returns true?}
C -- false / skipped --> D[revert op to pending\nisProcessing = false]
D --> E[wait for registerEmitFunctions]
E --> B
C -- true / emitted --> F[set op processing\nstart timeout]
F --> G{Server response}
G -- operation-confirmed --> H[confirmOperation\nremove op, process next]
G -- operation-failed retryable:true --> I{retryCount < max?}
I -- yes --> J[increment retry\nschedule delay → processNext]
J --> B
I -- no / exhausted --> K[dropOperationForMissingTarget]
K -- target gone locally --> L[drop op\nprocess next]
K -- target still present --> M[triggerOfflineMode\nhasOperationError = true]
G -- operation-failed retryable:false --> N[dropOperationForMissingTarget]
N -- target gone --> L
N -- target present --> M
M --> O{Recovery path}
O -- workspace switch --> P[resetWorkflowStores\nclearError]
O -- page refresh --> Q[full reload]
R[join-workflow-error\nnon-retryable] --> S[cancelOperationsForWorkflow\nsetBlockedJoinWorkflowId]
T[join-workflow-success] --> U[setBlockedJoinWorkflowId = null]
style M fill:#f66,color:#fff
style L fill:#6c6,color:#fff
style H fill:#6c6,color:#fff
Reviews (3): Last reviewed commit: "code cleanup" | Re-trigger Greptile |
Greptile SummaryThis PR addresses two related reliability problems with the real-time collaborative socket layer: (1) transient DB errors in
Confidence Score: 4/5Safe to merge; the core reconnection logic is well-tested and the offline-mode recovery path is straightforward. The main uncertainty is around the "Workflow not found" code path being retryable without a matching cancelOperationsForWorkflow call for workflow deletions. The PR correctly fixes two independent races (transient DB error → permanent denial, and emit-before-room-ready → spurious timeout). The new tests cover the added paths. The retryable flag change for "Workflow not found" and "Block/Variable no longer exists" is intentional and has a client-side backstop (dropOperationForMissingTarget), but there is no cancelOperationsForWorkflow call in the collaborative hook for workflow deletions, unlike the block/variable deletion paths which were explicitly wired up in this PR. In a workflow-deletion race the client would exhaust retries before dropping or triggering offline mode, and the delay is non-trivial (~12 s for subblock ops). apps/realtime/src/handlers/subblocks.ts and variables.ts — the "Workflow not found" retryable flag and the absence of a cancelOperationsForWorkflow hook in use-collaborative-workflow.ts. Important Files Changed
Sequence DiagramsequenceDiagram
participant C as Client
participant OQ as OperationQueue
participant S as SocketProvider
participant R as Realtime Server
participant DB as Database
Note over C,DB: Transient race – room not yet visible
C->>OQ: addToQueue(op)
OQ->>S: emitWorkflowOperation()
S-->>OQ: false (room not joined)
OQ->>OQ: "status = pending, isProcessing = false"
Note over C,DB: Socket reconnects / registerEmitFunctions called
S->>OQ: registerEmitFunctions() → processNextOperation()
OQ->>S: emitWorkflowOperation()
S-->>OQ: true (room now joined)
OQ->>R: workflow-operation event
R->>DB: persist changes
R-->>OQ: operation-confirmed
Note over C,DB: DB error in verifyWorkflowAccess (now throws)
C->>R: join-workflow
R->>DB: verifyWorkflowAccess()
DB-->>R: throws error (transient)
R-->>C: join-workflow-error retryable true
Note over C,DB: Remote block deletion – cancel before failure arrives
R-->>C: batch-remove-blocks event
C->>OQ: cancelOperationsForBlock(id)
OQ->>OQ: remove pending ops for block
Note over C,DB: Offline mode recovery
C->>C: workspace switch / rejoin
C->>OQ: clearError()
OQ->>OQ: "hasOperationError = false"
C->>C: clearOfflineNotification() + re-arm
|
|
@greptile |
Summary
Recover from offline mode on rejoin/workspace switch and stop transient socket races from tripping it
Type of Change
Testing
Tested manually
Checklist