improvement(sockets): make offline mode recoverable and stop transient races tripping it #4980

Merged
icecrasher321 merged 3 commits into staging from improvement/sockets-notif-clear on Jun 11, 2026

@icecrasher321

Collaborator

Summary

Recover from offline mode on rejoin/workspace switch and stop transient socket races from tripping it

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 11, 2026 11:41pm

@cursor

cursor Bot commented Jun 11, 2026

PR Summary

Medium Risk
Touches collaborative save path, offline/read-only gating, and realtime failure semantics; behavior change is intentional but affects core editing reliability.

Overview
Makes collaborative offline mode recoverable and stops transient socket/realtime races from locking the editor read-only.

The operation queue now treats socket emits that return false (room not joined/visible) as “not sent”: operations stay pending instead of timing out into offline mode, and retry when the room becomes joinable. Failed ops for blocks or variables that were already removed locally are dropped without tripping offline mode. Offline state clears on successful workflow join, on workspace switch, and via the existing clearError.

Socket emits (emitWorkflowOperation, subblock/variable) return boolean success. UI: the “Connection unavailable” toast dismisses when offline mode recovers.
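The skip-emit behavior described above can be sketched roughly as follows. This is a minimal illustration, not the store's actual code: `MiniQueue`, `QueuedOp`, `Emit`, and the `timeoutsStarted` counter (a stand-in for the real confirmation timeout) are hypothetical names invented for the example.

```typescript
// Hypothetical, simplified model of the queue; the real store is richer.
type OpStatus = "pending" | "processing";

interface QueuedOp {
  id: string;
  status: OpStatus;
}

// Returns true only when the payload was actually sent over the socket.
type Emit = (op: QueuedOp) => boolean;

class MiniQueue {
  ops: QueuedOp[] = [];
  isProcessing = false;
  timeoutsStarted = 0; // stands in for starting a confirmation timeout
  private emit: Emit = () => false;

  // Re-registering the emit function re-triggers processing (assumed to
  // happen on reconnect or rejoin).
  registerEmit(emit: Emit): void {
    this.emit = emit;
    this.processNext();
  }

  add(id: string): void {
    this.ops.push({ id, status: "pending" });
    this.processNext();
  }

  processNext(): void {
    if (this.isProcessing) return;
    const next = this.ops.find((op) => op.status === "pending");
    if (!next) return;

    this.isProcessing = true;
    next.status = "processing";

    if (!this.emit(next)) {
      // Not sent: revert to pending and wait, without starting a timeout.
      next.status = "pending";
      this.isProcessing = false;
      return;
    }
    this.timeoutsStarted += 1;
  }

  confirm(id: string): void {
    this.ops = this.ops.filter((op) => op.id !== id);
    this.isProcessing = false;
    this.processNext();
  }
}
```

In this sketch, an operation added while the emit function returns false stays pending with no timeout running, and is sent on the next `registerEmit` call once the emit function reports success.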

Realtime server: several operation-failed cases (missing session, workflow not found, block/variable gone) are marked retryable; verifyWorkflowAccess rethrows DB errors so join can surface retryable failures instead of permanent denial.
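A rough sketch of why rethrowing matters, under stated assumptions: `verifyAccessSketch`, `joinSketch`, `RoleLookup`, and the error messages are hypothetical, and the lookup is a synchronous stand-in for what would be an asynchronous database call in the real server.

```typescript
// Hypothetical shapes for illustration only.
interface AccessResult {
  hasAccess: boolean;
}

interface JoinError {
  message: string;
  retryable: boolean;
}

// Synchronous stand-in for a database query; it may throw on a DB failure.
type RoleLookup = (userId: string, workflowId: string) => string | null;

function verifyAccessSketch(
  lookupRole: RoleLookup,
  userId: string,
  workflowId: string,
): AccessResult {
  // No try/catch here: a database failure propagates to the caller
  // instead of being reported as { hasAccess: false }.
  const role = lookupRole(userId, workflowId);
  return { hasAccess: role !== null };
}

function joinSketch(
  lookupRole: RoleLookup,
  userId: string,
  workflowId: string,
): JoinError | null {
  try {
    const { hasAccess } = verifyAccessSketch(lookupRole, userId, workflowId);
    // A genuine denial is permanent, so it is not retryable.
    return hasAccess ? null : { message: "Access denied", retryable: false };
  } catch {
    // A thrown error is treated as transient, so the client may retry.
    return { message: "Failed to verify access", retryable: true };
  }
}
```

The point of the split is that only the thrown path yields a retryable error; a clean "no role" answer remains a permanent denial.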

Collaboration: remote block/variable removals cancel queued ops for those targets. New operation-queue tests cover skip-emit, recovery, and offline triggers.
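The cancellation on remote removal might look something like the sketch below. The `PendingOp` shape and the `cancelOperationsForTargets` helper are assumptions made for the example; the PR's actual functions are cancelOperationsForBlock and cancelOperationsForVariable, whose signatures are not shown in this conversation.

```typescript
// Hypothetical queue entry: each op names the block or variable it targets.
interface PendingOp {
  id: string;
  target: { kind: "block" | "variable"; id: string };
}

// Drops every queued op whose target was just removed by a remote event,
// so it can never fail on the server and count toward offline mode.
function cancelOperationsForTargets(
  queue: PendingOp[],
  kind: "block" | "variable",
  removedIds: string[],
): PendingOp[] {
  const removed = new Set(removedIds);
  return queue.filter(
    (op) => !(op.target.kind === kind && removed.has(op.target.id)),
  );
}
```

For a batch block removal, the helper would be called once with all removed ids; ops for other blocks and for variables with a coinciding id are left untouched.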

Reviewed by Cursor Bugbot for commit 8f4aa4d.

@icecrasher321

Collaborator Author

@greptile

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR improves the real-time collaborative editing layer by making offline mode recoverable and preventing transient socket races from permanently tripping it. The changes span the server (retryable classifications, DB-error rethrow in permissions), the socket provider (join-blocked state, boolean-returning emit functions), the operation queue (pending reversion on skipped emits, missing-target drop at retry exhaustion), and collaborative-workflow hooks (pre-emptive cancellation on remote block/variable deletions).

  • Server: verifyWorkflowAccess now rethrows DB errors instead of returning hasAccess: false, cleanly separating transient failures from genuine access denials; several previously non-retryable failure codes are now retryable to absorb timing races.
  • Client operation queue: Emit functions return a boolean; a false result reverts the operation to pending (no timeout) rather than starting a processing cycle, so operations queued before a room join are held safely and replayed once the socket is ready.
  • Recovery: clearError() is called on workspace switch so a tripped offline mode does not bleed into a new workspace; a new blockedJoinWorkflowId state separately represents a non-retryable join failure without conflating it with the operation-failure offline mode.
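One way to picture the two separate states from the bullets above is a small reducer. This is a sketch under assumptions: the event names, the `CollabState` shape, and the choice to leave `blockedJoinWorkflowId` untouched on workspace switch are all hypothetical; only the field names hasOperationError and blockedJoinWorkflowId come from the PR.

```typescript
interface CollabState {
  hasOperationError: boolean; // offline mode tripped by the operation queue
  blockedJoinWorkflowId: string | null; // non-retryable join failure
}

// Hypothetical event names for the sketch.
type CollabEvent =
  | { type: "operation-failed-permanently" }
  | { type: "join-blocked"; workflowId: string }
  | { type: "join-success" }
  | { type: "workspace-switch" };

function reduceCollab(state: CollabState, event: CollabEvent): CollabState {
  switch (event.type) {
    case "operation-failed-permanently":
      return { ...state, hasOperationError: true };
    case "join-blocked":
      return { ...state, blockedJoinWorkflowId: event.workflowId };
    case "join-success":
      // A successful join clears both the offline flag and the join block.
      return { hasOperationError: false, blockedJoinWorkflowId: null };
    case "workspace-switch":
      // Mirrors clearError(): only the operation error is reset here.
      return { ...state, hasOperationError: false };
  }
}

// The editor is read-only if either condition holds, but the two causes
// stay distinguishable for messaging.
function isReadOnly(state: CollabState): boolean {
  return state.hasOperationError || state.blockedJoinWorkflowId !== null;
}
```

Keeping the two flags apart lets the UI show a connection error for a tripped queue and a different message for a blocked join, which is the distinction the tooltip finding below is about.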

Confidence Score: 5/5

Safe to merge — the operation queue mechanics, recovery paths, and server-side retryable classifications are well-reasoned and backed by new tests.

The core changes (boolean emit return, pending reversion, dropOperationForMissingTarget at retry exhaustion, clearError on workspace switch) are correct and tested. The one non-blocking finding is that the action-bar tooltip shows 'Read-only mode' instead of a connection-error message when the workflow join is blocked, because isOfflineMode:false is passed through in that state — users still see the correct context via the separate toast, so no data or permission boundary is affected.

workspace-permissions-provider.tsx — the isOfflineMode field passed in the join-blocked branch may mislead action-bar tooltip copy.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/realtime/src/handlers/subblocks.ts | Three failure paths flipped from retryable:false to retryable:true: 'User session not found' (transient race, mitigated by SESSION_ERROR rejoin), 'Workflow not found' (permanent DB absence, noted in prior review comments), 'Block no longer exists' (permanent, but now mitigated by cancelOperationsForBlock on the remote-event path). |
| apps/realtime/src/handlers/variables.ts | Symmetric retryable changes to subblocks.ts; 'Variable no longer exists' is now retryable:true, mitigated by the new cancelOperationsForVariable call in use-collaborative-workflow.ts. |
| apps/realtime/src/middleware/permissions.ts | verifyWorkflowAccess now rethrows DB errors instead of returning hasAccess:false, correctly separating transient DB failures (rethrown → retryable) from genuine access denials (hasAccess:false → permanent). |
| apps/sim/app/workspace/providers/socket-provider.tsx | Adds blockedJoinWorkflowId state to track non-retryable join failures; emit functions now return boolean indicating whether the socket send was actually performed, enabling pending-op preservation on skipped emits. |
| apps/sim/app/workspace/[workspaceId]/providers/workspace-permissions-provider.tsx | Refactors toast management into a reusable usePersistentErrorToast hook; adds join-blocked handling; isOfflineMode:false is passed through when only isJoinBlocked is true, making the action-bar tooltip display 'Read-only mode' instead of a connection error in that state. |
| apps/sim/stores/operation-queue/store.ts | Key improvements: emit-skipped ops revert to pending (no timeout set), dropOperationForMissingTarget now also runs at retry exhaustion for retryable failures, and isVariableStillPresent check added for variable-scoped ops. |
| apps/sim/stores/operation-queue/types.ts | Extracts named emit function types (WorkflowOperationEmit, SubblockUpdateEmit, VariableUpdateEmit) with boolean return to signal whether the payload was actually sent; eliminates duplicate inline type signatures. |
| apps/sim/hooks/use-collaborative-workflow.ts | Adds cancelOperationsForVariable on remote VARIABLE_OPERATIONS.REMOVE and cancelOperationsForBlock for each id in a batch-remove-blocks event, preventing orphaned in-flight operations from reaching retry-exhaustion and tripping offline mode. |
| apps/sim/stores/operation-queue/store.test.ts | Four new test cases covering: skipped-emit pending reversion, re-emit after room becomes joinable, non-retryable offline mode + clearError recovery, and retry-exhaustion offline mode. |
| apps/sim/stores/workflows/registry/store.ts | clearError() called on workspace switch after resetWorkflowStores(), ensuring a previously tripped offline mode does not carry over to the new workspace. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[addToQueue] --> B[processNextOperation]
    B --> C{emitFn returns true?}
    C -- false / skipped --> D[revert op to pending\nisProcessing = false]
    D --> E[wait for registerEmitFunctions]
    E --> B
    C -- true / emitted --> F[set op processing\nstart timeout]
    F --> G{Server response}
    G -- operation-confirmed --> H[confirmOperation\nremove op, process next]
    G -- operation-failed retryable:true --> I{retryCount < max?}
    I -- yes --> J[increment retry\nschedule delay → processNext]
    J --> B
    I -- no / exhausted --> K[dropOperationForMissingTarget]
    K -- target gone locally --> L[drop op\nprocess next]
    K -- target still present --> M[triggerOfflineMode\nhasOperationError = true]
    G -- operation-failed retryable:false --> N[dropOperationForMissingTarget]
    N -- target gone --> L
    N -- target present --> M

    M --> O{Recovery path}
    O -- workspace switch --> P[resetWorkflowStores\nclearError]
    O -- page refresh --> Q[full reload]

    R[join-workflow-error\nnon-retryable] --> S[cancelOperationsForWorkflow\nsetBlockedJoinWorkflowId]
    T[join-workflow-success] --> U[setBlockedJoinWorkflowId = null]

    style M fill:#f66,color:#fff
    style L fill:#6c6,color:#fff
    style H fill:#6c6,color:#fff
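The failure branch of the flowchart above can be condensed into a small decision function. This is an illustrative sketch only: `resolveFailure`, `FailureInput`, and `Outcome` are hypothetical names, and the real store performs these steps with side effects rather than returning a value.

```typescript
type Outcome = "retry" | "drop" | "offline";

interface FailureInput {
  retryable: boolean; // from the server's operation-failed payload
  retryCount: number; // retries already attempted for this op
  maxRetries: number;
  targetStillPresent: boolean; // is the block/variable still in the local store?
}

function resolveFailure(input: FailureInput): Outcome {
  // Retryable failures are retried until the budget is exhausted.
  if (input.retryable && input.retryCount < input.maxRetries) {
    return "retry";
  }
  // Non-retryable or exhausted: drop quietly if the target is gone locally,
  // otherwise the failure is real and offline mode is triggered.
  return input.targetStillPresent ? "offline" : "drop";
}
```

Both the non-retryable path and the exhausted-retry path fall through to the same missing-target check, which matches the two dropOperationForMissingTarget nodes in the diagram.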

Reviews (3): Last reviewed commit: "code cleanup"

Comment thread apps/sim/stores/operation-queue/store.ts
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR addresses two related reliability problems with the real-time collaborative socket layer: (1) transient DB errors in verifyWorkflowAccess were silently downgraded to permanent access denials instead of being retried, and (2) a race between socket connection and room visibility caused operations to start a confirmation timeout for an emit that was never actually sent, so no confirmation could arrive and the timeout eventually expired, tripping offline mode.

  • verifyWorkflowAccess now rethrows DB errors; the join-workflow handler already had a try/catch that maps thrown errors to a retryable join-workflow-error, so the fix propagates correctly.
  • Emit functions now return a boolean; processNextOperation reverts the operation to pending (no timeout started) when the emit is skipped, and registerEmitFunctions re-triggers processing on reconnect/rejoin so the operation is flushed once the room is available.
  • cancelOperationsForBlock and cancelOperationsForVariable are called before local store deletions in the collaborative hook so in-flight operations are cleaned up before the server's failure response arrives.
  • Offline mode is now recoverable: clearError() is called on workspace switch, and the persistent offline toast is dismissed and re-armed when hasOperationError clears.
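The dismiss-and-re-arm behavior of the offline toast in the last bullet might be modeled as below. `OfflineToastSketch` and its fields are hypothetical; the real provider uses React refs and effects, which are omitted here to keep the example self-contained.

```typescript
// Hypothetical model: records which toasts were shown and dismissed.
class OfflineToastSketch {
  shown: string[] = [];
  dismissed: string[] = [];
  private activeToastId: string | null = null;

  // Called whenever hasOperationError changes (or on every render).
  sync(hasOperationError: boolean): void {
    if (hasOperationError && this.activeToastId === null) {
      // Show the toast once per offline episode.
      this.activeToastId = `offline-${this.shown.length + 1}`;
      this.shown.push(this.activeToastId);
    } else if (!hasOperationError && this.activeToastId !== null) {
      // Recovery: dismiss the toast and re-arm for the next episode.
      this.dismissed.push(this.activeToastId);
      this.activeToastId = null;
    }
  }
}
```

Repeated calls with the same flag are no-ops, so the toast is neither duplicated while offline nor left on screen after recovery.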

Confidence Score: 4/5

Safe to merge; the core reconnection logic is well-tested and the offline-mode recovery path is straightforward. The main uncertainty is around the "Workflow not found" code path being retryable without a matching cancelOperationsForWorkflow call for workflow deletions.

The PR correctly fixes two independent races (transient DB error → permanent denial, and emit-before-room-ready → spurious timeout). The new tests cover the added paths. The retryable flag change for "Workflow not found" and "Block/Variable no longer exists" is intentional and has a client-side backstop (dropOperationForMissingTarget), but there is no cancelOperationsForWorkflow call in the collaborative hook for workflow deletions, unlike the block/variable deletion paths which were explicitly wired up in this PR. In a workflow-deletion race the client would exhaust retries before dropping or triggering offline mode, and the delay is non-trivial (~12 s for subblock ops).

apps/realtime/src/handlers/subblocks.ts and variables.ts — the "Workflow not found" retryable flag and the absence of a cancelOperationsForWorkflow hook in use-collaborative-workflow.ts.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/stores/operation-queue/store.ts | Core change: emit functions now return boolean; false reverts operation to pending instead of starting a timeout. Adds dropOperationForMissingTarget (variable-aware) and calls useOperationQueueStore.setState() directly from outside the store, which is inconsistent with the pattern used elsewhere. |
| apps/realtime/src/middleware/permissions.ts | verifyWorkflowAccess now rethrows DB errors instead of returning {hasAccess: false}; the sole caller (workflow.ts) already has a try/catch that emits retryable: true, so this is safe and correct. |
| apps/realtime/src/handlers/subblocks.ts | Three operation-failed codes changed from retryable:false to retryable:true; "Workflow not found" and "Block no longer exists" are now retryable despite being definitive DB results. |
| apps/realtime/src/handlers/variables.ts | Mirrors the subblocks.ts retryable changes for variable-update operations. |
| apps/sim/app/workspace/providers/socket-provider.tsx | Emit functions return boolean; default context stubs updated to return false; emitWorkflowOperation position-update path correctly returns true. |
| apps/sim/app/workspace/[workspaceId]/providers/workspace-permissions-provider.tsx | Offline toast now tracked by ref for explicit dismissal; hasShownOfflineNotification reset on recovery; cleanup effect now dismisses both notification types on unmount. |
| apps/sim/hooks/use-collaborative-workflow.ts | cancelOperationsForBlock and cancelOperationsForVariable now called before local store deletions to cancel in-flight operations before server failure responses arrive. |
| apps/sim/stores/workflows/registry/store.ts | clearError() called on workspace switch so a previously tripped offline mode flag doesn't carry over into the new workspace. |
| apps/sim/stores/operation-queue/store.test.ts | Four new tests cover: skipped-emit reverts to pending, pending op is retried after room becomes joinable, non-retryable failure triggers offline mode and clearError recovers, retryable failure triggers offline mode only after exhausting retries. |

Sequence Diagram

sequenceDiagram
    participant C as Client
    participant OQ as OperationQueue
    participant S as SocketProvider
    participant R as Realtime Server
    participant DB as Database

    Note over C,DB: Transient race – room not yet visible
    C->>OQ: addToQueue(op)
    OQ->>S: emitWorkflowOperation()
    S-->>OQ: false (room not joined)
    OQ->>OQ: "status = pending, isProcessing = false"

    Note over C,DB: Socket reconnects / registerEmitFunctions called
    S->>OQ: registerEmitFunctions() → processNextOperation()
    OQ->>S: emitWorkflowOperation()
    S-->>OQ: true (room now joined)
    OQ->>R: workflow-operation event
    R->>DB: persist changes
    R-->>OQ: operation-confirmed

    Note over C,DB: DB error in verifyWorkflowAccess (now throws)
    C->>R: join-workflow
    R->>DB: verifyWorkflowAccess()
    DB-->>R: throws error (transient)
    R-->>C: join-workflow-error retryable true

    Note over C,DB: Remote block deletion – cancel before failure arrives
    R-->>C: batch-remove-blocks event
    C->>OQ: cancelOperationsForBlock(id)
    OQ->>OQ: remove pending ops for block

    Note over C,DB: Offline mode recovery
    C->>C: workspace switch / rejoin
    C->>OQ: clearError()
    OQ->>OQ: "hasOperationError = false"
    C->>C: clearOfflineNotification() + re-arm

Comments Outside Diff (1)

  1. apps/realtime/src/handlers/subblocks.ts, lines 248-256

    P2 "Workflow not found" is a definitive DB result, not a transient one

    Unlike the verifyWorkflowAccess path (which now correctly rethrows DB errors), this check is an explicit query that confirms the workflow is absent from the database. Marking it retryable: true means the client will attempt up to 5 retries (~12 s for subblock/variable ops) before dropOperationForMissingTarget runs. There is no corresponding cancelOperationsForWorkflow call in use-collaborative-workflow.ts to pre-emptively clear pending ops when a workflow is deleted — unlike block deletions (where cancelOperationsForBlock fires on the remote event before the server failure reaches the client). If the local store still shows the blocks after those retries exhaust (e.g. the deletion event hasn't been processed), offline mode will be triggered for every pending subblock operation against that workflow.

    The same gap applies to flushVariableUpdate in variables.ts.

Reviews (2): Last reviewed commit: "data persistence issues should trigger o..."

Comment thread apps/sim/stores/operation-queue/store.ts
@icecrasher321

Collaborator Author

@greptile

@icecrasher321 icecrasher321 merged commit d9f78c0 into staging Jun 11, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/sockets-notif-clear branch June 12, 2026 01:56