fix(realtime): re-validate socket role and evict revoked collaborators#5050
Conversation
Socket.io authorized workflow access only at join and cached the workspace role in presence, so a removed or downgraded collaborator kept live read/ write access until they disconnected. - Re-validate the cached role against the permissions table on mutating events, bounded by a short TTL; refresh or evict on change - Add /api/permissions-updated so the app reconciles active rooms, evicting revoked users (cross-pod) and refreshing downgraded roles - Notify realtime on workspace member removal and permission changes
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryHigh Risk Overview Per-operation authorization: Mutating paths ( Push reconciliation: The main app calls Presence now tracks Reviewed by Cursor Bugbot for commit 5c9d711. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 5c9d711. Configure here.
| return 1 | ||
| ` | ||
|
|
||
| /** |
There was a problem hiding this comment.
Activity update clobbers role
Medium Severity
Concurrent updateUserActivity and updateUserRole both read-modify-write the full presence JSON in Redis. If activity was read before a role refresh but written after, the stale snapshot can overwrite a downgraded or refreshed role and roleCheckedAt, so mutating ops may skip revalidation until the TTL expires.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 5c9d711. Configure here.
There was a problem hiding this comment.
Not a race here — both updateUserActivity and updateUserRole run as EVALSHA Lua scripts, and Redis executes Lua atomically (script execution is serialized; nothing interleaves between a script's HGET and HSET).
UPDATE_ACTIVITY_SCRIPT re-reads the presence blob inside its own atomic block (HGET → cjson.decode → set only cursor/selection/lastActivity → HSET cjson.encode), so it always preserves whatever role/roleCheckedAt are current at that instant. A concurrent UPDATE_ROLE_SCRIPT can't run between activity's read and write, so it can't be clobbered by a stale snapshot.
The only staleness window is the intended ≤15s role-revalidation TTL (and the push reconciliation makes full revocation immediate). Leaving the scripts as-is.
Greptile SummaryThis PR closes a security gap where a revoked or downgraded collaborator kept live Socket.IO read/write access until they disconnected, because the socket role was only checked once at
Confidence Score: 3/5The core eviction and TTL re-validation logic is correct and well-tested, but The security fix itself is sound and the two-layer design (pull TTL + push reconcile) is solid. The main concern is that the new HTTP notification helper is awaited in both the member-removal and permission-update routes without any apps/sim/lib/workspaces/permissions/realtime.ts (missing fetch timeout) and apps/realtime/src/rooms/access.ts (redundant DB queries per workflow room). Important Files Changed
Sequence DiagramsequenceDiagram
participant Admin as Admin (PATCH /permissions)
participant SimAPI as sim API
participant RealtimeHTTP as Realtime HTTP
participant RoomManager as RoomManager
participant Socket as Revoked Socket
Admin->>SimAPI: PATCH /api/workspaces/[id]/permissions
SimAPI->>SimAPI: Commit DB transaction (update permissions row)
SimAPI->>RealtimeHTTP: POST /api/permissions-updated
RealtimeHTTP->>RoomManager: handleWorkspaceAccessChange()
RoomManager->>RoomManager: reconcileWorkspaceAccessChange()
RoomManager->>SimAPI: verifyWorkflowAccess(userId, workflowId)
alt Access revoked
RoomManager->>Socket: emit(workflow-permissions-revoked)
RoomManager->>RoomManager: removeUserFromRoom()
RoomManager->>Socket: socketsLeave(workflowId)
else Role downgraded
RoomManager->>RoomManager: updateUserRole() [Lua atomic]
RoomManager->>RoomManager: broadcastPresenceUpdate()
end
RealtimeHTTP-->>SimAPI: 200 OK
SimAPI-->>Admin: 200 OK
Note over Socket: TTL path on next mutating event
Socket->>RoomManager: workflow-operation
RoomManager->>RoomManager: authorizeSocketOperation()
alt roleCheckedAt stale
RoomManager->>SimAPI: verifyWorkflowAccess() [DB re-validation]
alt Access revoked
RoomManager->>Socket: emit(ACCESS_REVOKED)
RoomManager->>RoomManager: evictRevokedSocket()
else Role changed
RoomManager->>RoomManager: updateUserRole() [Lua atomic]
end
end
Reviews (1): Last reviewed commit: "fix(realtime): re-validate socket role a..." | Re-trigger Greptile |
| const response = await fetch(`${getSocketServerUrl()}/api/permissions-updated`, { | ||
| method: 'POST', | ||
| headers: { | ||
| 'Content-Type': 'application/json', | ||
| 'x-api-key': env.INTERNAL_API_SECRET, | ||
| }, | ||
| body: JSON.stringify({ workspaceId, userId }), | ||
| }) | ||
|
|
||
| if (!response.ok) { | ||
| logger.warn( | ||
| `Failed to notify realtime of access change for user ${userId} in workspace ${workspaceId} (${response.status})` | ||
| ) | ||
| } |
There was a problem hiding this comment.
Missing fetch timeout blocks the caller
notifyWorkspaceAccessChanged is awaited synchronously in both the PATCH /api/workspaces/[id]/permissions handler and the DELETE /api/workspaces/members/[id] handler. If the realtime server accepts the TCP connection but never sends a response (e.g., it is under load, mid-restart, or hanging), fetch has no built-in deadline and the permission-change HTTP response will hang indefinitely — contradicting the "failures are logged but never block the permission mutation" promise in the JSDoc.
Add an AbortSignal.timeout() to bound the worst case: signal: AbortSignal.timeout(3000)
| const access = await verifyWorkflowAccess(userId, workflowId) | ||
|
|
||
| if (access.hasAccess) { | ||
| const role = access.role || 'read' | ||
| for (const target of targets) { | ||
| await manager.updateUserRole(workflowId, target.socketId, role) | ||
| } | ||
| await manager.broadcastPresenceUpdate(workflowId) | ||
| logger.info( | ||
| `Refreshed cached role to '${role}' for user ${userId} on workflow ${workflowId}` | ||
| ) | ||
| continue |
There was a problem hiding this comment.
Redundant per-workflow DB queries for workspace-level permission
verifyWorkflowAccess issues two DB queries per call: one to look up the workflow row and one to check the workspace permission via authorizeWorkflowByWorkspacePermission. Because workspace permission is a single entry in the permissions table that applies uniformly to every workflow in the workspace, calling it inside the per-workflow loop means the same permission row is fetched N times — once for each active room the user is present in. For a user simultaneously in 10+ workflow rooms, reconciliation makes 20+ sequential DB round-trips when 2 would suffice.
A targeted optimization would check the workspace-level access once before the loop and skip the per-workflow permission query, using only the workflow-existence check inside the loop.
| if (typeof workspaceId !== 'string' || typeof userId !== 'string') { | ||
| sendError(res, 'workspaceId and userId are required', 400) | ||
| return | ||
| } |
There was a problem hiding this comment.
Empty strings bypass the type check and reach the room manager
typeof workspaceId !== 'string' accepts "" as valid, which would then call handleWorkspaceAccessChange('', ''). The DB query filters by workspaceId, so it returns 0 rows and the call is harmless today — but an empty-string guard costs nothing and makes the contract explicit.
| if (typeof workspaceId !== 'string' || typeof userId !== 'string') { | |
| sendError(res, 'workspaceId and userId are required', 400) | |
| return | |
| } | |
| if (typeof workspaceId !== 'string' || typeof userId !== 'string' || | |
| !workspaceId || !userId) { | |
| sendError(res, 'workspaceId and userId are required', 400) | |
| return | |
| } |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| const isStale = Date.now() - lastChecked >= ROLE_REVALIDATION_TTL_MS | ||
|
|
||
| if (isStale) { | ||
| try { | ||
| const access = await verifyWorkflowAccess(userId, workflowId) | ||
| if (!access.hasAccess) { | ||
| return { | ||
| allowed: false, | ||
| role, | ||
| reason: 'Access to this workflow has been revoked', | ||
| accessRevoked: true, | ||
| } | ||
| } | ||
| role = access.role || 'read' | ||
| await roomManager.updateUserRole(workflowId, socketId, role) | ||
| } catch (error) { | ||
| logger.warn( | ||
| `Failed to re-validate role for user ${userId} on workflow ${workflowId}; reusing cached role`, | ||
| error | ||
| ) | ||
| } | ||
| } |
There was a problem hiding this comment.
Thundering herd of DB queries at each TTL boundary
When a socket is actively sending events (e.g., a large paste generates tens of workflow-operation frames in rapid succession), all inflight handlers snapshot presence.roleCheckedAt before any of them can write it back. Every handler sees isStale = true and concurrently calls verifyWorkflowAccess (two DB round-trips each). The first response to return then updates the room-manager timestamp, but the remaining in-flight queries are already underway.
Because TTL expiry is aligned to wall-clock time and batched edits are common in collaborative editors, this burst fires on every 15-second boundary for active sockets. The side-effect is harmless but wastes DB connections. A per-socket in-memory flag (e.g., a Set keyed by socketId) that is cleared on completion could collapse concurrent revalidations to one query.


Summary
join-workflowand cached the workspace role in presence — a removed or downgraded collaborator kept live read/write access until they disconnected.permissionstable on mutating events (workflow-operation,subblock-update,variable-update), bounded by a short TTL; refresh the role or evict the socket on change.POST /api/permissions-updatedso the app reconciles active rooms — evicting revoked users (cross-pod viasocketsLeave) and refreshing downgraded roles.Type of Change
Testing
authorizeSocketOperation(TTL/revoke/downgrade/transient-error/pre-upgrade presence),reconcileWorkspaceAccessChange(evict vs. refresh), and the/api/permissions-updatedendpoint.tscclean (realtime + sim). Boundary/prune/api-validation/biome pass.Checklist