Skip to content

fix(realtime): re-validate socket role and evict revoked collaborators#5050

Merged
waleedlatif1 merged 1 commit into
stagingfrom
worktree-fix+realtime-stale-role-revocation
Jun 15, 2026
Merged

fix(realtime): re-validate socket role and evict revoked collaborators#5050
waleedlatif1 merged 1 commit into
stagingfrom
worktree-fix+realtime-stale-role-revocation

Conversation

@waleedlatif1

Copy link
Copy Markdown
Collaborator

Summary

  • Realtime Socket.io authorized workflow access only at join-workflow and cached the workspace role in presence — a removed or downgraded collaborator kept live read/write access until they disconnected.
  • Re-validate the cached role against the permissions table on mutating events (workflow-operation, subblock-update, variable-update), bounded by a short TTL; refresh the role or evict the socket on change.
  • Add internal POST /api/permissions-updated so the app reconciles active rooms — evicting revoked users (cross-pod via socketsLeave) and refreshing downgraded roles.
  • Notify realtime from workspace member removal and permission-change routes.

Type of Change

  • Bug fix (security)

Testing

  • Added unit tests: authorizeSocketOperation (TTL/revoke/downgrade/transient-error/pre-upgrade presence), reconcileWorkspaceAccessChange (evict vs. refresh), and the /api/permissions-updated endpoint.
  • Full realtime suite: 123 passing. tsc clean (realtime + sim). Boundary/prune/api-validation/biome pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Socket.io authorized workflow access only at join and cached the workspace
role in presence, so a removed or downgraded collaborator kept live read/
write access until they disconnected.

- Re-validate the cached role against the permissions table on mutating
  events, bounded by a short TTL; refresh or evict on change
- Add /api/permissions-updated so the app reconciles active rooms, evicting
  revoked users (cross-pod) and refreshing downgraded roles
- Notify realtime on workspace member removal and permission changes
@vercel

vercel Bot commented Jun 15, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 15, 2026 4:11am

Request Review

@cursor

cursor Bot commented Jun 15, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes authorization and eviction on the live collaboration path; mistakes could wrongly evict users or briefly allow stale write access, though TTL re-validation and push reconciliation are designed to limit exposure.

Overview
Fixes a gap where workflow access was only checked at join-workflow, so removed or downgraded collaborators could keep mutating over an open socket until they disconnected.

Per-operation authorization: Mutating paths (workflow-operation, subblock-update, variable-update) now call authorizeSocketOperation, which trusts the cached presence role for 15s then re-reads the live permissions table, updates presence via updateUserRole, and on full revocation returns ACCESS_REVOKED and runs evictRevokedSocket. Transient DB errors during re-validation fall back to the cached role instead of locking users out.

Push reconciliation: The main app calls notifyWorkspaceAccessChanged → realtime POST /api/permissions-updatedreconcileWorkspaceAccessChange, which for active rooms in the workspace either evicts all of the user’s sockets (with workflow-permissions-revoked, cross-pod socketsLeave) or refreshes the cached role on downgrade. Wired from workspace permission PATCH (only when the role actually changed) and member removal.

Presence now tracks roleCheckedAt on join; Redis gets an atomic Lua role update so role refreshes don’t race activity updates.

Reviewed by Cursor Bugbot for commit 5c9d711. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5c9d711. Configure here.

return 1
`

/**

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Activity update clobbers role

Medium Severity

Concurrent updateUserActivity and updateUserRole both read-modify-write the full presence JSON in Redis. If activity was read before a role refresh but written after, the stale snapshot can overwrite a downgraded or refreshed role and roleCheckedAt, so mutating ops may skip revalidation until the TTL expires.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5c9d711. Configure here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a race here — both updateUserActivity and updateUserRole run as EVALSHA Lua scripts, and Redis executes Lua atomically (script execution is serialized; nothing interleaves between a script's HGET and HSET).

UPDATE_ACTIVITY_SCRIPT re-reads the presence blob inside its own atomic block (HGETcjson.decode → set only cursor/selection/lastActivityHSET cjson.encode), so it always preserves whatever role/roleCheckedAt are current at that instant. A concurrent UPDATE_ROLE_SCRIPT can't run between activity's read and write, so it can't be clobbered by a stale snapshot.

The only staleness window is the intended ≤15s role-revalidation TTL (and the push reconciliation makes full revocation immediate). Leaving the scripts as-is.

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR closes a security gap where a revoked or downgraded collaborator kept live Socket.IO read/write access until they disconnected, because the socket role was only checked once at join-workflow. The fix adds two complementary layers: a TTL-bounded re-validation (15 s) on every mutating socket event and a push-based reconciliation endpoint that the main app calls immediately after a permission change.

  • Pull-based (TTL): authorizeSocketOperation in permissions.ts re-queries the permissions table when the cached roleCheckedAt is stale; a DB error falls back to the cached role so legitimate users aren't locked out during a blip.
  • Push-based: notifyWorkspaceAccessChanged is called from both the member-removal and permission-update routes; the realtime server reconciles all active workflow rooms via reconcileWorkspaceAccessChange, evicting sockets where access was fully revoked and refreshing the cached role for downgrades.
  • Redis: A new atomic Lua script (UPDATE_ROLE_SCRIPT) patches role and roleCheckedAt in the presence hash without clobbering concurrent activity updates.

Confidence Score: 3/5

The core eviction and TTL re-validation logic is correct and well-tested, but notifyWorkspaceAccessChanged lacks a request timeout while being synchronously awaited inside the permission-change REST handlers — a stalled realtime connection can hang those API responses indefinitely.

The security fix itself is sound and the two-layer design (pull TTL + push reconcile) is solid. The main concern is that the new HTTP notification helper is awaited in both the member-removal and permission-update routes without any AbortSignal deadline. In a scenario where the realtime server accepts the TCP connection but stalls, the admin-facing REST API response hangs with no bound. The redundant per-workflow DB queries in reconcileWorkspaceAccessChange and the TTL thundering-herd are real but lower-urgency operational concerns.

apps/sim/lib/workspaces/permissions/realtime.ts (missing fetch timeout) and apps/realtime/src/rooms/access.ts (redundant DB queries per workflow room).

Important Files Changed

Filename Overview
apps/realtime/src/middleware/permissions.ts New authorizeSocketOperation function adds TTL-bounded DB re-validation of cached roles; thundering herd possible at TTL boundary for sockets with burst traffic.
apps/sim/lib/workspaces/permissions/realtime.ts New best-effort notifier to the realtime server; lacks an AbortSignal timeout, so a stalled realtime connection can block the permission-change API response indefinitely.
apps/realtime/src/rooms/access.ts New reconcileWorkspaceAccessChange function; calls verifyWorkflowAccess per workflow room (two DB queries each) when a single workspace-permission lookup would suffice for all workflows.
apps/realtime/src/routes/http.ts New POST /api/permissions-updated endpoint; protected by the existing checkInternalApiKey guard but accepts empty strings for workspaceId/userId.
apps/realtime/src/rooms/redis-manager.ts New updateUserRole Lua script atomically patches role and roleCheckedAt in the presence hash; NOSCRIPT retry path is correct.
apps/realtime/src/handlers/operations.ts Replaces synchronous checkRolePermission with async authorizeSocketOperation; eviction path on accessRevoked is complete and correct.

Sequence Diagram

sequenceDiagram
    participant Admin as Admin (PATCH /permissions)
    participant SimAPI as sim API
    participant RealtimeHTTP as Realtime HTTP
    participant RoomManager as RoomManager
    participant Socket as Revoked Socket

    Admin->>SimAPI: PATCH /api/workspaces/[id]/permissions
    SimAPI->>SimAPI: Commit DB transaction (update permissions row)
    SimAPI->>RealtimeHTTP: POST /api/permissions-updated
    RealtimeHTTP->>RoomManager: handleWorkspaceAccessChange()
    RoomManager->>RoomManager: reconcileWorkspaceAccessChange()
    RoomManager->>SimAPI: verifyWorkflowAccess(userId, workflowId)
    alt Access revoked
        RoomManager->>Socket: emit(workflow-permissions-revoked)
        RoomManager->>RoomManager: removeUserFromRoom()
        RoomManager->>Socket: socketsLeave(workflowId)
    else Role downgraded
        RoomManager->>RoomManager: updateUserRole() [Lua atomic]
        RoomManager->>RoomManager: broadcastPresenceUpdate()
    end
    RealtimeHTTP-->>SimAPI: 200 OK
    SimAPI-->>Admin: 200 OK

    Note over Socket: TTL path on next mutating event
    Socket->>RoomManager: workflow-operation
    RoomManager->>RoomManager: authorizeSocketOperation()
    alt roleCheckedAt stale
        RoomManager->>SimAPI: verifyWorkflowAccess() [DB re-validation]
        alt Access revoked
            RoomManager->>Socket: emit(ACCESS_REVOKED)
            RoomManager->>RoomManager: evictRevokedSocket()
        else Role changed
            RoomManager->>RoomManager: updateUserRole() [Lua atomic]
        end
    end
Loading

Reviews (1): Last reviewed commit: "fix(realtime): re-validate socket role a..." | Re-trigger Greptile

Comment on lines +20 to +33
const response = await fetch(`${getSocketServerUrl()}/api/permissions-updated`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': env.INTERNAL_API_SECRET,
},
body: JSON.stringify({ workspaceId, userId }),
})

if (!response.ok) {
logger.warn(
`Failed to notify realtime of access change for user ${userId} in workspace ${workspaceId} (${response.status})`
)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing fetch timeout blocks the caller

notifyWorkspaceAccessChanged is awaited synchronously in both the PATCH /api/workspaces/[id]/permissions handler and the DELETE /api/workspaces/members/[id] handler. If the realtime server accepts the TCP connection but never sends a response (e.g., it is under load, mid-restart, or hanging), fetch has no built-in deadline and the permission-change HTTP response will hang indefinitely — contradicting the "failures are logged but never block the permission mutation" promise in the JSDoc.

Add an AbortSignal.timeout() to bound the worst case: signal: AbortSignal.timeout(3000)

Comment on lines +48 to +59
const access = await verifyWorkflowAccess(userId, workflowId)

if (access.hasAccess) {
const role = access.role || 'read'
for (const target of targets) {
await manager.updateUserRole(workflowId, target.socketId, role)
}
await manager.broadcastPresenceUpdate(workflowId)
logger.info(
`Refreshed cached role to '${role}' for user ${userId} on workflow ${workflowId}`
)
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant per-workflow DB queries for workspace-level permission

verifyWorkflowAccess issues two DB queries per call: one to look up the workflow row and one to check the workspace permission via authorizeWorkflowByWorkspacePermission. Because workspace permission is a single entry in the permissions table that applies uniformly to every workflow in the workspace, calling it inside the per-workflow loop means the same permission row is fetched N times — once for each active room the user is present in. For a user simultaneously in 10+ workflow rooms, reconciliation makes 20+ sequential DB round-trips when 2 would suffice.

A targeted optimization would check the workspace-level access once before the loop and skip the per-workflow permission query, using only the workflow-existence check inside the loop.

Comment on lines +159 to +162
if (typeof workspaceId !== 'string' || typeof userId !== 'string') {
sendError(res, 'workspaceId and userId are required', 400)
return
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty strings bypass the type check and reach the room manager

typeof workspaceId !== 'string' accepts "" as valid, which would then call handleWorkspaceAccessChange('', ''). The DB query filters by workspaceId, so it returns 0 rows and the call is harmless today — but an empty-string guard costs nothing and makes the contract explicit.

Suggested change
if (typeof workspaceId !== 'string' || typeof userId !== 'string') {
sendError(res, 'workspaceId and userId are required', 400)
return
}
if (typeof workspaceId !== 'string' || typeof userId !== 'string' ||
!workspaceId || !userId) {
sendError(res, 'workspaceId and userId are required', 400)
return
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +119 to +140
const isStale = Date.now() - lastChecked >= ROLE_REVALIDATION_TTL_MS

if (isStale) {
try {
const access = await verifyWorkflowAccess(userId, workflowId)
if (!access.hasAccess) {
return {
allowed: false,
role,
reason: 'Access to this workflow has been revoked',
accessRevoked: true,
}
}
role = access.role || 'read'
await roomManager.updateUserRole(workflowId, socketId, role)
} catch (error) {
logger.warn(
`Failed to re-validate role for user ${userId} on workflow ${workflowId}; reusing cached role`,
error
)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Thundering herd of DB queries at each TTL boundary

When a socket is actively sending events (e.g., a large paste generates tens of workflow-operation frames in rapid succession), all inflight handlers snapshot presence.roleCheckedAt before any of them can write it back. Every handler sees isStale = true and concurrently calls verifyWorkflowAccess (two DB round-trips each). The first response to return then updates the room-manager timestamp, but the remaining in-flight queries are already underway.

Because TTL expiry is aligned to wall-clock time and batched edits are common in collaborative editors, this burst fires on every 15-second boundary for active sockets. The side-effect is harmless but wastes DB connections. A per-socket in-memory flag (e.g., a Set keyed by socketId) that is cleared on completion could collapse concurrent revalidations to one query.

@waleedlatif1 waleedlatif1 merged commit 4ab8760 into staging Jun 15, 2026
15 checks passed
@waleedlatif1 waleedlatif1 deleted the worktree-fix+realtime-stale-role-revocation branch June 15, 2026 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant