Evict streamable HTTP sessions after failed keep-alive pings#1028
Open
lxq19991111 wants to merge 1 commit into
Open
Evict streamable HTTP sessions after failed keep-alive pings#1028lxq19991111 wants to merge 1 commit into
lxq19991111 wants to merge 1 commit into
Conversation
This was referenced Jun 14, 2026
|
Does it take into account this issue too? #972 |
Author
|
Hi @ShemTovYosef, thanks for pointing this out. Not directly. #972 addresses a transport-level write failure case: a request-specific SSE response write failure should not immediately remove the logical MCP session. This PR is targeting a different layer: logical session eviction after repeated keep-alive ping failures, which is intended to identify sessions that are actually no longer reachable. The layering I’m aiming for is:
So I’ve marked #1028 as depending on #1027. After #1027 lands, I can rebase this PR on top of it so the keep-alive eviction behavior does not conflict with the transport lifecycle cleanup. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1022.
Evict dead Streamable HTTP sessions after repeated keep-alive ping failures.
When
keep-alive-intervalis enabled,KeepAliveSchedulercurrently detects dead sessions by logging failed pings, but the failed sessions remain in the Streamable HTTP provider's session registry. Over time, the scheduler keeps pinging the same unreachable sessions every interval, causing repeated WARN logs and CPU usage that grows with the number of accumulated dead sessions.This PR adds keep-alive success/failure hooks and lets the Streamable HTTP servlet provider evict the logical session after consecutive keep-alive failures.
Dependency / Review Order
Depends on #1027.
#1027 establishes the transport-level lifecycle behavior: servlet async lifecycle events and SSE write failures close only the current HTTP/SSE transport while keeping the logical MCP session registered.
This PR builds on that behavior by evicting the logical MCP session only after repeated keep-alive ping failures. After #1027 lands, this PR should be rebased on top of it so the keep-alive eviction behavior does not conflict with transport-level write-failure cleanup.
Related Issues and PRs
HttpServletStreamableServerTransportProviderremoves session on non-fatal failure #952 and fix: retain streamable HTTP session on response write failure #972: this PR should not be interpreted as deleting the logical session on a single request-specific write failure. That behavior belongs to Close servlet streamable HTTP transports on async lifecycle events #1027 / fix: retain streamable HTTP session on response write failure #972.Changes
KeepAliveScheduler.HttpServletStreamableServerTransportProvider.Rationale
The issue suggested doing the failure counter and session removal directly in
KeepAliveScheduler. This PR intentionally keepsKeepAliveSchedulertransport-agnostic.KeepAliveScheduleronly receives aSupplier<Flux<McpSession>>. It does not own the Streamable HTTP provider's session registry, and it cannot safely remove entries from that registry directly.The Streamable HTTP servlet provider owns the
sessionsmap, so it also owns eviction. The scheduler reports ping success or failure through hooks, and the provider decides whether to reset the failure count, keep the session, or evict it.This PR also avoids evicting on the first failure. A single failed ping can be transient, especially around load balancer timeouts, reconnects, request-specific streams, or temporary transport unavailability. Requiring 3 consecutive failures provides a conservative boundary between transient transport failure and a logical session that is no longer reachable.
Session TTL is intentionally left out of this PR. TTL may still be useful as a separate defense-in-depth policy, but this PR focuses only on sessions that keep-alive has already identified as repeatedly unreachable.
Scope
This PR intentionally focuses on:
KeepAliveSchedulerHttpServletStreamableServerTransportProviderOut of scope:
HttpServletSseServerTransportProviderHow Has This Been Tested?
The added scheduler tests cover:
The added provider tests cover:
Types of changes
Checklist