Skip to content

Evict streamable HTTP sessions after failed keep-alive pings#1028

Open
lxq19991111 wants to merge 1 commit into
modelcontextprotocol:mainfrom
lxq19991111:fix/keepalive-dead-session-eviction-clean
Open

Evict streamable HTTP sessions after failed keep-alive pings#1028
lxq19991111 wants to merge 1 commit into
modelcontextprotocol:mainfrom
lxq19991111:fix/keepalive-dead-session-eviction-clean

Conversation

@lxq19991111

@lxq19991111 lxq19991111 commented Jun 14, 2026

Copy link
Copy Markdown

Summary

Fixes #1022.

Evict dead Streamable HTTP sessions after repeated keep-alive ping failures.

When keep-alive-interval is enabled, KeepAliveScheduler currently detects dead sessions by logging failed pings, but the failed sessions remain in the Streamable HTTP provider's session registry. Over time, the scheduler keeps pinging the same unreachable sessions every interval, causing repeated WARN logs and CPU usage that grows with the number of accumulated dead sessions.

This PR adds keep-alive success/failure hooks and lets the Streamable HTTP servlet provider evict the logical session after consecutive keep-alive failures.

Dependency / Review Order

Depends on #1027.

#1027 establishes the transport-level lifecycle behavior: servlet async lifecycle events and SSE write failures close only the current HTTP/SSE transport while keeping the logical MCP session registered.

This PR builds on that behavior by evicting the logical MCP session only after repeated keep-alive ping failures. After #1027 lands, this PR should be rebased on top of it so the keep-alive eviction behavior does not conflict with transport-level write-failure cleanup.

Related Issues and PRs

Changes

  • Add success and failure callbacks to KeepAliveScheduler.
  • Keep the new scheduler callbacks backward compatible by defaulting them to no-op handlers.
  • Track consecutive keep-alive ping failures in HttpServletStreamableServerTransportProvider.
  • Reset the failure count after a successful keep-alive ping.
  • Evict and close a Streamable HTTP session after 3 consecutive keep-alive ping failures.
  • Clear failure counts on session deletion and provider shutdown.
  • Guard failure-count reset and eviction with session identity checks so stale keep-alive results cannot affect a replaced session.
  • Add focused tests for scheduler callbacks and Streamable HTTP keep-alive eviction behavior.

Rationale

The issue suggested doing the failure counter and session removal directly in KeepAliveScheduler. This PR intentionally keeps KeepAliveScheduler transport-agnostic.

KeepAliveScheduler only receives a Supplier<Flux<McpSession>>. It does not own the Streamable HTTP provider's session registry, and it cannot safely remove entries from that registry directly.

The Streamable HTTP servlet provider owns the sessions map, so it also owns eviction. The scheduler reports ping success or failure through hooks, and the provider decides whether to reset the failure count, keep the session, or evict it.

This PR also avoids evicting on the first failure. A single failed ping can be transient, especially around load balancer timeouts, reconnects, request-specific streams, or temporary transport unavailability. Requiring 3 consecutive failures provides a conservative boundary between transient transport failure and a logical session that is no longer reachable.

Session TTL is intentionally left out of this PR. TTL may still be useful as a separate defense-in-depth policy, but this PR focuses only on sessions that keep-alive has already identified as repeatedly unreachable.

Scope

This PR intentionally focuses on:

  • KeepAliveScheduler
  • Streamable HTTP servlet session eviction in HttpServletStreamableServerTransportProvider

Out of scope:

How Has This Been Tested?

mvn -pl mcp-core -Dtest=KeepAliveSchedulerTests,HttpServletStreamableServerTransportProviderTests test
mvn -pl mcp-core test

The added scheduler tests cover:

  • null success/failure callback validation
  • success callback invocation after successful ping
  • failure callback invocation after failed ping
  • callback exceptions do not stop the scheduler

The added provider tests cover:

  • first keep-alive failure does not evict the session
  • repeated keep-alive failures evict and close the session
  • successful keep-alive resets the failure count
  • stale failures from an old session instance do not evict a replacement session with the same id

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

@ShemTovYosef

Copy link
Copy Markdown

Does it take into account this issue too? #972

@lxq19991111

lxq19991111 commented Jun 18, 2026

Copy link
Copy Markdown
Author

Hi @ShemTovYosef, thanks for pointing this out.

Not directly. #972 addresses a transport-level write failure case: a request-specific SSE response write failure should not immediately remove the logical MCP session.

This PR is targeting a different layer: logical session eviction after repeated keep-alive ping failures, which is intended to identify sessions that are actually no longer reachable.

The layering I’m aiming for is:

So I’ve marked #1028 as depending on #1027. After #1027 lands, I can rebase this PR on top of it so the keep-alive eviction behavior does not conflict with the transport lifecycle cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] KeepAliveScheduler does not evict sessions after ping failure, causing unbounded CPU growth

2 participants