Feat/mcp resilience4j #1033

Open
Pratyay wants to merge 5 commits into modelcontextprotocol:main from Pratyay:feat/mcp-resilience4j

@Pratyay Pratyay commented Jun 23, 2026

Title: Add mcp-resilience4j module with transport-level resilience

Adds a new mcp-resilience4j module that wraps any McpClientTransport with
configurable Resilience4j policies, making MCP tool calls resilient to transient
failures, slow servers, and traffic spikes.

Motivation and Context

MCP tool calls cross a network. Without resilience, a slow or flaky MCP server
can cause cascading failures in AI agent pipelines: threads block indefinitely,
clients repeatedly hammer a server that cannot recover, or a burst of parallel
tool invocations overwhelms a rate-limited endpoint.

McpClientTransport is the natural integration point: it is the single boundary
all MCP clients share, it is the interface frameworks like Google ADK expose for
custom transport injection, and wrapping it leaves the rest of the MCP client
stack entirely unchanged.
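
The wrapping idea can be sketched in plain Java. Everything here is an assumption made for illustration: `Transport` is a simplified, synchronous stand-in for McpClientTransport (whose real methods are reactive and are not reproduced), and `WrappingTransport` is a hypothetical name, not the module's ResilientMcpClientTransport. The sketch only shows why a decorator at this boundary leaves callers untouched.

```java
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical, synchronous stand-in for McpClientTransport (the real interface is reactive).
interface Transport {
    String sendMessage(String message);
}

// Decorator: implements the same interface as the delegate, so the client stack is unchanged.
final class WrappingTransport implements Transport {
    private final Transport delegate;
    private final UnaryOperator<Supplier<String>> policy;

    WrappingTransport(Transport delegate, UnaryOperator<Supplier<String>> policy) {
        this.delegate = delegate;
        this.policy = policy;
    }

    @Override
    public String sendMessage(String message) {
        // Turn the raw call into a deferred call, then let the policy guard it.
        Supplier<String> rawCall = () -> delegate.sendMessage(message);
        return policy.apply(rawCall).get();
    }
}

final class WrappingDemo {
    static String run() {
        Transport raw = message -> "echo:" + message;
        // A trivial "policy" that tags the result; it stands in for retry, circuit breaking, etc.
        Transport wrapped = new WrappingTransport(raw, call -> () -> "[guarded] " + call.get());
        return wrapped.sendMessage("ping");
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "[guarded] echo:ping"
    }
}
```

Code that holds a `Transport` reference cannot tell whether it received the raw or the wrapped instance, which is the property the transport-level design relies on.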

How Has This Been Tested

13 unit tests covering:

  • No-op delegation when no policies are configured (all McpClientTransport methods)
  • Retry on transient failure — verifies the delegate is called the correct number of times
  • Retry exhaustion — verifies the last exception is propagated
  • Circuit breaker opens after the sliding window fills with failures; subsequent
    call receives CallNotPermittedException
  • TimeLimiterOperator cancels a slow operation using virtual time (StepVerifier.withVirtualTime)
  • getCircuitBreaker() accessor returns null when not configured and a live instance when configured

All 13 tests pass locally (mvn test -pl mcp-resilience4j).
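
The circuit breaker scenario in the list above can be illustrated with a toy count-based breaker. This is a sketch under stated assumptions, not the PR's test code and not Resilience4j's implementation: `TinyCircuitBreaker` and `CallRejectedException` are hypothetical names (the latter standing in for CallNotPermittedException), and the half-open state and wait duration are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Toy count-based circuit breaker; hypothetical stand-in for the Resilience4j CircuitBreaker.
final class TinyCircuitBreaker {
    // Stand-in for CallNotPermittedException.
    static final class CallRejectedException extends RuntimeException {
        CallRejectedException() { super("circuit is open"); }
    }

    private final int windowSize;
    private final double failureRateThreshold; // fraction in [0, 1]
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private boolean open;

    TinyCircuitBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    <T> T execute(Supplier<T> call) {
        if (open) throw new CallRejectedException(); // delegate is not invoked
        try {
            T result = call.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        // The failure rate is only evaluated once the window is full.
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            open = (double) failures / windowSize >= failureRateThreshold;
        }
    }
}

final class CircuitBreakerDemo {
    // Returns {delegate invocations, rejected calls} after four attempts against a dead server.
    static int[] run() {
        TinyCircuitBreaker breaker = new TinyCircuitBreaker(3, 0.5);
        int[] delegateCalls = {0};
        int rejected = 0;
        Supplier<String> failing = () -> {
            delegateCalls[0]++;
            throw new IllegalStateException("server down");
        };
        for (int i = 0; i < 4; i++) {
            try {
                breaker.execute(failing);
            } catch (TinyCircuitBreaker.CallRejectedException e) {
                rejected++; // the breaker short-circuited the call
            } catch (IllegalStateException e) {
                // delegate failure, already recorded in the window
            }
        }
        return new int[] {delegateCalls[0], rejected};
    }
}
```

With a window of three, the first three failures fill the window and open the breaker, so the fourth call is rejected without reaching the delegate: `run()` returns `{3, 1}`.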

Breaking Changes

None. This is a new optional module. Existing code and dependencies are unchanged.

Types of Changes

  • New feature (non-breaking change which adds functionality)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional Context

Policy ordering — Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead
follows the standard Resilience4j recommended hierarchy. Retry is outermost so it
orchestrates the full inner chain per attempt. Bulkhead is innermost so concurrency
slots are released during Retry's backoff sleep rather than held, preventing slot
exhaustion from blocking healthy concurrent callers. RateLimiter is inside Retry so
each retry attempt consumes a token, keeping the local rate count aligned with actual
server-side request volume.
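
The effect of this ordering can be traced with a small plain-Java sketch. The names (`Layer`, `PolicyOrderDemo`, `traced`, `retry`) are hypothetical and the layers are logging stand-ins rather than real Resilience4j operators; the retry has no backoff. The sketch only demonstrates that an outermost retry re-runs the whole inner chain on each attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// A policy layer turns one deferred call into a guarded deferred call.
interface Layer extends UnaryOperator<Supplier<String>> {}

final class PolicyOrderDemo {
    // Composes layers so that the first list element ends up outermost.
    static Supplier<String> compose(List<Layer> outerToInner, Supplier<String> call) {
        Supplier<String> result = call;
        for (int i = outerToInner.size() - 1; i >= 0; i--) {
            result = outerToInner.get(i).apply(result);
        }
        return result;
    }

    // Stand-in for a policy: only records when a call enters and leaves it.
    static Layer traced(String name, List<String> log) {
        return inner -> () -> {
            log.add("enter " + name);
            try {
                return inner.get();
            } finally {
                log.add("exit " + name);
            }
        };
    }

    // Minimal retry without backoff: re-invokes the whole inner chain on failure.
    static Layer retry(int maxAttempts, List<String> log) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        return inner -> () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                log.add("retry attempt " + attempt);
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last; // retries exhausted: propagate the last failure
        };
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        int[] calls = {0};
        // Fails on the first invocation, succeeds on the second.
        Supplier<String> flaky = () -> {
            if (++calls[0] < 2) throw new IllegalStateException("transient failure");
            return "ok";
        };
        List<Layer> layers = List.of(
                retry(3, log),
                traced("CircuitBreaker", log),
                traced("RateLimiter", log),
                traced("TimeLimiter", log),
                traced("Bulkhead", log));
        compose(layers, flaky).get();
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this run the flaky call fails once, so the log shows two passes through the inner chain: the rate limiter is entered twice (one token per attempt), and the bulkhead is exited before the second retry attempt begins.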

sendMessage() applies all five policies. connect() applies only CircuitBreaker
and Retry; session establishment is not throttled or timed out.

Why not a client-level wrapper? An earlier design explored wrapping McpAsyncClient
directly. This was removed because McpAsyncClient has a package-private constructor
(not subclassable), and frameworks like Google ADK create McpSyncClient internally
with no injection point for a custom async client. The transport is the only hook
these frameworks expose.

ThreadPoolBulkhead is intentionally excluded. The semaphore Bulkhead is the right
fit for reactive code; injecting a ThreadPoolBulkheadOperator would force a
thread-pool handoff inside the reactive chain, competing with Reactor's own schedulers.
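
A semaphore-style bulkhead can be sketched with `java.util.concurrent.Semaphore`. This is an illustration under assumptions, not the module's code: `SemaphoreBulkhead` is a hypothetical class, the call is synchronous rather than reactive, and rejection uses IllegalStateException where Resilience4j would use its own exception type. The point is that the guarded call runs on the caller's thread and is rejected immediately when no permit is free.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: limits concurrency without moving work to another thread.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call) {
        // Non-blocking: reject immediately instead of parking the calling thread.
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get(); // runs on the caller's thread
        } finally {
            permits.release();
        }
    }
}

final class BulkheadDemo {
    static String run() {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        Thread caller = Thread.currentThread();
        return bulkhead.execute(() -> {
            boolean sameThread = Thread.currentThread() == caller;
            String nested;
            try {
                // The only permit is held by the outer call, so this one is rejected.
                nested = bulkhead.execute(() -> "admitted");
            } catch (IllegalStateException e) {
                nested = "rejected";
            }
            return "sameThread=" + sameThread + ", nested=" + nested;
        });
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "sameThread=true, nested=rejected"
    }
}
```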

Registry name collisions. Resilience4j registries silently return the cached
instance when a name already exists, ignoring any supplied config. The builder logs
a WARN when this is detected. Callers sharing a registry across multiple transports
must use unique names per transport.
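
The collision behaviour can be reproduced with a toy registry built on `ConcurrentHashMap.computeIfAbsent`. All types here (`ToyRegistry`, `NamedBreaker`, `BreakerConfig`) are hypothetical stand-ins, and the identity comparison used to detect the collision is an assumption for illustration; how the builder actually detects it is not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for a policy config and a named policy instance.
final class BreakerConfig {
    final int slidingWindowSize;
    BreakerConfig(int slidingWindowSize) { this.slidingWindowSize = slidingWindowSize; }
}

final class NamedBreaker {
    final String name;
    final BreakerConfig config;
    NamedBreaker(String name, BreakerConfig config) { this.name = name; this.config = config; }
}

// Toy registry: get-or-create by name, so a second config for the same name is never applied.
final class ToyRegistry {
    private final Map<String, NamedBreaker> entries = new ConcurrentHashMap<>();
    final List<String> warnings = new ArrayList<>();

    NamedBreaker breaker(String name, BreakerConfig config) {
        NamedBreaker result = entries.computeIfAbsent(name, n -> new NamedBreaker(n, config));
        // Assumed detection strategy: the returned instance carries a different config object.
        if (result.config != config) {
            warnings.add("WARN: '" + name + "' already exists; supplied config ignored");
        }
        return result;
    }
}

final class RegistryCollisionDemo {
    // Returns {window size seen by b, window size seen by c, warning count, 1 if a and b are the same instance}.
    static int[] run() {
        ToyRegistry registry = new ToyRegistry();
        NamedBreaker a = registry.breaker("mcp", new BreakerConfig(10));
        NamedBreaker b = registry.breaker("mcp", new BreakerConfig(50));          // name collision
        NamedBreaker c = registry.breaker("mcp-server-b", new BreakerConfig(50)); // unique name
        return new int[] {
            b.config.slidingWindowSize,
            c.config.slidingWindowSize,
            registry.warnings.size(),
            a == b ? 1 : 0
        };
    }
}
```

Here the second request for "mcp" gets the first instance back with its window of 10, and one warning is recorded; the uniquely named "mcp-server-b" receives the config it asked for.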

Pratyay and others added 5 commits June 23, 2026 14:04
Adds a new mcp-resilience4j module providing circuit breaking, retry,
rate limiting, time limiting, and bulkhead policies for McpClientTransport.

- ResilientMcpClientTransport: decorator wrapping any McpClientTransport
  with all five Resilience4j policies on sendMessage(), CB+Retry only on
  connect(). Policy order follows the standard hierarchy:
  Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead.
- McpResilienceConfig: high-level fluent facade for configuring the
  transport wrapper via config objects or shared registries.
- 13 unit tests covering delegation, retry, circuit breaker, time limiter,
  and all transparent-delegation methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the five policies, their ordering rationale, quick-start
examples for both McpResilienceConfig and the direct builder, registry
usage with the name-collision warning, built-in observability logging,
and a Google ADK integration pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>