MCP server connection failures are silently swallowed and not retried #24947

@Emyrk

Problem

When an MCP server configured in .mcp.json (or via the admin MCP config) fails to connect, the failure is logged at Warn level and the server is skipped. The error is never surfaced to the user: the server's tools simply don't appear. Because the only trace is a Warn-level log line, it is hard to debug why expected tools are missing.

Additionally, failed servers are not retried unless the config file changes on disk (triggering a SnapshotChanged reload). A transient failure (e.g. server not yet ready, network blip) requires a manual config file touch or workspace restart to recover.

Where this happens

Agent-side (agent/x/agentmcp/manager.go)

connectAll() catches the error from connectServer(), logs it, and returns nil from the errgroup:

c, err := m.connectServer(ctx, cfg)
if err != nil {
    m.logger.Warn(ctx, "skipping MCP server",
        slog.F("server", cfg.Name),
        slog.F("transport", cfg.Transport),
        slog.Error(err),
    )
    return nil // Don't fail the group.
}

installServers() will retain a previous client on reconnect failure, but on the first connect there is no previous client — the server is simply absent.

Chatd-side (coderd/x/chatd/mcpclient/mcpclient.go)

ConnectAll() similarly logs and swallows:

if connectErr != nil {
    logger.Warn(ctx,
        "skipping MCP server due to connection failure",
        slog.F("server_slug", cfg.Slug),
        slog.F("server_url", RedactURL(cfg.Url)),
        slog.F("error", redactErrorURL(connectErr)),
    )
    return nil
}
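
Both call sites drop the error once it has been logged. One way to keep it around, as a rough sketch only: collect failures per server while still letting the other servers connect. `ServerConfig`, `connectFunc`, and this `connectAll` signature are hypothetical stand-ins, not the real types in manager.go or mcpclient.go, and a plain `sync.WaitGroup` is used here instead of the errgroup so the example stays stdlib-only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"sync"
)

// ServerConfig is a hypothetical stand-in for the real MCP server config.
type ServerConfig struct{ Name string }

// connectFunc is a hypothetical stand-in for connectServer().
type connectFunc func(ctx context.Context, cfg ServerConfig) error

// connectAll connects to every server concurrently. A failure does not stop
// the other connections, but it is recorded instead of only being logged.
func connectAll(ctx context.Context, cfgs []ServerConfig, connect connectFunc) (connected []string, failed map[string]error) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	failed = make(map[string]error)
	for _, cfg := range cfgs {
		wg.Add(1)
		go func(cfg ServerConfig) {
			defer wg.Done()
			err := connect(ctx, cfg)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed[cfg.Name] = err // Keep the error for later reporting.
				return
			}
			connected = append(connected, cfg.Name)
		}(cfg)
	}
	wg.Wait()
	sort.Strings(connected) // Stable order regardless of goroutine scheduling.
	return connected, failed
}

// runDemo connects to one healthy and one failing fake server.
func runDemo() ([]string, map[string]error) {
	connect := func(_ context.Context, cfg ServerConfig) error {
		if cfg.Name == "flaky" {
			return errors.New("connection refused")
		}
		return nil
	}
	cfgs := []ServerConfig{{Name: "docs"}, {Name: "flaky"}}
	return connectAll(context.Background(), cfgs, connect)
}

func main() {
	connected, failed := runDemo()
	fmt.Println("connected:", connected)
	for name, err := range failed {
		fmt.Printf("failed: %s: %v\n", name, err)
	}
}
```

The `failed` map is what a caller would need in order to report or retry the skipped servers. Where it should live (manager state, API response, both) is left open here.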

Current API gap

ListMCPToolsResponse only contains Tools []MCPToolInfo — there is no field for failed servers. Callers cannot distinguish "no MCP servers configured" from "configured but all failed to connect."
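
For illustration, the gap could be closed with an extra field on the response. This is a sketch under assumptions, not a proposed final shape: `FailedServers`, `MCPServerFailure`, the JSON tags, and the `describe` helper are all hypothetical, and `MCPToolInfo` is reduced to a single assumed field because its real definition is not shown in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MCPToolInfo is a placeholder; the real type's fields are not shown here.
type MCPToolInfo struct {
	Name string `json:"name"`
}

// MCPServerFailure is a hypothetical record of one failed connection.
type MCPServerFailure struct {
	Server string `json:"server"`
	Error  string `json:"error"`
}

// ListMCPToolsResponse mirrors the existing Tools field and adds a
// hypothetical FailedServers field.
type ListMCPToolsResponse struct {
	Tools         []MCPToolInfo      `json:"tools"`
	FailedServers []MCPServerFailure `json:"failed_servers,omitempty"`
}

// describe shows how a caller could tell the cases apart.
func describe(r ListMCPToolsResponse) string {
	switch {
	case len(r.Tools) == 0 && len(r.FailedServers) > 0:
		return "no tools: at least one configured server failed to connect"
	case len(r.Tools) == 0:
		return "no tools and no connection failures"
	case len(r.FailedServers) > 0:
		return "some tools available, some servers failed"
	default:
		return "all servers connected"
	}
}

func main() {
	resp := ListMCPToolsResponse{
		FailedServers: []MCPServerFailure{{Server: "github", Error: "connection refused"}},
	}
	out, err := json.Marshal(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(describe(resp))
}
```

With `omitempty`, existing clients that only read the tools list would see no change when every server connects.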

Desired behavior

  1. Surface failures: Connection errors should be visible to the user somewhere (UI design TBD — could be a chat system message, workspace health indicator, agent status panel, etc.).
  2. Retry failed servers: Servers that fail to connect should be periodically retried rather than requiring a config file change to trigger a reload.
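
To make item 2 concrete, here is a minimal retry sketch. It uses capped exponential backoff, which is my assumption about what "periodically" could mean and not something this issue specifies. `retryConnect`, `nextBackoff`, and the delay values are hypothetical, and the `connect` callback stands in for whatever connectServer() does for a single server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nextBackoff doubles the delay, capped at maxDelay.
func nextBackoff(cur, maxDelay time.Duration) time.Duration {
	next := cur * 2
	if next > maxDelay {
		return maxDelay
	}
	return next
}

// retryConnect calls connect until it succeeds or ctx is done. It returns the
// number of attempts made and, if it gave up, the last connection error.
func retryConnect(ctx context.Context, connect func(context.Context) error, initial, maxDelay time.Duration) (int, error) {
	delay := initial
	for attempt := 1; ; attempt++ {
		err := connect(ctx)
		if err == nil {
			return attempt, nil
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return attempt, fmt.Errorf("gave up after %d attempts: %w", attempt, err)
		case <-timer.C:
		}
		delay = nextBackoff(delay, maxDelay)
	}
}

// runDemo simulates a server that becomes ready on the third attempt.
func runDemo() (int, error) {
	calls := 0
	connect := func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("server not ready")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return retryConnect(ctx, connect, time.Millisecond, 4*time.Millisecond)
}

func main() {
	attempts, err := runDemo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

A real implementation would likely run one such loop per failed server, stop it when the config reloads, and add jitter so many workspaces don't retry in lockstep.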

🤖 Generated with Coder Agents
