
feat(session): add configurable fallback model chain#27939

Open
loss-and-quick wants to merge 3 commits into anomalyco:dev from loss-and-quick:feat/fallback-chain

Conversation

@loss-and-quick

Issue for this PR

Closes #7602

Related: refactors and extends the approach from #26292.

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Adds a configurable fallback chain: when the primary model returns a retryable error (rate limit, 5xx, overload, quota), the session switches to the next model in the chain instead of failing. The failed model is parked in a process-local cooldown so subsequent turns skip it until it recovers, then traffic flows back to primary on its own.

{
  "model": "anthropic/claude-sonnet-4-20250514",
  "fallbacks": ["openai/gpt-4.1", "deepseek/deepseek-v4"],
  "cooldown_seconds": 300
}

fallbacks can be set at the top level (applies to the main chat agent) or per-agent. cooldown_seconds defaults to 300. Quota errors (weekly / monthly / "exceeded your …") take a hardcoded 6h cooldown instead, and retry-after / retry-after-ms headers are honoured when the provider sends them.
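The cooldown bookkeeping described above can be sketched roughly like this (a minimal sketch — `CooldownTracker`, `markFailed`, and `pick` are illustrative names, not the PR's actual API):

```typescript
type ErrorKind = "rate_limit" | "server" | "overload" | "quota";

const DEFAULT_COOLDOWN_MS = 300_000;     // cooldown_seconds default (300s)
const QUOTA_COOLDOWN_MS = 6 * 3_600_000; // hardcoded 6h for quota errors

class CooldownTracker {
  private until = new Map<string, number>();

  markFailed(model: string, kind: ErrorKind, retryAfterMs?: number, now = Date.now()) {
    // A provider-sent retry-after / retry-after-ms wins when present;
    // otherwise quota errors park the model for 6h, everything else for the default.
    const ms = retryAfterMs ?? (kind === "quota" ? QUOTA_COOLDOWN_MS : DEFAULT_COOLDOWN_MS);
    this.until.set(model, now + ms);
  }

  // First model in the chain not currently cooling down; undefined if all are parked.
  pick(chain: string[], now = Date.now()): string | undefined {
    return chain.find((m) => (this.until.get(m) ?? 0) <= now);
  }
}
```

Since the map is process-local, cooldowns reset on restart, which matches the "parked in a process-local cooldown" behaviour.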

This is a continuation of #26292 with three correctness fixes that came up while running it:

  1. Model attribution actually updates. The original patch communicated "we used a fallback" by mutating the StreamInput object, but llm.ts spreads that object on the way in (run({ ...input, abort })) so the mutation never reached prompt.ts. As a result the DB kept the primary model, wasOnFallback was always false, the "Switched to …" toast fired on every subsequent turn, and the model name under each assistant message didn't change.

    Fix: withFallback publishes a FallbackUsed bus event instead. SessionProcessor.process subscribes for the lifetime of one process call and updates ctx.assistantMessage.modelID/providerID + publishes SessionEvent.Model.Updated. The subscription is released via Effect.ensuring. This also makes the flow work for subsessions and title generation, which both go through the same processor.

  2. Hang after the first fallback. If every model in the chain was on cooldown, pickStart returned a wait decision but the caller proceeded straight into deps.call(primary) without actually waiting — on some providers (notably self-hosted) this hung on a kept-alive socket. Replaced with a real Effect.sleep bounded by WAIT_CAP_MS = 30s, then re-pick the start entry.

  3. Toast spam. The original patch fired both FallbackTriggered and FallbackUsed toasts, and there is also an inline ~> Switching to … notice in the message stream — three notifications for one fallback. Kept only the FallbackTriggered warning toast; dropped the FallbackUsed info toast since the inline notice already shows the same info with attribution to the specific message.
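The spread-vs-mutation problem behind fix 1 can be reproduced in a few lines (shapes are hypothetical; `run` stands in for the llm.ts call site):

```typescript
interface StreamInput { modelID: string }

function run(input: StreamInput): StreamInput {
  const copy = { ...input };       // llm.ts does run({ ...input, abort })
  copy.modelID = "fallback-model"; // the fallback path's mutation lands on the copy
  return copy;
}

const original: StreamInput = { modelID: "primary-model" };
run(original);
// original.modelID is still "primary-model": the caller (prompt.ts) never sees
// the switch, which is why the PR moves attribution to a FallbackUsed bus event.
```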
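Fix 2's bounded wait can be sketched without the Effect machinery (a plain-Promise stand-in under assumed shapes — `pickWithWait` and `Decision` are illustrative names, and the PR uses Effect.sleep rather than setTimeout):

```typescript
type Decision = { kind: "model"; model: string } | { kind: "wait"; ms: number };

const WAIT_CAP_MS = 30_000;

async function pickWithWait(pickStart: () => Decision): Promise<string> {
  let decision = pickStart();
  if (decision.kind === "wait") {
    // Previously the caller fell straight through to deps.call(primary) here;
    // now we actually sleep, capped at 30s, then re-pick the start entry.
    const waitMs = Math.min(decision.ms, WAIT_CAP_MS);
    await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    decision = pickStart();
  }
  if (decision.kind === "wait") throw new Error("all models still cooling down");
  return decision.model;
}
```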

A few smaller things:

  • sync-v2.tsx handler for session.next.model.updated was using draft.find(...) (first assistant in the session) — changed to activeAssistant(draft) ?? [...].reverse().find(...) so it updates the current message.
  • Inline notices come in three kinds (using / switch / resume) with different colours so it's obvious whether we're starting on a fallback because primary is cooling down, switched mid-stream after an error, or returning to primary after recovery.
  • Title fallback uses agent.fallbacks only, not the top-level fallbacks. The top-level list is sized for the main chat model and is typically too expensive for the small title pass; users who want title fallbacks configure them on the title agent.
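The sync-v2.tsx change in the first bullet amounts to the following (message shapes and `activeAssistant` are hypothetical stand-ins for illustration):

```typescript
interface Msg { role: "user" | "assistant"; id: string }

// Stand-in for activeAssistant(draft): the message currently being streamed.
function activeAssistant(draft: Msg[]): Msg | undefined {
  const last = draft[draft.length - 1];
  return last?.role === "assistant" ? last : undefined;
}

function targetForModelUpdate(draft: Msg[]): Msg | undefined {
  // before: draft.find(m => m.role === "assistant") — first assistant, wrong target
  return activeAssistant(draft) ?? [...draft].reverse().find((m) => m.role === "assistant");
}
```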

How did you verify your code works?

  • Ran it locally for a few days against a GPT primary + GLM fallback. Forced primary failures by swapping in a bad API key and by overflowing context; observed correct switching, the correct model name in the message header after the switch, no toast spam on subsequent turns, and a "Switched back to …" notice once the primary recovered.

Screenshots / recordings


Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Signed-off-by: minicx <minicx@disroot.org>
@github-actions
Contributor

The following comment was generated by an LLM and may be inaccurate:

Based on my search, I found one related PR that should be noted:

Related PR:

Other potentially relevant PRs (not duplicates):

The current PR (#27939) is a continuation/improvement of #26292, not a duplicate. It's the successor that fixes issues found in production use of the original feature.

@loss-and-quick
Author

@nexxeln, can you please review?

Fallback notices (e.g. 'Using GLM-4.5-Air while ... is cooling down')
are stored with ignored: true but were not filtered out when converting
assistant parts to model messages, causing them to leak into the LLM
context and be repeated in responses.
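A minimal sketch of that filter (the part shape is hypothetical):

```typescript
interface Part { type: string; text: string; ignored?: boolean }

// Drop ignored parts before converting assistant parts to model messages,
// so fallback notices never reach the LLM context.
function toModelText(parts: Part[]): string {
  return parts
    .filter((p) => !p.ignored) // fallback notices are stored with ignored: true
    .map((p) => p.text)
    .join("\n");
}
```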


Development

Successfully merging this pull request may close these issues.

[FEATURE]: Native Model Fallback / Failover Support
