Commit 8334820

fix(core): extend resume retry to the supervisor engine hop
The supervisor-to-engine hop is the one that reaches the continue endpoint, so it is where a transient database outage surfaces as a retryable 5xx. Give its continueRunExecution the same longer, jittered retry budget as the workload client so it can ride out the outage.
1 parent 79da4f8 commit 8334820

1 file changed

Lines changed: 15 additions & 0 deletions

packages/core/src/v3/runEngineWorker/supervisor/http.ts

@@ -245,6 +245,21 @@ export class SupervisorHttpClient {
           ...this.defaultHeaders,
           ...this.runnerIdHeader(runnerId),
         },
+      },
+      {
+        // This is the hop that reaches the engine, so it's where a transient
+        // database outage during resume surfaces (as a retryable 5xx). Resuming
+        // is idempotent server-side (guarded by the snapshot id), so retry
+        // generously to ride out the outage rather than aborting the run.
+        // `randomize` jitters the delay so a fleet of runs resuming at once
+        // doesn't stampede the DB the moment it recovers.
+        retry: {
+          minTimeoutInMs: 500,
+          maxTimeoutInMs: 10_000,
+          maxAttempts: 8,
+          factor: 2,
+          randomize: true,
+        },
       }
     );
   }
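
To get a feel for how long this retry budget can ride out an outage, the sketch below computes the delay schedule for the options in the diff. It is a rough illustration, not the project's retry code: the `RetryOptions` type, `retryDelayMs`, and `retrySchedule` are hypothetical names, and the delay formula (exponential growth from `minTimeoutInMs`, multiplied by a jitter factor between 1 and 2 when `randomize` is set, then capped at `maxTimeoutInMs`) is an assumption modeled on the common node-retry convention.

```typescript
// Hypothetical shape mirroring the retry options added in the diff.
interface RetryOptions {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
}

// Values copied from the diff.
const supervisorRetry: RetryOptions = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

// ASSUMPTION: node-retry style backoff. Returns the delay to wait after the
// given 1-based attempt fails, or undefined once the attempt budget is spent.
function retryDelayMs(
  opts: RetryOptions,
  attempt: number,
  random: () => number = Math.random
): number | undefined {
  if (attempt >= opts.maxAttempts) return undefined;
  const jitter = opts.randomize ? 1 + random() : 1;
  const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
  return Math.round(Math.min(opts.maxTimeoutInMs, raw));
}

// Collects every delay between the first and the last attempt.
function retrySchedule(
  opts: RetryOptions,
  random: () => number = Math.random
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    const delay = retryDelayMs(opts, attempt, random);
    if (delay === undefined) break;
    delays.push(delay);
  }
  return delays;
}

const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Lower bound: no jitter. Upper bound: jitter pinned at its maximum factor.
const noJitter = retrySchedule({ ...supervisorRetry, randomize: false });
const maxJitter = retrySchedule(supervisorRetry, () => 1);

console.log(noJitter, sum(noJitter));
console.log(maxJitter, sum(maxJitter));
```

Under these assumptions the eight attempts are separated by seven waits totaling between 35.5 and 45 seconds, not counting the time each request itself takes, which is the window in which the database would need to recover.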
