Commit 988f53d

fix(core): jittered retry on the resume hops for transient DB outages
Restore the extended, jittered retry on the workload and supervisor continueRunExecution calls so the resume can ride out a transient database outage. Recorded via .server-changes; no package changeset.
1 parent 6c615be commit 988f53d

2 files changed

Lines changed: 29 additions & 0 deletions

packages/core/src/v3/runEngineWorker/supervisor/http.ts

Lines changed: 15 additions & 0 deletions
@@ -245,6 +245,21 @@ export class SupervisorHttpClient {
           ...this.defaultHeaders,
           ...this.runnerIdHeader(runnerId),
         },
+      },
+      {
+        // This is the hop that reaches the engine, so it's where a transient
+        // database outage during resume surfaces (as a retryable 5xx). Resuming
+        // is idempotent server-side (guarded by the snapshot id), so retry
+        // generously to ride out the outage rather than aborting the run.
+        // `randomize` jitters the delay so a fleet of runs resuming at once
+        // doesn't stampede the DB the moment it recovers.
+        retry: {
+          minTimeoutInMs: 500,
+          maxTimeoutInMs: 10_000,
+          maxAttempts: 8,
+          factor: 2,
+          randomize: true,
+        },
       }
     );
   }
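
To make the size of that retry window concrete, here is a rough sketch of the delay schedule these options might produce. It assumes a common formulation that this commit does not itself confirm: the delay grows as `minTimeoutInMs * factor^(attempt - 1)`, `randomize` multiplies it by a random value in [1, 2), and the result is capped at `maxTimeoutInMs`. The names `RetrySketchOptions`, `sketchRetryDelay`, and `sketchSchedule` are hypothetical and are not part of the package.

```typescript
// Hypothetical option shape mirroring the fields used in the diff above.
type RetrySketchOptions = {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
};

// ASSUMED formula (not taken from the library source): exponential growth
// from minTimeoutInMs, optional jitter in [1, 2), capped at maxTimeoutInMs.
// Returns undefined once no further attempt is allowed.
function sketchRetryDelay(
  opts: RetrySketchOptions,
  attempt: number,
  random: () => number = Math.random
): number | undefined {
  if (attempt >= opts.maxAttempts) return undefined;
  const jitter = opts.randomize ? 1 + random() : 1;
  const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
  return Math.round(Math.min(opts.maxTimeoutInMs, raw));
}

// Collects the wait before each retry: maxAttempts attempts means at most
// maxAttempts - 1 waits.
function sketchSchedule(
  opts: RetrySketchOptions,
  random: () => number = Math.random
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    const delay = sketchRetryDelay(opts, attempt, random);
    if (delay === undefined) break;
    delays.push(delay);
  }
  return delays;
}

const resumeRetry: RetrySketchOptions = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

const withoutJitter = sketchSchedule({ ...resumeRetry, randomize: false });
console.log(withoutJitter); // [ 500, 1000, 2000, 4000, 8000, 10000, 10000 ]
console.log(withoutJitter.reduce((sum, ms) => sum + ms, 0)); // 35500
console.log(sketchSchedule(resumeRetry)); // jittered, so it varies per run
```

Under these assumptions the eight attempts span roughly 35 seconds of waiting without jitter, and somewhat more with it, because each jittered delay is at least as long as its unjittered counterpart until the cap applies.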

packages/core/src/v3/runEngineWorker/workload/http.ts

Lines changed: 14 additions & 0 deletions
@@ -132,6 +132,20 @@ export class WorkloadHttpClient {
             headers: {
               ...this.defaultHeaders(),
             },
+          },
+          {
+            // This hop only reaches the supervisor's workload server, so retry
+            // generously with jittered backoff to ride out a transient blip
+            // talking to the supervisor (e.g. a restart) rather than aborting the
+            // run. Database outages surface one hop further in, on the
+            // supervisor-to-engine call, which carries its own retry for them.
+            retry: {
+              minTimeoutInMs: 500,
+              maxTimeoutInMs: 10_000,
+              maxAttempts: 8,
+              factor: 2,
+              randomize: true,
+            },
           }
         )
       );
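
For readers unfamiliar with the pattern, the following sketch shows one way a retry loop could treat a restarting peer: thrown network errors and 5xx responses are retried, anything else is returned immediately. It is an illustration only, not the client's actual retry machinery. `retrySketch`, `SketchResponse`, and the injectable `sleep` parameter are hypothetical names, and the loop is only safe because the operation being retried is idempotent, as the comments in the diff state for resume.

```typescript
// Hypothetical minimal response shape; the real clients return richer types.
type SketchResponse = { status: number };

// Retries `call` up to maxAttempts times. A thrown error (for example a
// refused connection while the peer restarts) or a 5xx status is treated as
// retryable; any other response is returned to the caller as-is.
async function retrySketch(
  call: () => Promise<SketchResponse>,
  maxAttempts: number,
  delayForAttempt: (attempt: number) => number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<SketchResponse> {
  let lastError: unknown = new Error("no attempts were made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await call();
      if (res.status < 500) return res; // success or non-retryable status
      lastError = new Error(`retryable status ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    // Wait only if another attempt will follow.
    if (attempt < maxAttempts) await sleep(delayForAttempt(attempt));
  }
  throw lastError;
}

// Usage: a fake peer that fails twice with 503 and then recovers. Sleeping is
// stubbed out so the demo finishes immediately.
let demoCalls = 0;
const flakyPeer = async (): Promise<SketchResponse> => {
  demoCalls += 1;
  return { status: demoCalls < 3 ? 503 : 200 };
};

retrySketch(
  flakyPeer,
  8,
  (attempt) => Math.min(10_000, 500 * Math.pow(2, attempt - 1)),
  async () => {}
).then((res) => console.log(res.status, demoCalls)); // 200 3
```

Keeping a separate retry on each hop, as this commit does, means each layer only needs to absorb the failures that originate at its own boundary.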
