fix(core): extend resume retry to the supervisor engine hop

matt-aitken · matt-aitken · commit 83348201733e · 2026-07-05T15:33:39.000+01:00
The supervisor-to-engine hop is the one that reaches the continue endpoint,
so it is where a transient database outage surfaces as a retryable 5xx. Give
its continueRunExecution call the same longer, jittered retry budget as the
workload client (up to 8 attempts, exponential backoff from 500 ms capped at
10 s) so it can ride out the outage.
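
For a rough sense of how long this budget lets a resume wait, the sketch
below computes the delay schedule for the options in this change. It is
illustrative only: `RetrySketchOptions` and `sketchRetryDelays` are
hypothetical names, and the formula (exponential growth from the minimum,
multiplied by a jitter in [1, 2) when `randomize` is set, then capped at the
maximum) is an assumption about a typical backoff implementation, not
necessarily what the client's retry helper does.

```typescript
// Hypothetical option shape mirroring the fields set in this diff.
interface RetrySketchOptions {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
}

// Assumed formula: min(max, jitter * min * factor^(attempt - 1)),
// where jitter is in [1, 2) when randomize is on and exactly 1 otherwise.
function sketchRetryDelays(
  opts: RetrySketchOptions,
  rng: () => number = Math.random
): number[] {
  const delays: number[] = [];
  // maxAttempts attempts leave maxAttempts - 1 waits between them.
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    const jitter = opts.randomize ? 1 + rng() : 1;
    const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
    delays.push(Math.round(Math.min(opts.maxTimeoutInMs, raw)));
  }
  return delays;
}

// The retry options from this change.
const resumeRetry: RetrySketchOptions = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

// Lower bound of the schedule: the same options with jitter switched off.
const floorDelays = sketchRetryDelays({ ...resumeRetry, randomize: false });
const floorTotalMs = floorDelays.reduce((sum, d) => sum + d, 0);
console.log(floorDelays, floorTotalMs);
```

Under that assumed formula the un-jittered waits are 500, 1000, 2000, 4000,
8000, 10000 and 10000 ms, about 35.5 s in total, and jitter can stretch the
total to at most about 45 s because each wait is still capped at 10 s.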
diff --git a/packages/core/src/v3/runEngineWorker/supervisor/http.ts b/packages/core/src/v3/runEngineWorker/supervisor/http.ts
@@ -245,6 +245,21 @@ export class SupervisorHttpClient {
           ...this.defaultHeaders,
           ...this.runnerIdHeader(runnerId),
         },
+      },
+      {
+        // This is the hop that reaches the engine, so it's where a transient
+        // database outage during resume surfaces (as a retryable 5xx). Resuming
+        // is idempotent server-side (guarded by the snapshot id), so retry
+        // generously to ride out the outage rather than aborting the run.
+        // `randomize` jitters the delay so a fleet of runs resuming at once
+        // doesn't stampede the DB the moment it recovers.
+        retry: {
+          minTimeoutInMs: 500,
+          maxTimeoutInMs: 10_000,
+          maxAttempts: 8,
+          factor: 2,
+          randomize: true,
+        },
       }
     );
   }