Skip to content

Mitigate safe-output PR review rate-limit failures#39797

Merged
pelikhan merged 10 commits into
mainfrom
copilot/investigate-403-rate-limit
Jun 17, 2026
Merged

Mitigate safe-output PR review rate-limit failures#39797
pelikhan merged 10 commits into
mainfrom
copilot/investigate-403-rate-limit

Conversation

Copilot AI commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Safe-output review finalization was failing with 403 API rate limit exceeded for installation after buffering review comments. The hot path was spending quota on pre/post review metadata fetches around pulls.createReview, so a brief quota stall could drop the entire review.

  • Review finalization

    • make PR review state capture best-effort for transient/rate-limit failures instead of fatal
    • preserve review submission when transient pulls.get / pulls.listReviews failures occur under rate pressure
    • treat non-transient metadata capture failures as non-retryable errors
    • only attach before_state / after_state metadata when it is actually available
  • Rate-limit recovery

    • wrap pulls.createReview in bounded retry logic for rate-limit responses
    • keep retries intentionally small for this path (one retry) with short delay and small jitter
    • honor GitHub-provided wait signals through the existing retry machinery
  • Quota reduction in degraded mode

    • defer before-state capture until after early exits (missing SHA, staged mode, empty review guard paths)
    • fetch review state in bail-fast sequence (pulls.get first, then pulls.listReviews) to avoid spending an extra call when PR fetch fails
    • skip the follow-up after-state fetch once the flow has already degraded due to metadata capture failure
    • avoid spending another API call on observability when quota is already constrained
  • Tests

    • add retry-exhaustion coverage for createReview rate-limit failures
    • add own-PR fallback coverage where COMMENT retry path is itself rate-limited and retried

Copilot AI and others added 4 commits June 17, 2026 13:45
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title Mitigate safe-output PR review rate limits Mitigate safe-output PR review rate-limit failures Jun 17, 2026
Copilot AI requested a review from pelikhan June 17, 2026 13:55
@pelikhan pelikhan marked this pull request as ready for review June 17, 2026 13:57
Copilot AI review requested due to automatic review settings June 17, 2026 13:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves reliability of safe-output PR review submission under GitHub App installation token rate limiting by making execution-state metadata capture best-effort and adding bounded retry behavior around pulls.createReview, while avoiding additional metadata calls once the flow has degraded.

Changes:

  • Add best-effort “before/after” review state capture that logs warnings and proceeds without metadata when rate-limited or otherwise failing.
  • Wrap pulls.createReview in bounded retry logic (single retry, capped delay) and reuse it across fallback review submission paths.
  • Add targeted tests for degraded-mode state capture and createReview rate-limit retry behavior.
Show a summary per file
File Description
actions/setup/js/pr_review_buffer.cjs Makes review state capture non-fatal, skips after-state fetch in degraded mode, and introduces createReview retry wrapper.
actions/setup/js/pr_review_buffer.test.cjs Adds tests covering rate-limited before-state capture and one-retry behavior for createReview.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 3

Comment on lines 274 to 276
const { repo, repoParts, pullRequestNumber, pullRequest } = reviewContext;
const beforeState = await fetchPullRequestReviewState(github, repoParts, pullRequestNumber);
const beforeState = await fetchReviewStateBestEffort(repoParts, pullRequestNumber, "before");

Comment on lines +40 to +44
const REVIEW_RATE_LIMIT_RETRY_CONFIG = {
...RATE_LIMIT_RETRY_CONFIG,
maxRetries: 1,
maxDelayMs: 60000,
};
Comment on lines +108 to +112
async function fetchReviewStateBestEffort(repoParts, pullRequestNumber, phase) {
try {
return await fetchPullRequestReviewState(github, repoParts, pullRequestNumber);
} catch (error) {
core.warning(`Failed to capture ${phase} PR review state for #${pullRequestNumber}: ${getErrorMessage(error)}. Continuing without execution-state metadata.`);
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #39797 does not have the 'implementation' label (has_implementation_label=false) and has 0 new lines of code in business logic directories (well below the 100-line threshold).

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions github-actions Bot mentioned this pull request Jun 17, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose and /tdd — good targeted fix with solid primary tests; a few coverage gaps and one config-readability note worth addressing.

📋 Key Themes & Highlights

Key Themes

  • Test coverage gaps: retry/degraded-mode tests exist only for the main submission path; the own-PR fallback, body-only fallback, and exhausted-retry paths are untested with the new behaviour
  • Config transparency: REVIEW_RATE_LIMIT_RETRY_CONFIG silently inherits initialDelayMs and jitterMs from the base config; the actual first-retry sleep (~30 s without Retry-After) is not visible at the definition site

Positive Highlights

  • ✅ Root cause correctly addressed — both the metadata-capture path and the createReview call are now rate-limit resilient
  • fetchAfterStateIfAvailable short-circuiting after-state when before-state failed is an elegant quota-conserving design
  • REVIEW_RATE_LIMIT_RETRY_CONFIG.maxRetries: 1 is the right call — no runaway waits that eat into the workflow timeout window
  • attachExecutionState's falsy-guard on beforeState/afterState means null flows through cleanly; the test assertion toBeUndefined() is correct
  • ✅ Both new tests use realistic error shapes (403 + rate-limit headers) matching what the GitHub API actually returns

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer

...RATE_LIMIT_RETRY_CONFIG,
maxRetries: 1,
maxDelayMs: 60000,
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] REVIEW_RATE_LIMIT_RETRY_CONFIG only overrides maxRetries and maxDelayMs — the inherited initialDelayMs: 15000 and jitterMs: 5000 are invisible here, but they determine the actual retry sleep duration.

💡 What this means in practice

Without a retry-after response header, the single retry will sleep:

min(initialDelayMs × backoffMultiplier, maxDelayMs) + jitter
= min(15 000 × 2, 60 000) + rand(0, 4 999)
= 30 000–34 999 ms  (~30–35 s)

With retry-after: 1 (as in the test) it drops to 1 000–5 999 ms — but a live installation-token exhaustion often does not include that header.

This is almost certainly intentional (a 30 s single-stall budget seems fine), but nothing in the block makes that visible. Consider adding comments or explicit overrides:

const REVIEW_RATE_LIMIT_RETRY_CONFIG = {
  ...RATE_LIMIT_RETRY_CONFIG,
  maxRetries: 1,
  maxDelayMs: 60_000,
  // inherits initialDelayMs:15000 → first backoff ~30 s when no Retry-After header
  // inherits jitterMs:5000 → up to 5 s spread
};

} finally {
setTimeoutSpy.mockRestore();
}
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The new retry test covers only the success-after-one-retry happy path. There is no test for the exhausted-retry case (initial call + retry both fail with rate-limit errors), which leaves the caller's error-handling path unverified.

💡 Why this matters

When maxRetries: 1 is exhausted, withRetry rethrows an enhanced error with extra metadata (retryable: true, suggestion: ...). The outer catch (error) block in submitReview calls getErrorMessage(error) and returns { success: false, error: ... }. That path looks correct, but a regression test would lock it in.

Suggested addition (sibling to this it block):

it("should return success:false when createReview rate-limit retry is exhausted", async () => {
  buffer.setReviewMetadata("Looks good", "COMMENT");
  buffer.setReviewContext({ repo: "owner/repo", repoParts: { owner: "owner", repo: "repo" }, pullRequestNumber: 42, pullRequest: { head: { sha: "abc123" } } });

  const rateLimitError = new Error("API rate limit exceeded for installation");
  rateLimitError.response = { status: 403, headers: { "x-ratelimit-remaining": "0", "retry-after": "1" } };
  // Both attempts fail
  mockGithub.rest.pulls.createReview.mockRejectedValue(rateLimitError);

  const result = await buffer.submitReview();
  expect(result.success).toBe(false);
  expect(result.error).toBeDefined();
});

const { data: review } = await github.rest.pulls.createReview(requestParams);
const { data: review } = await createReviewWithRetry(requestParams);
await maybeSupersedeOlderReviews(review.id);
const afterState = await fetchAfterStateIfAvailable();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The own-PR COMMENT fallback (here) and the body-only fallback (line ~626) both call createReviewWithRetry and fetchAfterStateIfAvailable, but neither path has a rate-limit or degraded-state test.

💡 Why this is worth covering

The main submission path now has two regression tests. The fallback paths share the same helpers so they're unlikely to break independently — but they're only reached via specific error conditions, so a rate-limit during one of them would produce an interleaved error-handling flow that hasn't been exercised.

At minimum, one test confirming that a rate-limit on createReview inside the own-PR fallback still returns { success: true } (after retry) or { success: false } (after exhaustion) would round out the story.

@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100 — Excellent

Analyzed 2 new JavaScript test(s) in actions/setup/js/pr_review_buffer.test.cjs: 2 design tests, 0 implementation tests, 0 guideline violations.

📊 Metrics & Test Classification (2 tests analyzed)
Metric Value
New/modified tests analyzed 2
✅ Design tests (behavioral contracts) 2 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 2 (100%)
Duplicate test clusters 0
Test inflation detected No (77 test lines / 47 production lines = 1.64x)
🚨 Coding-guideline violations 0
Test File Classification Issues Detected
should continue when before-state capture is rate-limited actions/setup/js/pr_review_buffer.test.cjs:258 ✅ Design
should retry createReview once on rate-limit errors actions/setup/js/pr_review_buffer.test.cjs:299 ✅ Design

Go: 0 (*_test.go); JavaScript: 2 (*.test.cjs). No other languages detected.

Verdict

Check passed. 0% implementation tests (threshold: 30%). Both new tests verify behavioral contracts — graceful degradation when execution-state capture is rate-limited, and correct retry-with-bounded-delay policy for createReview. Mock targets are all external I/O (github.rest.pulls.*, global.setTimeout) with legitimate justification.

🧪 Test quality analysis by Test Quality Sentinel ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). Both tests cover behavioral contracts (rate-limit graceful degradation and retry-with-bounded-delay policy) with full error-path coverage.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request changes — four medium-severity issues collectively work against the quota-reduction goal this PR sets out to achieve.

💡 Blocking concerns

1. fetchReviewStateBestEffort catches all exceptions (new — inline comment, line 111)

The catch block is written as a blanket exception handler. Auth failures, wrong-repo targeting, 404s, and code bugs are silently swallowed and logged as warnings. Only rate-limit / transient failures should be suppressed; everything else must propagate so real breakage is visible.

2. fetchPullRequestReviewState fires both API requests concurrently under rate pressure (line 112)

When already rate-limited, the helper still launches both pulls.get and pulls.listReviews in parallel, burning quota on the second call even though the first will fail. A sequential / bail-fast pattern would match the stated quota-reduction goal.

3. REVIEW_RATE_LIMIT_RETRY_CONFIG inherits initialDelayMs: 15000 / jitterMs: 5000 (line 44)

With one retry and no Retry-After header, backoff produces a ~30–35 s sleep. Jitter is also added after maxDelayMs is applied, so the actual sleep can exceed maxDelayMs by up to ~5 s. If the intent is a short bounded wait for quota stalls, both values should be overridden explicitly.

4. beforeState captured before early-exit guards (line 276)

fetchReviewStateBestEffort is called before the staged-mode, empty-review, and missing-SHA guards. Two API calls are spent even when the review will be skipped entirely — the opposite of the PR's quota-reduction intent.

🔎 Code quality review by PR Code Quality Reviewer

async function fetchReviewStateBestEffort(repoParts, pullRequestNumber, phase) {
try {
return await fetchPullRequestReviewState(github, repoParts, pullRequestNumber);
} catch (error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catch-all hides real failures: this catches every exception — auth errors, wrong-repo targeting, permission regressions, unexpected API contract changes — and silently treats them all as null state. The PR intent is specifically quota-stall resilience, but the implementation is much wider.

💡 Suggested fix — narrow to rate-limit / transient errors only
} catch (error) {
  // Only suppress transient / rate-limit failures; let real errors propagate.
  if (!isTransientError(error)) throw error;
  core.warning(`Failed to capture ${phase} PR review state for #${pullRequestNumber}: ${getErrorMessage(error)}. Continuing without execution-state metadata.`);
  return null;
}

Import isTransientError from error_recovery.cjs (already handles rate-limit patterns). A broken fetchPullRequestReviewState call signature, wrong credentials, or a 404 should surface as a real error — not disappear into a warning log line.

@pelikhan

Copy link
Copy Markdown
Collaborator

@copilot run pr-finisher skill

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and summarize the remaining blockers.

Generated by 👨‍🍳 PR Sous Chef ·

Copilot AI commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@copilot run pr-finisher skill

Executed the pr-finisher pass on this branch.

  • Local validation completed successfully (make fmt, make lint, make test-unit, make test, make recompile).
  • Reviewed PR status/threads and current CI snapshot.
  • No additional code changes were required in this pass.

Copilot AI and others added 6 commits June 17, 2026 15:03
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@copilot run pr-finisher skill

Completed. I ran a full pr-finisher pass and pushed the fixes in commit 1f1bedb (plus preceding related commits in this run), including local validation and final review/security validation.

@pelikhan pelikhan merged commit 6cc236a into main Jun 17, 2026
21 checks passed
@pelikhan pelikhan deleted the copilot/investigate-403-rate-limit branch June 17, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants