Fix audit comparison output for unsuccessful workflow runs by Copilot · Pull Request #43139 · github/gh-aw

Copilot · 2026-07-03T06:39:22Z

Audit reports could classify a failed workflow run as stable and recommend No action needed when its behavioral metrics matched a successful baseline. This made the comparison block contradict the run conclusion and key findings.

Comparison classification
- Feed the current run conclusion into audit comparison generation.
- Treat any non-success conclusion as a risk signal instead of allowing a stable classification.
- Preserve existing label vocabulary by classifying unsuccessful runs as risky.
Recommendation logic
- Make recommendations conclusion-aware before applying the generic stable fast path.
- Replace the no-op recommendation for failed runs with failure-specific guidance.
Regression coverage
- Add a focused unit test for the reported case: a failed run with an otherwise identical successful baseline must not be reported as stable or No action needed.

comparison := buildAuditComparison(
	"failure",
	auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
	baselineRun,
	&auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
)

// classification => risky
// recommendation => investigate failure

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot

Pull request overview

This PR updates the audit comparison logic used by the gh aw CLI so that workflow-run conclusions influence comparison classification and recommendation text, preventing failed runs from being reported as “stable / No action needed” when they match a successful baseline on behavioral metrics.

Changes:

Thread the current run’s conclusion into audit comparison generation and force unsuccessful runs to classify as risky.
Make audit comparison recommendations conclusion-aware (e.g., failure-specific guidance).
Add a regression unit test ensuring a failed run with an identical successful baseline is not reported as stable/no-op.

Show a summary per file

File	Description
pkg/cli/audit_comparison.go	Passes run conclusion into comparison, adds “run_unsuccessful” reasoning, and makes classification/recommendation conclusion-aware.
pkg/cli/audit_comparison_test.go	Updates existing tests for the new function signature and adds a regression test for failed-run handling.
.github/skills/agentic-workflow-designer/SKILL.md	Modifies workflow-designer guidance (safe-output mappings and generation template skeleton).

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

.github/skills/agentic-workflow-designer/SKILL.md:321

The generation template removed strict: true, but the repository’s canonical workflow format example includes it and the guidance explicitly says to set strict: true for production workflows (see .github/aw/github-agentic-workflows.md:23 and :53). Keeping it in the skeleton helps prevent newly generated workflows from defaulting to permissive behavior.

```markdown
---
emoji: <emoji>
description: <brief description>
on:
  <trigger config>
permissions:
  contents: read
  issues: read
  pull-requests: read
tools:

Files reviewed: 3/3 changed files
Comments generated: 2
Review effort level: Low

+	currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion))
+	currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success"


 | "start an agent session", "assign to an agent" | `create-agent-session`, `assign-to-agent` |
 | "store persistent memory comment" | `comment-memory` |
 | "link a sub-issue" | `link-sub-issue` |
 | "add labels", "remove labels" | `add-labels`, `remove-labels` |
 | "replace a label with another" | `replace-label` |
-| "report missing data needed for the task" | `missing-data` (system type, auto-enabled) |
-| "report unavailable or missing tool/permission" | `missing-tool` (system type, auto-enabled) |
-| "signal task could not be completed due to infrastructure failure" | `report-incomplete` (system type, auto-enabled) |
 | "nothing visible", "just analyze" | no safe outputs required |



github-actions · 2026-07-03T07:22:06Z

⚠️ PR Code Quality Reviewer failed during code quality review.

github-actions · 2026-07-03T07:22:12Z

✅ Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #43139 does not have the 'implementation' label and has only 38 new lines of code in business logic directories (threshold: 100).

github-actions · 2026-07-03T07:22:12Z

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions · 2026-07-03T07:22:15Z

✅ Test Quality Sentinel completed test quality analysis.

github-actions · 2026-07-03T07:31:14Z

🤖 PR Triage

Field	Value
Category	`bug`
Risk	🟢 Low
Score	58 / 100
Action	`fast_track`

Score Breakdown: Impact 25 + Urgency 20 + Quality 13

Rationale: Small surgical fix (38+/17−, 3 files) to audit comparison output for unsuccessful runs. Fixes a concrete mismatch bug in audit_comparison.go with co-located test. No CI checks yet; low blast radius.

Recommended Action: Human review → merge once CI passes.

Generated by 🔧 PR Triage Agent · 86.3 AIC · ⌖ 6.42 AIC · ⊞ 5.5K · ◷

github-actions · 2026-07-03T07:32:25Z

🧪 Test Quality Sentinel Report

✅ Test Quality Score: 93/100 — Excellent

Analyzed 4 test(s): 4 design, 0 implementation, 0 violation(s).

📊 Metrics (4 tests)

Metric	Value
Analyzed	4 (Go: 4, JS: 0)
✅ Design	4 (100%)
⚠️ Implementation	0 (0%)
Edge/error coverage	3 (75%)
Duplicate clusters	0
Inflation	No (test:prod ≈ 1.1:1)
🚨 Violations	0

Test	File	Classification	Issues
`TestBuildAuditComparison_NoBaseline`	`pkg/cli/audit_comparison_test.go:13`	design_test / high_value	None
`TestBuildAuditComparison_RiskyChange`	`pkg/cli/audit_comparison_test.go:22`	design_test / high_value	None
`TestBuildAuditComparison_StableRun`	`pkg/cli/audit_comparison_test.go:56`	design_test / low_value	Happy-path only
`TestBuildAuditComparison_FailedRunIsRisky`	`pkg/cli/audit_comparison_test.go:71`	design_test / high_value	None

Verdict

✅ Passed. 0% implementation tests (threshold: 30%). No violations.

The new TestBuildAuditComparison_FailedRunIsRisky test is the standout addition: it directly encodes the behavioral invariant that failed runs must be classified as risky (not stable), exercises both positive (risky label, run_unsuccessful reason code) and negative (No action needed absent) assertions, and precisely covers the regression that this PR fixes. The three modified tests are minor signature-update adjustments that preserve prior coverage.

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

awmgmcpg

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "awmgmcpg"

See Network Configuration for more information.

🧪 Test quality analysis by Test Quality Sentinel · 63.4 AIC · ⌖ 19.2 AIC · ⊞ 6.8K · ◷
_{Comment /review to run again}

github-actions

✅ Test Quality Sentinel: 93/100. 0% implementation tests (threshold: 30%). No violations.

github-actions

Review: Fix audit comparison output for unsuccessful workflow runs

The core bug fix is correct and well-tested. Two non-blocking maintainability issues exist in the Go code, and one blocking concern about unrelated SKILL.md deletions.

✅ What looks good

Correct fix: Passing currentRun.Run.Conclusion through to buildAuditComparison and giving currentRunUnsuccessful first priority in the label switch unambiguously resolves the reported misclassification.
Both call sites updated: Both buildAuditComparisonForProcessedRuns and buildAuditComparisonForRun received the fix (no missed call site).
Regression test: TestBuildAuditComparison_FailedRunIsRisky covers the exact reported scenario (metrics-identical failed run vs. successful baseline → risky + Investigate failure).
Existing tests preserved: All pre-existing test cases updated minimally to pass the new currentConclusion parameter.

⚠️ Issues requiring changes

run_unsuccessful appended last but evaluated first (inline comment on audit_comparison.go:59) — minor ordering inconsistency that could mislead future maintainers.
recommendAuditComparisonAction re-normalises currentConclusion independently (inline comment on audit_comparison.go:438) — logic duplication that could drift from the normalisation in buildAuditComparison.
Unexplained SKILL.md deletions (inline comment on .github/skills/agentic-workflow-designer/SKILL.md) — the diff silently removes documentation for missing-data/missing-tool/report-incomplete safe-output types and several workflow skeleton fields (strict, max-turns, max-ai-credits, etc.). These changes are unrelated to the PR goal and need explicit justification or reversion. This is the blocking item.

🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 48.2 AIC · ⌖ 6.51 AIC · ⊞ 4.8K

Comments that could not be inline-anchored

pkg/cli/audit_comparison.go:59

Append ordering of run_unsuccessful reason code is surprising.

run_unsuccessful is appended to reasonCodes after all the delta-based codes (line ~68 in the diff), but the label switch evaluates currentRunUnsuccessful first (line ~77). This is functionally correct, but the append order implies run_unsuccessful is a low-priority signal whereas it is actually the highest-priority one.

Appending it immediately after computing currentRunUnsuccessful — before the delta-based codes…

pkg/cli/audit_comparison.go:438

recommendAuditComparisonAction re-checks currentConclusion independently instead of trusting the label already computed.

By the time this function is called, label has already been set to "risky" when currentRunUnsuccessful is true. The function could simply branch on label == "risky" && currentConclusion != "success" (or accept a dedicated unsuccessful bool param) instead of repeating the normalisation logic. The current duplication means the two checks could drift if the …

.github/skills/agentic-workflow-designer/SKILL.md:13

Undocumented removal of system-type safe-output entries and several workflow skeleton fields.

This diff removes:

The missing-data, missing-tool, and report-incomplete safe-output table rows and their explanatory note.
strict: true, max-turns, max-ai-credits, max-daily-ai-credits, user-rate-limit, and models fields from the generated workflow skeleton template.

None of these changes are mentioned in the PR description, which is focused entirely on the audit comparison…

github-actions

Skills-Based Review 🧠

Applied /diagnosing-bugs and /tdd — no blocking issues; three improvement observations posted as inline comments.

📋 Key Themes & Highlights

Positive Highlights

✅ Root cause correctly addressed: buildAuditComparison now propagates currentConclusion end-to-end rather than patching the output after the fact.
✅ Regression test TestBuildAuditComparison_FailedRunIsRisky directly reproduces the reported bug scenario — well done.
✅ Label vocabulary stays consistent; "risky" is reused rather than inventing a new state.
✅ Both call sites (buildAuditComparisonForProcessedRuns and buildAuditComparisonForRun) updated, so no stale path remains.

Improvement Observations

Generic recommendation branch untested — "cancelled", "timed_out", and "startup_failure" conclusions reach a fmt.Sprintf branch with no test coverage. See inline comment on line 90 of the test file.
Duplicated guard logic — recommendAuditComparisonAction re-checks currentConclusion != "success" independently of the label = "risky" decision already made in buildAuditComparison. One authoritative path would be safer. See inline comment on line 438 of audit_comparison.go.
Empty-string conclusion is implicitly "success" — this is the right default for in-flight runs but it is undocumented and untested. A short comment or test would make it explicit. See inline comment on line 346 of audit_comparison.go.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.8 AIC · ⌖ 7.65 AIC · ⊞ 6.6K
_{Comment /matt to run again}

github-actions · 2026-07-03T07:36:13Z

+
 func TestSelectAuditComparisonBaselinePrefersCohortMatchOverRecency(t *testing.T) {
 	current := ProcessedRun{
 		Run: WorkflowRun{


[/tdd] The new test covers only the "failure" conclusion. GitHub Actions also emits "cancelled", "timed_out", and "startup_failure" — those hit the generic fmt.Sprintf branch in recommendAuditComparisonAction and the run_unsuccessful reason-code path, leaving them with zero test coverage.

💡 Suggested additional test

func TestBuildAuditComparison_CancelledRunIsRisky(t *testing.T) { baselineRun := &WorkflowRun{DatabaseID: 102, WorkflowName: "triage", Conclusion: "success", CreatedAt: time.Now().Add(-time.Hour)} comparison := buildAuditComparison( "cancelled", auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0}, baselineRun, &auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0}, ) assert.Equal(t, "risky", comparison.Classification.Label) assert.Contains(t, comparison.Classification.ReasonCodes, "run_unsuccessful") assert.NotContains(t, comparison.Recommendation.Action, "No action needed") }

@copilot please address this.

github-actions · 2026-07-03T07:36:13Z


-func recommendAuditComparisonAction(label string, delta *AuditComparisonDelta) string {
+func recommendAuditComparisonAction(label, currentConclusion string, delta *AuditComparisonDelta) string {
+	if currentConclusion != "" && currentConclusion != "success" {


[/diagnosing-bugs] recommendAuditComparisonAction duplicates the currentConclusion != "" && currentConclusion != "success" guard that already lives in the caller (buildAuditComparison sets label = "risky" for unsuccessful runs). The function now has two independent paths to the same outcome — if one diverges in a future edit, they will silently disagree.

Consider relying on the already-computed label == "risky" plus a new "run_unsuccessful" in reasonCodes check, or extract the conclusion check into the existing label dispatch at the call site so the guard exists in exactly one place.

@copilot please address this.

github-actions · 2026-07-03T07:36:13Z


 	reasonCodes := make([]string, 0, 4)
+	currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion))
+	currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success"


[/diagnosing-bugs] When currentConclusion is an empty string, currentRunUnsuccessful evaluates to false, so an in-flight run (conclusion not yet set) is silently treated as successful. This is the safe default today, but it is implicit and not tested.

Adding a short comment — // empty conclusion means the run has not yet resolved; treat as success — or an explicit test for the empty-string case would make the intent clear and guard against future changes that might accidentally treat an empty conclusion as a failure signal.

@copilot please address this.

done

Initial plan

fb63dbd

Copilot AI assigned Copilot and pelikhan Jul 3, 2026

Copilot AI linked an issue Jul 3, 2026 that may be closed by this pull request

[cli-tools-test] [audit] Comparison block shows "stable" / "No action needed" for failed workflow runs #43137

Open

Copilot started work on behalf of pelikhan July 3, 2026 06:59 View session

Copilot AI and others added 3 commits July 3, 2026 07:06

Plan audit comparison failure fix

8697468

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Fix failed audit comparison classification

0a29ecb

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Harden audit failure recommendations

b4907bf

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix misleading audit report for failed workflow runs~~ Fix audit comparison output for unsuccessful workflow runs Jul 3, 2026

Copilot AI requested a review from pelikhan July 3, 2026 07:13

Copilot finished work on behalf of pelikhan July 3, 2026 07:13

pelikhan marked this pull request as ready for review July 3, 2026 07:14

Copilot AI review requested due to automatic review settings July 3, 2026 07:14

Copilot started reviewing on behalf of pelikhan July 3, 2026 07:14 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

github-actions Bot mentioned this pull request Jul 3, 2026

[aw] No-Op Runs #39849

Open

github-actions Bot added pr-action:fast_track pr-agent:copilot-swe-agent pr-priority:medium pr-risk:low pr-type:bug labels Jul 3, 2026

github-actions Bot mentioned this pull request Jul 3, 2026

[PR Triage Report] 🤖 PR Triage Report — 2026-07-03 (Run §28644965236) #43147

Closed

github-actions Bot approved these changes Jul 3, 2026

View reviewed changes

github-actions Bot previously requested changes Jul 3, 2026

View reviewed changes

github-actions Bot reviewed Jul 3, 2026

View reviewed changes

This was referenced Jul 3, 2026

[PR Triage Report] 🤖 PR Triage Report — 2026-07-03 Run #28661615516 #43190

Closed

[PR Triage Report] PR Triage Report — 2026-07-03 (Run #28677894072) #43257

Closed

[PR Triage Report] PR Triage Report — 2026-07-04 (Run #28689979672) #43275

Open

pelikhan added 2 commits July 3, 2026 20:31

Merge branch 'main' into copilot/cli-tools-test-audit-fix

eaf28fd

Merge branch 'main' into copilot/cli-tools-test-audit-fix

9f649f0

		currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion))
		currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success"

Uh oh!

Conversation

Copilot AI commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Review details

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026

🤖 PR Triage

Uh oh!

github-actions Bot commented Jul 3, 2026

🧪 Test Quality Sentinel Report

Verdict

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Review: Fix audit comparison output for unsuccessful workflow runs

✅ What looks good

⚠️ Issues requiring changes

Comments that could not be inline-anchored

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Skills-Based Review 🧠

Positive Highlights

Improvement Observations

Uh oh!

github-actions Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading