Skip to content

Fix audit comparison output for unsuccessful workflow runs#43139

Open
pelikhan with Copilot wants to merge 6 commits into
mainfrom
copilot/cli-tools-test-audit-fix
Open

Fix audit comparison output for unsuccessful workflow runs#43139
pelikhan with Copilot wants to merge 6 commits into
mainfrom
copilot/cli-tools-test-audit-fix

Conversation

Copilot AI commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Audit reports could classify a failed workflow run as stable and recommend No action needed when its behavioral metrics matched a successful baseline. This made the comparison block contradict the run conclusion and key findings.

  • Comparison classification

    • Feed the current run conclusion into audit comparison generation.
    • Treat any non-success conclusion as a risk signal instead of allowing a stable classification.
    • Preserve existing label vocabulary by classifying unsuccessful runs as risky.
  • Recommendation logic

    • Make recommendations conclusion-aware before applying the generic stable fast path.
    • Replace the no-op recommendation for failed runs with failure-specific guidance.
  • Regression coverage

    • Add a focused unit test for the reported case: a failed run with an otherwise identical successful baseline must not be reported as stable or No action needed.
comparison := buildAuditComparison(
	"failure",
	auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
	baselineRun,
	&auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
)

// classification => risky
// recommendation => investigate failure

Copilot AI and others added 3 commits July 3, 2026 07:06
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix misleading audit report for failed workflow runs Fix audit comparison output for unsuccessful workflow runs Jul 3, 2026
Copilot AI requested a review from pelikhan July 3, 2026 07:13
@pelikhan pelikhan marked this pull request as ready for review July 3, 2026 07:14
Copilot AI review requested due to automatic review settings July 3, 2026 07:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the audit comparison logic used by the gh aw CLI so that workflow-run conclusions influence comparison classification and recommendation text, preventing failed runs from being reported as “stable / No action needed” when they match a successful baseline on behavioral metrics.

Changes:

  • Thread the current run’s conclusion into audit comparison generation and force unsuccessful runs to classify as risky.
  • Make audit comparison recommendations conclusion-aware (e.g., failure-specific guidance).
  • Add a regression unit test ensuring a failed run with an identical successful baseline is not reported as stable/no-op.
Show a summary per file
File Description
pkg/cli/audit_comparison.go Passes run conclusion into comparison, adds “run_unsuccessful” reasoning, and makes classification/recommendation conclusion-aware.
pkg/cli/audit_comparison_test.go Updates existing tests for the new function signature and adds a regression test for failed-run handling.
.github/skills/agentic-workflow-designer/SKILL.md Modifies workflow-designer guidance (safe-output mappings and generation template skeleton).

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

.github/skills/agentic-workflow-designer/SKILL.md:321

  • The generation template removed strict: true, but the repository’s canonical workflow format example includes it and the guidance explicitly says to set strict: true for production workflows (see .github/aw/github-agentic-workflows.md:23 and :53). Keeping it in the skeleton helps prevent newly generated workflows from defaulting to permissive behavior.
```markdown
---
emoji: <emoji>
description: <brief description>
on:
  <trigger config>
permissions:
  contents: read
  issues: read
  pull-requests: read
tools:
  • Files reviewed: 3/3 changed files
  • Comments generated: 2
  • Review effort level: Low

Comment on lines +345 to +346
currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion))
currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success"
Comment on lines 150 to 156
| "start an agent session", "assign to an agent" | `create-agent-session`, `assign-to-agent` |
| "store persistent memory comment" | `comment-memory` |
| "link a sub-issue" | `link-sub-issue` |
| "add labels", "remove labels" | `add-labels`, `remove-labels` |
| "replace a label with another" | `replace-label` |
| "report missing data needed for the task" | `missing-data` (system type, auto-enabled) |
| "report unavailable or missing tool/permission" | `missing-tool` (system type, auto-enabled) |
| "signal task could not be completed due to infrastructure failure" | `report-incomplete` (system type, auto-enabled) |
| "nothing visible", "just analyze" | no safe outputs required |

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

⚠️ PR Code Quality Reviewer failed during code quality review.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #43139 does not have the 'implementation' label and has only 38 new lines of code in business logic directories (threshold: 100).

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

🤖 PR Triage

Field Value
Category bug
Risk 🟢 Low
Score 58 / 100
Action fast_track

Score Breakdown: Impact 25 + Urgency 20 + Quality 13

Rationale: Small surgical fix (38+/17−, 3 files) to audit comparison output for unsuccessful runs. Fixes a concrete mismatch bug in audit_comparison.go with co-located test. No CI checks yet; low blast radius.

Recommended Action: Human review → merge once CI passes.

Generated by 🔧 PR Triage Agent · 86.3 AIC · ⌖ 6.42 AIC · ⊞ 5.5K ·

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 93/100 — Excellent

Analyzed 4 test(s): 4 design, 0 implementation, 0 violation(s).

📊 Metrics (4 tests)
Metric Value
Analyzed 4 (Go: 4, JS: 0)
✅ Design 4 (100%)
⚠️ Implementation 0 (0%)
Edge/error coverage 3 (75%)
Duplicate clusters 0
Inflation No (test:prod ≈ 1.1:1)
🚨 Violations 0
Test File Classification Issues
TestBuildAuditComparison_NoBaseline pkg/cli/audit_comparison_test.go:13 design_test / high_value None
TestBuildAuditComparison_RiskyChange pkg/cli/audit_comparison_test.go:22 design_test / high_value None
TestBuildAuditComparison_StableRun pkg/cli/audit_comparison_test.go:56 design_test / low_value Happy-path only
TestBuildAuditComparison_FailedRunIsRisky pkg/cli/audit_comparison_test.go:71 design_test / high_value None

Verdict

Passed. 0% implementation tests (threshold: 30%). No violations.

The new TestBuildAuditComparison_FailedRunIsRisky test is the standout addition: it directly encodes the behavioral invariant that failed runs must be classified as risky (not stable), exercises both positive (risky label, run_unsuccessful reason code) and negative (No action needed absent) assertions, and precisely covers the regression that this PR fixes. The three modified tests are minor signature-update adjustments that preserve prior coverage.

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • awmgmcpg

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "awmgmcpg"

See Network Configuration for more information.

🧪 Test quality analysis by Test Quality Sentinel · 63.4 AIC · ⌖ 19.2 AIC · ⊞ 6.8K ·
Comment /review to run again

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 93/100. 0% implementation tests (threshold: 30%). No violations.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix audit comparison output for unsuccessful workflow runs

The core bug fix is correct and well-tested. Two non-blocking maintainability issues exist in the Go code, and one blocking concern about unrelated SKILL.md deletions.

✅ What looks good

  • Correct fix: Passing currentRun.Run.Conclusion through to buildAuditComparison and giving currentRunUnsuccessful first priority in the label switch unambiguously resolves the reported misclassification.
  • Both call sites updated: Both buildAuditComparisonForProcessedRuns and buildAuditComparisonForRun received the fix (no missed call site).
  • Regression test: TestBuildAuditComparison_FailedRunIsRisky covers the exact reported scenario (metrics-identical failed run vs. successful baseline → risky + Investigate failure).
  • Existing tests preserved: All pre-existing test cases updated minimally to pass the new currentConclusion parameter.

⚠️ Issues requiring changes

  1. run_unsuccessful appended last but evaluated first (inline comment on audit_comparison.go:59) — minor ordering inconsistency that could mislead future maintainers.
  2. recommendAuditComparisonAction re-normalises currentConclusion independently (inline comment on audit_comparison.go:438) — logic duplication that could drift from the normalisation in buildAuditComparison.
  3. Unexplained SKILL.md deletions (inline comment on .github/skills/agentic-workflow-designer/SKILL.md) — the diff silently removes documentation for missing-data/missing-tool/report-incomplete safe-output types and several workflow skeleton fields (strict, max-turns, max-ai-credits, etc.). These changes are unrelated to the PR goal and need explicit justification or reversion. This is the blocking item.

🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 48.2 AIC · ⌖ 6.51 AIC · ⊞ 4.8K

Comments that could not be inline-anchored

pkg/cli/audit_comparison.go:59

Append ordering of run_unsuccessful reason code is surprising.

run_unsuccessful is appended to reasonCodes after all the delta-based codes (line ~68 in the diff), but the label switch evaluates currentRunUnsuccessful first (line ~77). This is functionally correct, but the append order implies run_unsuccessful is a low-priority signal whereas it is actually the highest-priority one.

Appending it immediately after computing currentRunUnsuccessful — before the delta-based codes…

pkg/cli/audit_comparison.go:438

recommendAuditComparisonAction re-checks currentConclusion independently instead of trusting the label already computed.

By the time this function is called, label has already been set to &quot;risky&quot; when currentRunUnsuccessful is true. The function could simply branch on label == &quot;risky&quot; &amp;&amp; currentConclusion != &quot;success&quot; (or accept a dedicated unsuccessful bool param) instead of repeating the normalisation logic. The current duplication means the two checks could drift if the …

.github/skills/agentic-workflow-designer/SKILL.md:13

Undocumented removal of system-type safe-output entries and several workflow skeleton fields.

This diff removes:

  • The missing-data, missing-tool, and report-incomplete safe-output table rows and their explanatory note.
  • strict: true, max-turns, max-ai-credits, max-daily-ai-credits, user-rate-limit, and models fields from the generated workflow skeleton template.

None of these changes are mentioned in the PR description, which is focused entirely on the audit comparison…

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnosing-bugs and /tdd — no blocking issues; three improvement observations posted as inline comments.

📋 Key Themes & Highlights

Positive Highlights

  • ✅ Root cause correctly addressed: buildAuditComparison now propagates currentConclusion end-to-end rather than patching the output after the fact.
  • ✅ Regression test TestBuildAuditComparison_FailedRunIsRisky directly reproduces the reported bug scenario — well done.
  • ✅ Label vocabulary stays consistent; "risky" is reused rather than inventing a new state.
  • ✅ Both call sites (buildAuditComparisonForProcessedRuns and buildAuditComparisonForRun) updated, so no stale path remains.

Improvement Observations

  1. Generic recommendation branch untested"cancelled", "timed_out", and "startup_failure" conclusions reach a fmt.Sprintf branch with no test coverage. See inline comment on line 90 of the test file.
  2. Duplicated guard logicrecommendAuditComparisonAction re-checks currentConclusion != "success" independently of the label = "risky" decision already made in buildAuditComparison. One authoritative path would be safer. See inline comment on line 438 of audit_comparison.go.
  3. Empty-string conclusion is implicitly "success" — this is the right default for in-flight runs but it is undocumented and untested. A short comment or test would make it explicit. See inline comment on line 346 of audit_comparison.go.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.8 AIC · ⌖ 7.65 AIC · ⊞ 6.6K
Comment /matt to run again


func TestSelectAuditComparisonBaselinePrefersCohortMatchOverRecency(t *testing.T) {
current := ProcessedRun{
Run: WorkflowRun{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The new test covers only the "failure" conclusion. GitHub Actions also emits "cancelled", "timed_out", and "startup_failure" — those hit the generic fmt.Sprintf branch in recommendAuditComparisonAction and the run_unsuccessful reason-code path, leaving them with zero test coverage.

💡 Suggested additional test
func TestBuildAuditComparison_CancelledRunIsRisky(t *testing.T) {
	baselineRun := &WorkflowRun{DatabaseID: 102, WorkflowName: "triage", Conclusion: "success", CreatedAt: time.Now().Add(-time.Hour)}
	comparison := buildAuditComparison(
		"cancelled",
		auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
		baselineRun,
		&auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
	)

	assert.Equal(t, "risky", comparison.Classification.Label)
	assert.Contains(t, comparison.Classification.ReasonCodes, "run_unsuccessful")
	assert.NotContains(t, comparison.Recommendation.Action, "No action needed")
}

@copilot please address this.


func recommendAuditComparisonAction(label string, delta *AuditComparisonDelta) string {
func recommendAuditComparisonAction(label, currentConclusion string, delta *AuditComparisonDelta) string {
if currentConclusion != "" && currentConclusion != "success" {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnosing-bugs] recommendAuditComparisonAction duplicates the currentConclusion != "" && currentConclusion != "success" guard that already lives in the caller (buildAuditComparison sets label = "risky" for unsuccessful runs). The function now has two independent paths to the same outcome — if one diverges in a future edit, they will silently disagree.

Consider relying on the already-computed label == "risky" plus a new "run_unsuccessful" in reasonCodes check, or extract the conclusion check into the existing label dispatch at the call site so the guard exists in exactly one place.

@copilot please address this.


reasonCodes := make([]string, 0, 4)
currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion))
currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnosing-bugs] When currentConclusion is an empty string, currentRunUnsuccessful evaluates to false, so an in-flight run (conclusion not yet set) is silently treated as successful. This is the safe default today, but it is implicit and not tested.

Adding a short comment — // empty conclusion means the run has not yet resolved; treat as success — or an explicit test for the empty-string case would make the intent clear and guard against future changes that might accidentally treat an empty conclusion as a failure signal.

@copilot please address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[cli-tools-test] [audit] Comparison block shows "stable" / "No action needed" for failed workflow runs

3 participants