Fix audit comparison output for unsuccessful workflow runs#43139
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the audit comparison logic used by the gh aw CLI so that workflow-run conclusions influence comparison classification and recommendation text, preventing failed runs from being reported as “stable / No action needed” when they match a successful baseline on behavioral metrics.
Changes:
- Thread the current run’s
conclusioninto audit comparison generation and force unsuccessful runs to classify asrisky. - Make audit comparison recommendations conclusion-aware (e.g., failure-specific guidance).
- Add a regression unit test ensuring a failed run with an identical successful baseline is not reported as stable/no-op.
Show a summary per file
| File | Description |
|---|---|
| pkg/cli/audit_comparison.go | Passes run conclusion into comparison, adds “run_unsuccessful” reasoning, and makes classification/recommendation conclusion-aware. |
| pkg/cli/audit_comparison_test.go | Updates existing tests for the new function signature and adds a regression test for failed-run handling. |
| .github/skills/agentic-workflow-designer/SKILL.md | Modifies workflow-designer guidance (safe-output mappings and generation template skeleton). |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (1)
.github/skills/agentic-workflow-designer/SKILL.md:321
- The generation template removed
strict: true, but the repository’s canonical workflow format example includes it and the guidance explicitly says to setstrict: truefor production workflows (see.github/aw/github-agentic-workflows.md:23and :53). Keeping it in the skeleton helps prevent newly generated workflows from defaulting to permissive behavior.
```markdown
---
emoji: <emoji>
description: <brief description>
on:
<trigger config>
permissions:
contents: read
issues: read
pull-requests: read
tools:
- Files reviewed: 3/3 changed files
- Comments generated: 2
- Review effort level: Low
| currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion)) | ||
| currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success" |
| | "start an agent session", "assign to an agent" | `create-agent-session`, `assign-to-agent` | | ||
| | "store persistent memory comment" | `comment-memory` | | ||
| | "link a sub-issue" | `link-sub-issue` | | ||
| | "add labels", "remove labels" | `add-labels`, `remove-labels` | | ||
| | "replace a label with another" | `replace-label` | | ||
| | "report missing data needed for the task" | `missing-data` (system type, auto-enabled) | | ||
| | "report unavailable or missing tool/permission" | `missing-tool` (system type, auto-enabled) | | ||
| | "signal task could not be completed due to infrastructure failure" | `report-incomplete` (system type, auto-enabled) | | ||
| | "nothing visible", "just analyze" | no safe outputs required | | ||
|
|
|
|
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #43139 does not have the 'implementation' label and has only 38 new lines of code in business logic directories (threshold: 100). |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Test Quality Sentinel completed test quality analysis. |
🤖 PR Triage
Score Breakdown: Impact 25 + Urgency 20 + Quality 13 Rationale: Small surgical fix (38+/17−, 3 files) to audit comparison output for unsuccessful runs. Fixes a concrete mismatch bug in Recommended Action: Human review → merge once CI passes.
|
🧪 Test Quality Sentinel Report✅ Test Quality Score: 93/100 — Excellent
📊 Metrics (4 tests)
Verdict
The new Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "awmgmcpg"See Network Configuration for more information.
|
There was a problem hiding this comment.
Review: Fix audit comparison output for unsuccessful workflow runs
The core bug fix is correct and well-tested. Two non-blocking maintainability issues exist in the Go code, and one blocking concern about unrelated SKILL.md deletions.
✅ What looks good
- Correct fix: Passing
currentRun.Run.Conclusionthrough tobuildAuditComparisonand givingcurrentRunUnsuccessfulfirst priority in the label switch unambiguously resolves the reported misclassification. - Both call sites updated: Both
buildAuditComparisonForProcessedRunsandbuildAuditComparisonForRunreceived the fix (no missed call site). - Regression test:
TestBuildAuditComparison_FailedRunIsRiskycovers the exact reported scenario (metrics-identical failed run vs. successful baseline →risky+Investigate failure). - Existing tests preserved: All pre-existing test cases updated minimally to pass the new
currentConclusionparameter.
⚠️ Issues requiring changes
run_unsuccessfulappended last but evaluated first (inline comment onaudit_comparison.go:59) — minor ordering inconsistency that could mislead future maintainers.recommendAuditComparisonActionre-normalisescurrentConclusionindependently (inline comment onaudit_comparison.go:438) — logic duplication that could drift from the normalisation inbuildAuditComparison.- Unexplained SKILL.md deletions (inline comment on
.github/skills/agentic-workflow-designer/SKILL.md) — the diff silently removes documentation formissing-data/missing-tool/report-incompletesafe-output types and several workflow skeleton fields (strict,max-turns,max-ai-credits, etc.). These changes are unrelated to the PR goal and need explicit justification or reversion. This is the blocking item.
🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 48.2 AIC · ⌖ 6.51 AIC · ⊞ 4.8K
Comments that could not be inline-anchored
pkg/cli/audit_comparison.go:59
Append ordering of run_unsuccessful reason code is surprising.
run_unsuccessful is appended to reasonCodes after all the delta-based codes (line ~68 in the diff), but the label switch evaluates currentRunUnsuccessful first (line ~77). This is functionally correct, but the append order implies run_unsuccessful is a low-priority signal whereas it is actually the highest-priority one.
Appending it immediately after computing currentRunUnsuccessful — before the delta-based codes…
pkg/cli/audit_comparison.go:438
recommendAuditComparisonAction re-checks currentConclusion independently instead of trusting the label already computed.
By the time this function is called, label has already been set to "risky" when currentRunUnsuccessful is true. The function could simply branch on label == "risky" && currentConclusion != "success" (or accept a dedicated unsuccessful bool param) instead of repeating the normalisation logic. The current duplication means the two checks could drift if the …
.github/skills/agentic-workflow-designer/SKILL.md:13
Undocumented removal of system-type safe-output entries and several workflow skeleton fields.
This diff removes:
- The
missing-data,missing-tool, andreport-incompletesafe-output table rows and their explanatory note. strict: true,max-turns,max-ai-credits,max-daily-ai-credits,user-rate-limit, andmodelsfields from the generated workflow skeleton template.
None of these changes are mentioned in the PR description, which is focused entirely on the audit comparison…
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnosing-bugs and /tdd — no blocking issues; three improvement observations posted as inline comments.
📋 Key Themes & Highlights
Positive Highlights
- ✅ Root cause correctly addressed:
buildAuditComparisonnow propagatescurrentConclusionend-to-end rather than patching the output after the fact. - ✅ Regression test
TestBuildAuditComparison_FailedRunIsRiskydirectly reproduces the reported bug scenario — well done. - ✅ Label vocabulary stays consistent;
"risky"is reused rather than inventing a new state. - ✅ Both call sites (
buildAuditComparisonForProcessedRunsandbuildAuditComparisonForRun) updated, so no stale path remains.
Improvement Observations
- Generic recommendation branch untested —
"cancelled","timed_out", and"startup_failure"conclusions reach afmt.Sprintfbranch with no test coverage. See inline comment on line 90 of the test file. - Duplicated guard logic —
recommendAuditComparisonActionre-checkscurrentConclusion != "success"independently of thelabel = "risky"decision already made inbuildAuditComparison. One authoritative path would be safer. See inline comment on line 438 ofaudit_comparison.go. - Empty-string conclusion is implicitly "success" — this is the right default for in-flight runs but it is undocumented and untested. A short comment or test would make it explicit. See inline comment on line 346 of
audit_comparison.go.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.8 AIC · ⌖ 7.65 AIC · ⊞ 6.6K
Comment /matt to run again
|
|
||
| func TestSelectAuditComparisonBaselinePrefersCohortMatchOverRecency(t *testing.T) { | ||
| current := ProcessedRun{ | ||
| Run: WorkflowRun{ |
There was a problem hiding this comment.
[/tdd] The new test covers only the "failure" conclusion. GitHub Actions also emits "cancelled", "timed_out", and "startup_failure" — those hit the generic fmt.Sprintf branch in recommendAuditComparisonAction and the run_unsuccessful reason-code path, leaving them with zero test coverage.
💡 Suggested additional test
func TestBuildAuditComparison_CancelledRunIsRisky(t *testing.T) {
baselineRun := &WorkflowRun{DatabaseID: 102, WorkflowName: "triage", Conclusion: "success", CreatedAt: time.Now().Add(-time.Hour)}
comparison := buildAuditComparison(
"cancelled",
auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
baselineRun,
&auditComparisonSnapshot{Turns: 4, Posture: "read_only", BlockedRequests: 0},
)
assert.Equal(t, "risky", comparison.Classification.Label)
assert.Contains(t, comparison.Classification.ReasonCodes, "run_unsuccessful")
assert.NotContains(t, comparison.Recommendation.Action, "No action needed")
}@copilot please address this.
|
|
||
| func recommendAuditComparisonAction(label string, delta *AuditComparisonDelta) string { | ||
| func recommendAuditComparisonAction(label, currentConclusion string, delta *AuditComparisonDelta) string { | ||
| if currentConclusion != "" && currentConclusion != "success" { |
There was a problem hiding this comment.
[/diagnosing-bugs] recommendAuditComparisonAction duplicates the currentConclusion != "" && currentConclusion != "success" guard that already lives in the caller (buildAuditComparison sets label = "risky" for unsuccessful runs). The function now has two independent paths to the same outcome — if one diverges in a future edit, they will silently disagree.
Consider relying on the already-computed label == "risky" plus a new "run_unsuccessful" in reasonCodes check, or extract the conclusion check into the existing label dispatch at the call site so the guard exists in exactly one place.
@copilot please address this.
|
|
||
| reasonCodes := make([]string, 0, 4) | ||
| currentConclusion = strings.TrimSpace(strings.ToLower(currentConclusion)) | ||
| currentRunUnsuccessful := currentConclusion != "" && currentConclusion != "success" |
There was a problem hiding this comment.
[/diagnosing-bugs] When currentConclusion is an empty string, currentRunUnsuccessful evaluates to false, so an in-flight run (conclusion not yet set) is silently treated as successful. This is the safe default today, but it is implicit and not tested.
Adding a short comment — // empty conclusion means the run has not yet resolved; treat as success — or an explicit test for the empty-string case would make the intent clear and guard against future changes that might accidentally treat an empty conclusion as a failure signal.
@copilot please address this.
Audit reports could classify a failed workflow run as
stableand recommendNo action neededwhen its behavioral metrics matched a successful baseline. This made the comparison block contradict the run conclusion and key findings.Comparison classification
successconclusion as a risk signal instead of allowing astableclassification.risky.Recommendation logic
stablefast path.Regression coverage
stableorNo action needed.