Skip to content

Inspect all completed forecast runs in gh aw forecast#39102

Merged
pelikhan merged 1 commit into
mainfrom
copilot/review-forecast-command
Jun 13, 2026
Merged

Inspect all completed forecast runs in gh aw forecast#39102
pelikhan merged 1 commit into
mainfrom
copilot/review-forecast-command

Conversation

Copilot AI commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

gh aw forecast was sampling only successful runs, which undercounted AI credit usage for workflows where failed jobs still consumed credits. This change broadens sampling to all completed runs while continuing to exclude skipped runs from forecast inputs.

  • Behavior change

    • Request completed workflow runs instead of success runs when building forecast samples.
    • Apply the same completed-run filter in evaluation/backtesting so forecast inputs and validation observe the same run population.
  • Skipped-run handling

    • Preserve the existing guard that excludes runs with conclusion == "skipped" from sampling and metric computation.
    • This keeps no-op executions out of AIC and success-rate estimates.
  • Test coverage

    • Update the forecast unit test to assert that the command now queries completed runs rather than only successful runs.
opts := ListWorkflowRunsOptions{
    WorkflowName: apiName,
    Status:       "completed",
    StartDate:    startDate,
    Limit:        config.SampleSize,
}

func isCompletedNonSkippedRun(r WorkflowRun) bool {
    return r.Status == "completed" && r.Conclusion != "skipped"
}

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title Inspect completed runs in forecast command Inspect all completed forecast runs in gh aw forecast Jun 13, 2026
Copilot AI requested a review from pelikhan June 13, 2026 15:46
@pelikhan pelikhan marked this pull request as ready for review June 13, 2026 15:59
Copilot AI review requested due to automatic review settings June 13, 2026 15:59
@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel completed test quality analysis.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request updates gh aw forecast to sample all completed workflow runs (instead of only successful ones) so AI credit usage estimates include runs that consume credits even when jobs fail, while still excluding skipped runs from sampling and evaluation.

Changes:

  • Update forecast sampling to request workflow runs with Status: "completed" instead of Status: "success".
  • Apply the same completed-run query behavior in forecast evaluation/backtesting to keep training and validation populations aligned.
  • Update the unit test to assert the new "completed" status query.
Show a summary per file
File Description
pkg/cli/forecast.go Switches workflow run listing filters from success to completed for both forecasting and evaluation.
pkg/cli/forecast_test.go Renames/updates the test to verify the CLI now requests completed runs.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #39102 does not have the implementation label (has_implementation_label=false) and has only 4 new lines in business logic directories, well below the 100-line threshold (requires_adr_by_default_volume=false).

@pelikhan pelikhan merged commit 3fa56b1 into main Jun 13, 2026
68 of 80 checks passed
@pelikhan pelikhan deleted the copilot/review-forecast-command branch June 13, 2026 16:03
@github-actions github-actions Bot mentioned this pull request Jun 13, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose and /tdd — overall the fix is correct and well-scoped; two suggestions to strengthen it.

📋 Key Themes & Highlights

Key Themes

  • Test coverage gap: the evaluateForecast backtesting path uses listWorkflowRunsWithPagination directly (not the injectable forecastListWorkflowRunsPaginated), so the "success" → "completed" fix there has no unit test regression guard.
  • Testability asymmetry: forecastWorkflow and evaluateForecast follow different injection patterns, making symmetric test coverage hard without a small refactor.

Positive Highlights

  • ✅ Root cause correctly identified and fixed in both paths (forecastWorkflow and evaluateForecast), keeping sampling and backtesting consistent.
  • ✅ Existing isCompletedNonSkippedRun guard preserved — skipped runs remain excluded from AIC and success-rate estimates.
  • ✅ Test renamed and assertion updated to match the new behaviour — clean, minimal diff.
  • ✅ PR description is precise about what changed and why.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 226.1 AIC · ⌖ 13.5 AIC · ⊞ 27.9K

Comment thread pkg/cli/forecast.go
opts := ListWorkflowRunsOptions{
WorkflowName: apiName,
Status: "success",
Status: "completed",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] This fix is correct, but the evaluateForecast path has no unit test regression coverage.

Unlike forecastWorkflow, which calls listRunsWithBackoffforecastListWorkflowRunsPaginated (the injectable var at line 50), evaluateForecast calls listWorkflowRunsWithPagination directly on line 1126. That direct call means no mock can capture capturedOpts.Status here, so the "success" → "completed" fix in this function cannot be pinned with a unit test today.

💡 Suggestion: introduce an injectable variable for `evaluateForecast`

Add a package-level var mirroring the existing pattern:

// near top of file, next to forecastListWorkflowRunsPaginated
var forecastEvalListWorkflowRuns = listWorkflowRunsWithPagination

Then replace the direct call on line 1126:

// before
runs, _, err := listWorkflowRunsWithPagination(opts)
// after
runs, _, err := forecastEvalListWorkflowRuns(opts)

This lets a test capture opts.Status on the validation path, closing the coverage gap symmetrically with TestForecastWorkflow_RequestsCompletedRuns.

Comment thread pkg/cli/forecast_test.go
}, 30)
require.NoError(t, err)
assert.Equal(t, "success", capturedOpts.Status)
assert.Equal(t, "completed", capturedOpts.Status)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] This test covers the forecastWorkflow sampling path — consider adding a parallel test for evaluateForecast to lock in the fix on the backtesting path too.

💡 Sketch of a parallel test

Once forecastEvalListWorkflowRuns (or equivalent) is made injectable:

func TestEvaluateForecast_RequestsCompletedRuns(t *testing.T) {
    original := forecastEvalListWorkflowRuns
    t.Cleanup(func() { forecastEvalListWorkflowRuns = original })

    var capturedOpts ListWorkflowRunsOptions
    forecastEvalListWorkflowRuns = func(opts ListWorkflowRunsOptions) ([]WorkflowRun, int, error) {
        capturedOpts = opts
        return nil, 0, nil
    }

    evaluateForecast(context.Background(), "smoke-copilot", ForecastWorkflowResult{}, "2026-01-01", "2026-02-01", ForecastConfig{})
    assert.Equal(t, "completed", capturedOpts.Status)
}

This mirrors the existing TestForecastWorkflow_RequestsCompletedRuns and ensures both sampling paths stay in sync.

@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

⚠️ Test Quality Score: 70/100 — Acceptable, with suggestions

Analyzed 1 test (modified): 1 design test, 0 implementation tests, 0 guideline violations. The test correctly encodes the behavioral contract of the "completed" status filter change.

📊 Metrics & Test Classification (1 test analyzed)
Metric Value
New/modified tests analyzed 1
✅ Design tests (behavioral contracts) 1 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 0 (0%)
Duplicate test clusters 0
Test inflation detected No
🚨 Coding-guideline violations 0

Test Classification Details

Test File Classification Issues Detected
TestForecastWorkflow_RequestsCompletedRuns pkg/cli/forecast_test.go:353 ✅ Design Missing assertion messages; no error-path coverage

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 1 test — unit (//go:build !integration)
⚠️ Flagged Tests — Requires Review (1 issue)

⚠️ TestForecastWorkflow_RequestsCompletedRuns (pkg/cli/forecast_test.go:353)

Classification: Design test ✅ (but with quality gaps)

Issue 1 — Missing assertion messages: Both assertions lack a descriptive message argument:

require.NoError(t, err)                             // no message
assert.Equal(t, "completed", capturedOpts.Status)  // no message

The project guideline requires a message on every assertion call so failures are self-describing. Suggested:

require.NoError(t, err, "forecastWorkflow should not return an error")
assert.Equal(t, "completed", capturedOpts.Status, "forecastWorkflow must request completed runs, not just successful ones")

Issue 2 — Happy-path only: The test verifies the status filter is set correctly when the call succeeds, but does not test the error path (e.g. what happens when forecastListWorkflowRunsPaginated returns an error). A complementary table-driven test would increase confidence.

What design invariant does this test enforce? It verifies that forecastWorkflow requests "completed" runs from the GitHub API — the behavioral requirement that both successful and failed/cancelled runs are sampled for forecasting purposes.

What would break if deleted? A future refactor reverting Status to "success" would go undetected, silently excluding failed and cancelled runs from the forecast sample and producing skewed results.

Verdict

Check passed. 0% of new/modified tests are implementation tests (threshold: 30%). No coding-guideline violations detected.

The edit is a focused correctness fix — the renamed test and updated assertion accurately reflect the new behavioral contract. The quality score is 70 rather than higher only because the test covers the happy path exclusively and is missing assertion messages.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §27471709397

🧪 Test quality analysis by Test Quality Sentinel · 177.3 AIC · ⌖ 27.5 AIC · ⊞ 27.3K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 70/100. Test quality is acceptable — 0% of new/modified tests are implementation tests (threshold: 30%).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants