Skip to content

feat: add NATS benchmarking support#26393

Open
sreya wants to merge 17 commits into
mainfrom
nats-benchmarking
Open

feat: add NATS benchmarking support#26393
sreya wants to merge 17 commits into
mainfrom
nats-benchmarking

Conversation

@sreya

@sreya sreya commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Just an FYI I let the AI run with this one a little bit more than previous PRs related to production code since this is only for internal use.

Here's a comparison of the prototype vs the current implementation (prior to message queue, we should rerun to make sure we didn't seriously regress for some reason due to that)

High Volume Single Subject

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
100P x 10S x 8KB 397,993 529,403 3,415,410 4,516,496
100P x 10S x 64KB 51,683 55,435 516,818 554,347
10P x 100S x 8KB 353,889 441,064 6,745,004 5,478,560
10P x 100S x 64KB 61,924 65,222 4,263,735 5,794,167

High Volume Multi Subject (10 Subjects)

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
100P x 10S x 8KB 614,206 647,814 560,404 578,822
100P x 10S x 64KB 61,162 69,478 61,161 69,478
10P x 100S x 8KB 613,889 658,842 5,354,673 5,722,741
10P x 100S x 64KB 76,577 87,521 765,762 875,149

High Cardinality Publish Fan In

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
100P x 100S x 8KB 610,315 728,160 559,369 628,184
100P x 100S x 64KB 57,750 65,128 57,747 65,122

High Cardinality Fanout

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
300P x 100S x 8KB 575,351 664,537 522,973 599,190
300P x 100S x 64KB 46,218 59,080 46,217 59,080

High Cardinality Fanout

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
100P x 300S x 8KB 605,637 676,330 1,609,660 1,801,120
100P x 300S x 64KB 51,456 65,459 154,358 196,373

Global Broadcast

Row Messages Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
10P x 10S x 8KB 100,000 70,358 73,329 679,927 704,078
10P x 10S x 64KB 20,000 9,053 8,618 79,210 73,614

Global Broadcast Subscriber Fanout

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
10P x 100S x 8KB 67,993 71,440 6,467,409 6,676,229
10P x 100S x 64KB 8,568 8,466 766,898 737,429

Sharded Broadcast, 10 Subjects

Row Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
10P x 100S x 8KB 688,390 804,489 5,548,640 6,615,891
10P x 100S x 64KB 40,044 44,122 400,347 441,221

Sharded High-Cardinality Thin Fanout

Row Messages Old Pubs/sec New Pubs/sec Old Dels/sec New Dels/sec
100P x 100S x 8KB 1,000,000 759,281 800,269 625,637 763,617
100P x 100S x 64KB 200,000 39,539 45,424 39,535 45,424

sreya added 15 commits June 11, 2026 22:03
Add an importable benchmark library for the NATS-backed pubsub that
measures Pubs/sec and Deliveries/sec under high fan-out load across
configurable subjects, payload sizes, publishers, subscribers, and
replica counts.

- Deterministic plan maps publishers/subscribers to subjects and
  replicas and precomputes exact per-subscriber delivery counts.
- Probe-based readiness gate proves cross-route subscription interest
  has propagated before the measured phase, since routed drops are
  silent.
- Workload-derived sizing for listener queues and server max pending
  prevents slow-consumer drops; any drop signal invalidates the run.
- Bounded phases fail with shortfall, server-stats, and goroutine-dump
  diagnostics instead of hanging.
- TestBenchMatrix (gated behind CODER_TEST_NATS_BENCH=1) runs the
  8 KiB / 64 KiB x 1/5/10 replica matrix and renders grouped markdown
  tables; invalid runs never report a throughput number.
…ndings

Address code review findings:

- Derive MaxPending from the per-node sum of subject volumes with local
  subscribers, since MaxPending is a per-connection budget and one
  subscribe connection carries every coalesced subscription on its
  node. The previous per-subscriber derivation undersized multi-subject
  nodes.
- Derive the per-subscription pending byte limit (new LocalQueueBytes
  knob) alongside the message limit; previously the 512 MiB default
  could trip before the derived message limit.
- Pad message-count budgets with probe headroom so in-flight readiness
  probes cannot consume capacity sized for the benchmark burst.
- Warn when the derived local queue hits its cap and can no longer
  guarantee a drop-free run.
- Return partial Results on publish and flush errors for diagnostics,
  matching the documented Run contract.
- Register subscriber cleanup before subscribing so partial subscribe
  failures are cleaned up by the workload itself.
- Remove a no-op subscriber-node flush whose comment misattributed the
  interest guarantee; SubscribeWithErr flushes the SUB itself.
- Record effective (overridden) configs in matrix report rows.
- Replace the env-gated TestBenchMatrix test with a cmd/natsbench CLI:
  no flags runs the default matrix, -scenario runs one named scenario,
  and shape flags (-payload/-subjects/-publishers/-subscribers/
  -replicas) run a custom configuration. Markdown goes to stdout, logs
  to stderr, and a failed run exits nonzero. The report-to-file env var
  is gone; redirect stdout instead.
- Remove Config.withDefaults: Run now requires a fully populated
  config and validates that Timeout is positive. The CLI defaults the
  timeout to 2 minutes.
- Collapse the readiness gate's two plan inversions into a single
  subjectNodes mapping that serves as both the probe schedule and each
  subscriber's required probe set.
- Document why startPublishers parks on a barrier and when the
  zero-expectation pre-close of allDone applies.
- Drop digit-separator underscores from numeric literals.
- Encode readiness probes as a sentinel byte plus the decimal node
  index instead of a BigEndian uint64, dropping the encoding/binary and
  math dependencies and the overflow guard.
- Return publisher errors over a buffered, closed-on-completion channel
  instead of writing into a shared slice, removing any question of a
  data race on the error collection.
- Move the CLI driver into the natsbench package as the exported Main
  plus a testable cliRun.scenarios; cmd/natsbench is now a thin
  entrypoint. Adds unit coverage for scenario selection.
- Expand the plan doc comment with a concrete worked example of the
  publisher/subscriber to subject/node assignment and expected counts.
Replace the 0x5b sentinel byte with a 'natsbench-probe:' string prefix.
Both distinguish probes from the all-zero benchmark payloads equally
well, but the prefix is self-documenting in packet captures and
debuggers. Decode with strings.CutPrefix.
Collapse the library and its cmd/natsbench entrypoint into a single
package main with a main() that calls runCLI. The benchmark is now run
directly with 'go run ./coderd/x/nats/natsbench/'. Tests still live in
the same directory and continue to pass.
…comments

- Rename awaitReadiness -> awaitTopologyReady, readinessConverged ->
  isReady, readinessShortfall -> unreadySubscribers.
- Give each plan field its own comment line.
- Note why probe flushing dedupes pubNode (it is indexed by publisher,
  so multiple publishers share a node).
Compute the sorted distinct publisher and subscriber node sets once in
buildPlan (plan.pubNodes / plan.subNodes) instead of recomputing
uniqueInts at each call site, including on every iteration of the
readiness gate loop. Several publishers or subscribers can share a node,
so per-node work (flushing, burst sizing) needs the deduped set.
Drop the redundant Scenario column (the payload group header and the
Replicas column already identify each row) and the always-zero Drops
and always-empty Notes columns. A Status column is now included only
for groups that contain an invalid run, so clean matrices render as a
compact four-column table.
Pad every table cell to its column's widest value so the raw markdown
also lines up in a fixed-width terminal, instead of relying on a
markdown viewer to align ragged pipes. Numeric columns stay
right-aligned and the Status column is left-aligned.
The standard matrix now runs with 3 publisher and 3 subscriber
connections (DefaultConns) to match the prior natsbench harness, which
spreads same-subject hashing across connections and raises single-node
throughput over the production 1/1 default. New -publish-conns and
-subscribe-conns flags apply to every run, so 1/1 production behavior is
still reproducible with -publish-conns 1 -subscribe-conns 1.
Drop the trailing colons from table separator rows. Cells are already
padded for terminal alignment, so the GitHub markdown alignment hints
added visual noise without changing the rendered terminal output.
Add Subjects, Publishers, and Subscribers columns to the report so the
workload shape is explicit. The default matrix holds these constant, but
named-scenario overrides and custom runs vary them, and a table that
hides the shape is easy to misread.
probeNode ran string(payload) on every delivered message, allocating a
full copy of the (up to 64 KiB) payload per delivery. At high fan-out
this dominated runtime via GC pressure and understated throughput by up
to ~10x. Compare the probe prefix as bytes against a package-level byte
slice and convert only the tiny trailing node index to a string, and
only for actual probes, so benchmark payloads cost no allocation.
@sreya sreya requested a review from spikecurtis June 16, 2026 01:45
@sreya

sreya commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/coder-agents-review

@coder-agents-review

coder-agents-review Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Chat: Review posted | View chat
Requested: 2026-06-16 03:53 UTC by @sreya
Spend: $43.74 / $100.00

Review history
  • R1 (2026-06-16): 1 reviewers, 1 P2, COMMENT. Review
  • R2 (2026-06-16): 16 reviewers, 5 Nit, 2 P2, 8 P3, COMMENT. Review
  • R3 (2026-06-16), 5 Nit, 2 P2, 8 P3, COMMENT. Review

deep-review v0.7.1 | Round 3 | 3da8226..16222be

Last posted: Round 3, 15 findings (2 P2, 8 P3, 5 Nit), COMMENT. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P2 Author fixed (c13018d) natsbench.go:25 Config.Messages doc says "Zero means DefaultMessages" but validate() rejects zero R1 Netero Yes
CRF-2 P2 Author fixed (16222be) scenario.go:38 Default scenarios never exercise cross-node routing (Subjects % Replicas == 0 for all combinations) R2 Hisoka Yes
CRF-3 P3 Author fixed (16222be) sizing.go:114 applySizing omits too-small warning for LocalQueueBytes unlike sibling fields R2 Mafuuu P3, Knov P3 Yes
CRF-4 P3 Author fixed (16222be) scenario.go:46 DefaultScenarios omits Timeout, a required validated field R2 Meruem P3, Hisoka Nit, Kite Note Yes
CRF-5 P3 Open report.go:100 GFM separator row ignores alignment metadata R2 Meruem P3, Zoro P3 Yes
CRF-6 P3 Author fixed (16222be) report.go:191 formatInt reimplements humanize.Comma from existing dependency R2 Robin P3, Zoro Nit Yes
CRF-7 P3 Author fixed (16222be) readiness.go:27 probePrefixBytes and probeNode comments restate implementation R2 Gon Yes
CRF-8 P3 Open workload.go:316 goroutineDump uses unnamed buffer constants R2 Gon Yes
CRF-9 P3 Author fixed (16222be) workload.go:166 Drop-invalidation runtime path has no test R2 Chopper P3, Bisky P4 Yes
CRF-10 P3 Author fixed (16222be) main.go:88 context.Background() means SIGINT cannot cancel running benchmarks R2 Kite Yes
CRF-11 Nit Author fixed (16222be) workload.go:205 wg.Add/Done pattern can use Go 1.25+ wg.Go() R2 Ging-Go Yes
CRF-12 Nit Author fixed (16222be) report.go:193 HasPrefix + manual slice can use strings.CutPrefix R2 Ging-Go Yes
CRF-13 Nit Open report_internal_test.go:136 Two NotContains assertions test strings RenderMarkdown never produces R2 Bisky Yes
CRF-14 Nit Open topology.go:22 "Close" in closeAll doc is ambiguous R2 Gon Yes
CRF-15 Nit Open workload.go:149 "subscribe subscriber" stutters verb and noun R2 Leorio Yes

Law analysis

Effective LOC: 2137. Head SHA: 38fb5d3. Verdict: Don't split. Enforcement: Advisory.

Round log

Round 1

Netero-only. 1 P2. Reviewed against 3da8226..38fb5d3.

Round 2

Panel (16 reviewers). CRF-1 addressed. 1 P2, 8 P3, 5 Nit new. Reviewed against 3da8226..c13018d.

Round 3

BLOCKED. 9 addressed, 5 silent (CRF-5, CRF-8, CRF-13, CRF-14, CRF-15). No review.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Law decomposition
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass review (Netero). This is a mechanical scan only; the full review panel has not yet reviewed this PR. One P2 finding on a misleading doc comment. After this is addressed, the panel will review.

The tool is well-structured: deterministic plan with exact accounting, bounded waits with diagnostics, clean error paths, and 30% test density. The architecture is sound for an internal benchmark CLI.

"A caller constructing Config{Messages: 0, ...} and calling Run() gets 'messages must be at least 1, got 0'." (Netero)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/nats/natsbench/natsbench.go Outdated
type Config struct {
// Messages is the TOTAL number of messages across all publishers,
// split evenly with the remainder assigned to publisher 0. Zero
// means DefaultMessages.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 [CRF-1] The doc comment says "Zero means DefaultMessages" but validate() at line 94 rejects Messages < 1, and Run() calls validate() before any defaulting. A direct caller of Run() with Messages: 0 gets an error, not the default. The CLI layer handles the zero-to-default conversion in cliRun.scenarios(), so the tool works, but this comment makes a false contract claim on the Config type.

Either remove "Zero means DefaultMessages" from the comment, or add a defaulting step in Run() before validation (e.g. if cfg.Messages == 0 { cfg.Messages = DefaultMessages }).

(Netero)

🤖

Run validates a fully populated config and applies no defaults, so the
Messages comment claiming 'Zero means DefaultMessages' was a false
contract (validate rejects Messages < 1). Likewise PublishConns and
SubscribeConns do not default in natsbench; zero passes through to
nats.Options, which applies the single-connection default. Clarify on
the Config type that defaulting happens in the CLI, required fields
must be set, and only LocalQueue*/MaxPending are derived when zero.
@sreya

sreya commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panel review (round 2, 16 reviewers). CRF-1 from round 1 verified fixed. 1 P2, 8 P3, 5 Nit.

The tool is well-engineered for its scope: deterministic workload planning with exact accounting, bounded waits with rich diagnostics, probe-based readiness gating, and adaptive buffer sizing. The concurrency model (Killua, Takumi) is correct: atomics with no contention on the hot path, clean shutdown ordering, correct sync.Once usage for completion signaling. The problem-solution fit (Pariston) is solid with no simpler alternative that would handle multi-replica readiness probing and drop detection.

The main finding: the default scenario matrix never exercises cross-node message routing because Subjects (10) is always divisible by every Replicas value (1, 5, 10). The readiness gate runs but proves nothing; the throughput numbers compare resource contention, not routing cost.

"The readiness gate code is exercised but its invariant (cross-route interest has propagated) is never actually tested." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/nats/natsbench/scenario.go Outdated

var scenarios []Scenario
for _, payload := range payloads {
for _, replicas := range []int{1, 5, 10} {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 [CRF-2] The default scenario matrix never exercises cross-node message routing. The round-robin assignment pubNode[i] = i % Replicas and subNode[j] = j % Replicas co-locates every publisher-subscriber pair on the same subject whenever Subjects % Replicas == 0. The default matrix uses Subjects=10 with Replicas=1, 5, 10, all divisors of 10. Verified: zero cross-node pairs across all default scenarios.

Consequences: (1) the readiness gate runs for Replicas>1 but every subscriber's required probe set is its own local node, so probes satisfy instantly without proving route propagation; (2) throughput numbers at different replica counts measure resource contention, not routing cost; (3) TestRunCluster uses Subjects=3, Replicas=3, same divisibility, same co-location, so the cross-node delivery path has no integration coverage.

Fix: choose a Replicas value that does not divide Subjects for at least one scenario. Replicas=3 with Subjects=10 produces cross-node pairs and forces the readiness gate to prove route propagation.

(Hisoka)

🤖

Comment thread coderd/x/nats/natsbench/sizing.go Outdated
slog.F("derived_msgs", wantQueue),
)
}
if cfg.LocalQueueBytes <= 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-3] applySizing omits the too-small warning for LocalQueueBytes, breaking the pattern the other two sizing fields establish. Both LocalQueueMsgs (lines 106-113) and MaxPending (lines 119-127) warn when the caller's explicit value is below the derived size. LocalQueueBytes only derives when zero and silently accepts any explicit value. A programmatic caller who sets it too small gets silent drops and "invalid run: dropped-message signals" with no hint the sizing override caused it.

Fix: add a case cfg.LocalQueueBytes < derivedQueueBytes(pl, cfg.PayloadSize) warning branch matching the other two fields.

(Mafuuu P3, Knov P3)

🤖

scenarios = append(scenarios, Scenario{
Name: fmt.Sprintf("%s-%dr", payload.label, replicas),
Config: Config{
Messages: messages,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-4] DefaultScenarios() returns Config structs without setting Timeout, a required validated field. Config.validate() rejects Timeout <= 0, and the CRF-1 fix (c13018d) just documented "every required field must be set." The CLI always fills Timeout before calling Run, so the tool works. But a caller who feeds DefaultScenarios() directly to Run() gets a validation error with no hint that Timeout is the caller's responsibility.

Fix: either set a default Timeout in DefaultScenarios() or document the omission in the function comment.

(Meruem P3, Hisoka Nit, Kite Note)

🤖

writeRow(table[0])

// Separator row: plain dashes sized to each column.
_, _ = b.WriteString("|")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-5] The GFM separator row does not express right-alignment for numeric columns. writeAlignedTable computes per-column alignment and pads cells accordingly, but the separator emits plain --- for every column. GFM requires ---: for right-alignment. The raw terminal output looks correct (cell padding works), but rendered markdown (e.g., pasted into a PR description, the documented use case) left-aligns all columns, making rate comparisons harder.

Fix: use the aligns slice already passed to the function:

if aligns[col] == alignRight && width > 1 {
    sep = strings.Repeat("-", width-1) + ":"
}

(Meruem P3, Zoro P3)

🤖

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding is still unaddressed. The separator row still emits plain --- for all columns. Please fix, explicitly decline, or file a ticket.

🤖

Comment thread coderd/x/nats/natsbench/report.go Outdated
}

// formatInt renders an integer with comma thousands separators.
func formatInt(n int64) string {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-6] formatInt reimplements humanize.Comma(int64) from github.com/dustin/go-humanize, which is already in go.mod and used elsewhere in the codebase (coderd/x/chatd/chatd.go, coderd/autobuild/, coderd/database/dbpurge/). The 17-line hand-rolled copy can be replaced with humanize.Comma(n). formatRate could similarly use humanize.Commaf(rate) or humanize.Comma(int64(math.Round(rate))).

(Robin P3, Zoro Nit)

🤖

Comment thread coderd/x/nats/natsbench/workload.go Outdated
errCh := make(chan error, len(w.pl.perPubMsgs))
var wg sync.WaitGroup
for i := range w.pl.perPubMsgs {
wg.Add(1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-11] wg.Add(1); go func(){ defer wg.Done(); ... }() is the pre-1.25 WaitGroup pattern. wg.Go(func(){ ... }) does the same in one call. go.mod declares 1.26.

(Ging-Go)

🤖

Comment thread coderd/x/nats/natsbench/report.go Outdated
// formatInt renders an integer with comma thousands separators.
func formatInt(n int64) string {
s := strconv.FormatInt(n, 10)
neg := strings.HasPrefix(s, "-")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-12] HasPrefix + manual slice is the pre-1.20 pattern. s, neg := strings.CutPrefix(s, "-") collapses three lines into one.

(Ging-Go)

🤖

require.Contains(t, out, header)
}
require.NotContains(t, out, "Status")
require.NotContains(t, out, "Scenario")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-13] require.NotContains(t, out, "Scenario") and require.NotContains(t, out, "Drops") test strings that RenderMarkdown never produces under any input. Unlike the "Status" check on line 135 (which verifies the column is omitted for all-clean groups), these guard against columns that were never implemented. Either remove or add a comment noting future-proofing intent.

(Bisky)

🤖

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding is still unaddressed. The two NotContains assertions for "Scenario" and "Drops" are still present. Please remove, add a justifying comment, or explicitly decline.

🤖

nodes []*nats.Pubsub
}

// closeAll shuts down every node. Close is idempotent and always

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-14] "Close is idempotent" is ambiguous. It could refer to closeAll itself or the underlying node.Close(). Write Pubsub.Close or node.Close() to clarify the referent.

(Gon)

🤖

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding is still unaddressed. "Close is idempotent" still reads ambiguously. Please clarify or explicitly decline.

🤖

node := w.top.nodes[w.pl.subNode[j]]
cancel, err := node.SubscribeWithErr(subjectName(w.pl.subSubject[j]), w.listener(st))
if err != nil {
return xerrors.Errorf("subscribe subscriber %d: %w", j, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-15] "subscribe subscriber %d" stutters the verb and noun. Reads as subscribe subscriber 3: connection refused. Consider "register subscriber %d: %w" or "subscribe: subscriber %d: %w" to separate them.

(Leorio)

🤖

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding is still unaddressed. The error message still reads "subscribe subscriber %d". Please fix or explicitly decline.

🤖

@sreya sreya removed the request for review from spikecurtis June 16, 2026 02:57
- Default matrix now uses replica counts (1, 3, 9) coprime with the
  subject count (10) so cluster scenarios actually exercise cross-node
  routing; previously divisor counts co-located every pub/sub pair and
  the readiness gate proved nothing. TestRunCluster likewise uses
  coprime Subjects/Replicas for cross-node integration coverage.
- applySizing now warns when an explicit LocalQueueBytes is below the
  derived size, matching LocalQueueMsgs and MaxPending.
- Wire SIGINT/SIGTERM cancellation through the CLI; the run loop stops
  launching scenarios once interrupted instead of emitting confusing
  topology errors. Move os.Exit out of the deferred-stop scope.
- Replace hand-rolled formatInt with humanize.Comma.
- Add unit tests for the drop-invalidation path (dropState, listener
  drop accounting, awaitPhase fail-fast).
- Trim probe comments to the why; use wg.Go for publisher goroutines.
- Document that DefaultScenarios leaves Timeout unset for the caller.
@sreya

sreya commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 3 is blocked. The author addressed 9 of 14 findings from round 2 (including the P2 cross-node routing fix, signal cancellation, drop-path testing, and several P3s). Five findings remain without a response or code change. Further review is blocked until these are addressed, explicitly declined, or deferred with a ticket.

Silent findings needing a response:

  1. CRF-5 (P3, report.go:100): GFM separator row ignores right-alignment metadata. The function's doc says "GitHub-flavored markdown table" but numeric columns render left-aligned in GitHub. Fix, decline, or file a ticket.

  2. CRF-8 (P3, workload.go:314): goroutineDump uses unnamed 1<<20 and 16<<20 constants. Named constants would make the growth strategy self-documenting. Fix or decline.

  3. CRF-13 (Nit, report_internal_test.go:126): Two NotContains assertions test strings RenderMarkdown never produces. Remove or add a comment noting future-proofing intent.

  4. CRF-14 (Nit, topology.go:22): "Close is idempotent" is ambiguous (closeAll vs node.Close). Clarify the referent.

  5. CRF-15 (Nit, workload.go:149): "subscribe subscriber %d" stutters verb and noun. Consider "register subscriber %d" or "subscribe: subscriber %d".

🤖 This review was automatically generated with Coder Agents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant