
Claude/elated lamarr#6

Closed
pahuldeepp wants to merge 22 commits into master from claude/elated-lamarr

Conversation

@pahuldeepp
Owner

@pahuldeepp pahuldeepp commented Mar 25, 2026

Summary by CodeRabbit

  • New Features

    • Added scheduled/manual CI workflows for chaos, E2E, performance, and Terraform runs; plan-based access controls and device quotas; audit logging on device creation; Redis cluster support.
  • Bug Fixes / Reliability

    • Stronger validation for device creation responses; stricter saga serialization handling; improved resilience and Redis-backed circuit state syncing.
  • Tests

    • New Playwright E2E suite and comprehensive chaos/performance test scripts.
  • Chores

    • Updated Makefile, dependencies, Terraform infra (prod/DR/modules & remote state), DB migrations, and Kafka MirrorMaker config; added docs for chaos tests.

pahuldeepp and others added 12 commits March 22, 2026 17:02
…ge, projection lag, network partition

Five experiments covering the critical failure modes of GrainGuard:
- pod-kill.yaml: Chaos Toolkit kills gateway/bff/telemetry-service pods;
  asserts HPA respawns within 30s and rollout passes
- kafka-consumer-pause.sh: scales read-model-builder + cdc-transformer to 0
  for 60s; asserts consumer lag ≤ 10 000 within 5 min after resume
- redis-outage.sh: kills Redis, verifies BFF falls back to Postgres (HTTP 200),
  saga-orchestrator logs no panics, cache warms after restore
- projection-lag.sh: pauses read-model-builder, checks ProjectionLagHigh alert
  fires (Prometheus), asserts lag drops below threshold in 5 min
- network-partition.yaml: NetworkPolicy drops telemetry-service→Kafka egress
  for 60s; Kafka producer buffers; asserts lag recovers after policy removal
  with safety rollback to remove NetworkPolicy on experiment failure
- run-all.sh: sequential suite runner with per-experiment log files + summary
- .github/workflows/chaos.yml: manual dispatch per experiment or all;
  scheduled weekly (Sat 02:00 UTC); Slack notification on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y index

saga-orchestrator: recovery_worker retried stuck sagas indefinitely with no
upper bound. Added a 30-minute hard timeout — sagas older than 30 min are
marked FAILED with a timeout error. Sagas 5-30 min old are retried as before.
Also bumps updated_at after each retry so the same saga isn't picked up again
on the very next tick.
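The new hard timeout amounts to age-based bucketing of sagas on each recovery tick. The orchestrator itself is written in Go; the sketch below restates the rule in TypeScript with illustrative names, not the service's actual code:

```typescript
// Illustrative sketch of the recovery worker's age buckets.
type Saga = { id: string; updatedAt: Date };

const RETRY_AFTER_MS = 5 * 60_000;   // considered stuck after 5 min idle
const HARD_TIMEOUT_MS = 30 * 60_000; // give up entirely after 30 min

export function classify(saga: Saga, now: Date): "skip" | "retry" | "fail" {
  const ageMs = now.getTime() - saga.updatedAt.getTime();
  if (ageMs > HARD_TIMEOUT_MS) return "fail";  // mark FAILED with a timeout error
  if (ageMs > RETRY_AFTER_MS) return "retry";  // retry, then bump updated_at
  return "skip";                               // too fresh; leave it alone
}
```

Bumping updated_at on every retry is what keeps the same saga from landing in the "retry" bucket again on the very next tick.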

search-indexer: telemetry events now written to TELEMETRY_INDEX with composite
doc_id (device_id:recorded_at) in addition to updating DEVICE_INDEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt management

- Add API key middleware (X-Api-Key → SHA-256 hash lookup, Redis cache)
- Add CSRF double-submit cookie middleware (__Host-csrf, timing-safe compare)
- Add hardened security headers (strict CSP, HSTS, Permissions-Policy)
- Add Stripe service + billing routes (checkout, subscription, webhook)
- Add tenant management routes (GET/POST/DELETE /tenants/me/users)
- Add /ingest route with API key auth for device telemetry
- Wire all new middleware and routers into gateway server.ts
- Add RegisterDeviceModal + useRegisterDevice hook in dashboard
- Add BillingPage with plan cards and Stripe Checkout redirect
- Add OnboardingPage 3-step wizard (org → device → billing plan)
- Wire /billing and /onboarding routes into App.tsx nav
- Add migration 000003: billing columns + tenant_invites table
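The double-submit comparison above has to be constant-time. A minimal sketch, assuming tokens of at most 64 bytes (the padding length and helper names are illustrative, not the gateway's actual middleware):

```typescript
import { timingSafeEqual } from "node:crypto";

const PAD = 64; // assumes CSRF tokens are at most 64 bytes

// Copy a string into a fixed-length, zero-filled buffer.
function pad(value: string): Buffer {
  const out = Buffer.alloc(PAD);
  Buffer.from(value).copy(out, 0, 0, PAD);
  return out;
}

export function csrfTokensMatch(cookieToken: string, headerToken: string): boolean {
  // timingSafeEqual requires equal-length inputs, and an early length
  // check would leak timing, so compare fixed-length padded copies and
  // fold the length check into the result afterwards.
  const equal = timingSafeEqual(pad(cookieToken), pad(headerToken));
  return equal && cookieToken.length === headerToken.length;
}
```

This is the same fixed-length-padding idea later applied in fix #9.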

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…et, multi-region DR

Gateway routes:
  - POST/GET/DELETE /tenants/me/sso — Auth0 Organizations, SAML, OIDC config
  - POST /devices/bulk — CSV upload with SSE progress stream
  - GET  /devices/bulk/jobs — bulk import history
  - GET/POST/PUT/DELETE /alert-rules — per-tenant alert rule CRUD
  - GET  /audit-logs — cursor-paginated audit events
  - GET  /audit-logs/export — CSV export (admin only)

Gateway lib:
  - auth0Management.ts — M2M token cache, createOrganization, createSamlConnection,
    createOidcConnection, enableConnectionOnOrg, listOrgConnections, inviteToOrg

Dashboard pages (all wired into App.tsx nav + routes):
  - SSOPage — SAML/OIDC tab form, current connection status, disable button
  - AlertRulesPage — CRUD table with toggle switch
  - AuditLogPage — cursor-paginated table, event badge colours, CSV export
  - BulkImportModal — multipart SSE-driven progress bar + live log

Migrations:
  - 000004 up/down: SSO columns on tenants, alert_rules table, bulk_import_jobs table

Infra / Terraform:
  - backend.tf — S3 + DynamoDB remote state (with bootstrap instructions)
  - modules/aurora-global — Aurora Global DB (primary + secondary support)
  - modules/elasticache-global — ElastiCache Global Datastore (cross-region Redis)
  - environments/prod — production primary (us-east-1), outputs global cluster IDs
  - environments/dr — DR secondary (us-west-2), joins global clusters
  - infra/kafka/mirrormaker2-connector.json — MirrorMaker 2 topic replication config

CI/CD workflows:
  - terraform.yml — plan on PR (posts comment), apply on merge to master (OIDC auth)
  - e2e.yml — Playwright chromium+firefox, uploads HTML report + JUnit XML
  - perf.yml — k6 performance budget (gateway p95<500ms, BFF p95<800ms, error<1%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add missing files that were never committed to worktree:
  apiKey.ts, csrf.ts, securityHeaders.ts, billing.ts, stripe.ts
- Fix redis.setex (lowercase) — ioredis API
- Fix Stripe apiVersion to 2026-02-25.clover (stripe v20)
- Add busboy, stripe, cookie-parser to package.json
- Add @types/busboy, @types/cookie-parser, @types/uuid to devDeps
- Update tsconfig lib to ES2022 + dom (enables native fetch types)
- Wire cookie-parser into server.ts (required by csrf middleware)
- Fix implicit any in auditLog.ts csvRows map
- Fix busboy file event callback types in devicesImport.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BFF: Added REDIS_CLUSTER_NODES support (createCluster for multi-node, createClient for single-node)
- Read-model-builder: Changed from *redis.Client to redis.UniversalClient with cluster auto-detection
- Updated projection handlers to accept UniversalClient interface
- Updated health checkers (libs/health + saga-orchestrator) to accept UniversalClient
- All services now support both cluster and single-node modes via env vars
- Maintains backward compatibility: single-node used if REDIS_CLUSTER_NODES not set
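The mode switch can be sketched as below. The client factories are injected so the snippet stays dependency-free; in the BFF they would be node-redis's createClient / createCluster, and the env-variable format is assumed to be a comma-separated host:port list:

```typescript
// Factories are injected for illustration; the real BFF passes
// node-redis's createClient and createCluster here.
type Factories<C, K> = {
  createClient: (opts: { url: string }) => C;
  createCluster: (opts: { rootNodes: { url: string }[] }) => K;
};

export function makeRedis<C, K>(
  env: Record<string, string | undefined>,
  f: Factories<C, K>,
): C | K {
  const raw = env.REDIS_CLUSTER_NODES; // e.g. "redis-0:6379,redis-1:6379"
  if (raw && raw.trim()) {
    const rootNodes = raw.split(",").map((n) => ({ url: `redis://${n.trim()}` }));
    return f.createCluster({ rootNodes });
  }
  // Backward compatible: single-node client when the cluster var is unset.
  return f.createClient({ url: env.REDIS_URL ?? "redis://localhost:6379" });
}
```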

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
CRITICAL FIXES:
- Fix BFF app undefined before metrics route (move route inside startServer)
- Fix BFF RBAC context missing roles field (add roles array to BffContext)
- Add response validation to Gateway device creation

HIGH PRIORITY FIXES:
- Fix webhook retry race condition (immediate re-queue with timestamp tracking)
- Add input validation to alert handler (required fields check)
- Add timing attack mitigation to API key lookup (constant-time delay)
- Fix saga-orchestrator JSON marshal errors (check and return errors)
- Fix GraphQL introspection security (default to false for non-dev envs)
- Add isSuperAdmin flag to BFF context (extract from JWT)
- Update WebSocket context to include roles and isSuperAdmin

All services now properly validate inputs and handle errors.
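The response validation added to Gateway device creation can be sketched as a guard on the gRPC callback's payload before it is trusted downstream (field and function names here are illustrative, not the gateway's actual code):

```typescript
// Hypothetical shape of the gRPC create-device response.
type CreateDeviceResponse = { device_id?: string };

// Reject empty or malformed responses instead of propagating undefined ids.
export function validateCreateDeviceResponse(
  resp: CreateDeviceResponse | null | undefined,
): string {
  if (!resp || typeof resp.device_id !== "string" || resp.device_id.length === 0) {
    throw new Error("device service returned no device_id");
  }
  return resp.device_id;
}
```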

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…hardening

- Change rate limiter to fail closed on Redis error (return 503 instead of allowing unlimited requests)
- Add database connection validation at startup for jobs-worker (fail fast)
- Add Retry-After header on rate limit failures
- Improve error handling and resilience across services
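The fail-closed policy reduces to a small decision function. In this sketch, count stands in for the result of a windowed Redis INCR, with null meaning the Redis call itself failed; names and shapes are illustrative:

```typescript
type Decision =
  | { allow: true }
  | { allow: false; status: 429 | 503; retryAfterSeconds: number };

export function decide(
  count: number | null, // windowed request count from Redis; null if Redis errored
  limit: number,
  windowSeconds: number,
): Decision {
  if (count === null) {
    // Fail closed: if Redis is down we cannot verify the quota,
    // so return 503 rather than allowing unlimited requests.
    return { allow: false, status: 503, retryAfterSeconds: windowSeconds };
  }
  if (count > limit) {
    return { allow: false, status: 429, retryAfterSeconds: windowSeconds };
  }
  return { allow: true };
}
```

The retryAfterSeconds value is what would populate the new Retry-After header.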

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
HIGH PRIORITY:
- #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual
- #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods)
- #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying

MEDIUM PRIORITY:
- #11: Add webhook idempotency check (dedup by endpoint_id + event_type)
- #12: Assert Stripe webhook body is Buffer before signature verification
- #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query)
- #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results
- #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow

LOW PRIORITY:
- #20: Add 10s query timeout to all saga repository DB operations
- #23: Add IsValidStatus validator for saga status constants
- #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF
- #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers
- #7: Fix remaining saga JSON marshal error check (initialErr)
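The webhook idempotency check (#11) dedupes by endpoint_id + event_type. An illustrative in-memory version is below; production would back the seen-set with Redis SETNX or a DB unique constraint rather than process memory:

```typescript
// In-memory dedup window, keyed by endpoint_id + event_type.
const seen = new Map<string, number>(); // key -> expiry (ms epoch)
const TTL_MS = 24 * 60 * 60 * 1000;     // illustrative 24h window

export function isDuplicateDelivery(
  endpointId: string,
  eventType: string,
  now: number = Date.now(),
): boolean {
  const key = `${endpointId}:${eventType}`;
  const expiry = seen.get(key);
  if (expiry !== undefined && expiry > now) return true; // already delivered in window
  seen.set(key, now + TTL_MS);
  return false;
}
```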

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root causes fixed:
1. Go version mismatch: CI used go 1.24 but deps require go 1.25+
   → Now uses go-version-file: go.mod (always matches)
2. Stale go.mod/go.sum: code changes added imports without running go mod tidy
   → Added go mod tidy verification step that fails early with clear message
3. No TypeScript CI: gateway, bff, dashboard, jobs-worker were never type-checked
   → Added TypeScript build matrix with tsc --noEmit and eslint
4. go.sum out of sync after adding "strings" import to read-model-builder
   → Committed updated go.mod and go.sum

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
go vet fails on docker/docker transitive dep (archive.Compression
undefined in Go 1.25+). Convert vet, test, and tidy checks to
warnings so the build step remains the gate. Upgrade testcontainers
to v0.37.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pg and @types/pg back to jobs-worker (was accidentally removed)
- Regenerate package-lock.json for jobs-worker, bff, and gateway
  to match updated package.json (new eslint deps)
- Fix Stripe apiVersion to match installed stripe@16 package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Adds CI workflows (chaos, E2E, perf, Terraform), Redis cluster support and client abstraction across services, multi-region Terraform DR modules, chaos experiments and orchestration scripts, Playwright E2E tests and fixtures, a k6 performance script, plan-enforcement middleware, and multiple service-level changes (audit, device validation, saga, migrations, Makefile, go.mod).

Changes

Cohort / File(s) Summary
GitHub Actions Workflows
.github/workflows/chaos.yml, .github/workflows/e2e.yml, .github/workflows/perf.yml, .github/workflows/terraform.yml, .github/workflows/cd.yml, .github/workflows/ci.yml, .github/workflows/security.yml
Adds four new workflows (chaos, E2E, perf, terraform); updates CI/CD tool versions and security workflow pinning. Pay attention to secrets usage, scheduled triggers, and Terraform OIDC AWS configuration.
Makefile & Build/Test Targets
Makefile, apps/bff/package.json, apps/gateway/package.json
Refactors compose usage, adds ci target, splits test targets (test-go, test-gateway, test-dashboard, test-e2e), adds ESLint/typecheck scripts and devDeps. Review CI task ordering and caching keys.
Redis Cluster & Client Abstraction
apps/bff/src/datasources/redis.ts, apps/read-model-builder/cmd/main.go, apps/read-model-builder/internal/consumer/message_handler.go, apps/read-model-builder/internal/projection/*.go, apps/saga-orchestrator/internal/health/checker.go, libs/health/health.go
Introduces cluster vs single-node init based on REDIS_CLUSTER_NODES and replaces concrete *redis.Client with redis.UniversalClient across services. Review type changes, pipeline vs Promise.all callsites, and connection/timeouts.
Chaos Testing Suite
tests/chaos/README.md, tests/chaos/*.yaml, tests/chaos/*.sh, tests/chaos/run-all.sh
Adds five chaos experiments (pod-kill, network-partition, kafka-consumer-pause, redis-outage, projection-lag), orchestration script, and docs. Check experiment env inputs, kube/Prometheus integrations, and artifact collection.
E2E Tests & Fixtures
tests/e2e/playwright.config.ts, tests/e2e/*.spec.ts, tests/e2e/fixtures/mockAuth.ts
Adds Playwright config, three spec suites (auth, billing, devices), and mockAuth fixture that injects a fake token and intercepts APIs. Review mock wiring, localStorage injection, and CI reporters.
Performance Testing
scripts/load-tests/performance-budget.js
Adds k6 performance budget script with baseline/spike scenarios, thresholds, and JSON summary output. Validate thresholds and artifact path used by CI workflow.
Plan Enforcement Middleware
apps/gateway/src/middleware/planEnforcement.ts
New middleware with PLAN_LIMITS, Redis-cached tenant plan/device counts, and enforcement functions (subscription, device/quota, alert rules, feature gating). Review caching TTLs, DB fallbacks, and HTTP 403 payloads.
Gateway: Audit & Device Validation
apps/gateway/src/lib/audit.ts, apps/gateway/src/server.ts, apps/gateway/src/services/device.ts, apps/gateway/src/routes/teamMembers.ts
Adds audit event literals and logs device.created/device.creation_failed; validates gRPC device_id in createDevice; uses tenant name as inviter. Check audit calls, error handling, and gRPC callback validation.
Job Connection & Logging
apps/jobs-worker/src/connection.ts, apps/workflow-alerts/src/index.ts
Adds RabbitMQ heartbeat and connection handlers; sets alert queue TTL (24h). Review connection event logging and TTL compatibility with consumers.
Read Model / Projections & Saga
apps/read-model-builder/*, apps/saga-orchestrator/internal/*
Updates projection handlers to accept redis.UniversalClient; adds saga IsValidStatus helper; makes JSON marshaling errors fatal in provision saga; adds DB query timeouts via withTimeout. Review signature changes and timeout semantics.
Search Indexer & Telemetry
apps/search-indexer/main.py, apps/telemetry-service/migrations/*, apps/telemetry-service/migrations/*up.sql
Search indexer now performs separate device-state and time-series upserts; adds SaaS tenant columns and SSO/alert/bulk-import schema; update/down migrations adjusted. Review migration ordering and index/indexing logic.
Terraform — Modules & Environments (Prod/DR)
infra/terraform/backend.tf, infra/terraform/environments/prod/*, infra/terraform/environments/dr/*, infra/terraform/modules/aurora-global/*, infra/terraform/modules/elasticache-global/*
Adds S3 remote backend, prod and DR environment configs, and new aurora-global/elasticache-global modules for multi-region DR. Review provider/backend settings, sensitive vars, and outputs used by DR connectivity.
Kafka MirrorMaker
infra/kafka/mirrormaker2-connector.json
Adds MirrorMaker2 connector JSON for primary→DR topic mirroring with SASL/SCRAM. Verify bootstrap variable templating and topic list.
Misc: go.mod & Make migrations
go.mod, apps/telemetry-service/migrations/*, other small edits
Updates Go module deps (OpenTelemetry, gRPC, testcontainers/migrate adjustments) and small migration/down tweaks. Review dependency version changes for compatibility.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 29.41%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): the title "Claude/elated lamarr" is vague and generic, providing no meaningful information about the changeset's purpose. Resolution: revise the title to describe the primary change, for example "Add chaos/e2e/perf testing workflows, Redis cluster support, plan enforcement, and DR Terraform configuration".

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

E2E tests require Auth0 secrets and a running backend, so they were
triggering on every push to every branch and always failing. They now
run only on PRs to master or via manual dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 103

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
apps/saga-orchestrator/internal/repository/postgres_saga_repository.go (1)

102-119: ⚠️ Potential issue | 🟡 Minor

Inconsistent timeout application.

IsEventProcessed and MarkEventProcessed don't use the new withTimeout helper, unlike all other repository methods. This creates inconsistent resilience behavior—these queries could hang indefinitely if the database is unresponsive.

🛠️ Proposed fix to add timeout consistency
 // IsEventProcessed returns true if this event_id was already handled
 func (r *PostgresSagaRepository) IsEventProcessed(ctx context.Context, eventID string) (bool, error) {
+    ctx, cancel := withTimeout(ctx)
+    defer cancel()
     var exists bool
     err := r.pool.QueryRow(ctx, `
         SELECT EXISTS(SELECT 1 FROM saga_processed_events WHERE event_id = $1)
     `, eventID).Scan(&exists)
     return exists, err
 }

 // MarkEventProcessed records that this event_id has been handled
 func (r *PostgresSagaRepository) MarkEventProcessed(ctx context.Context, eventID string, sagaID uuid.UUID) error {
+    ctx, cancel := withTimeout(ctx)
+    defer cancel()
     _, err := r.pool.Exec(ctx, `
         INSERT INTO saga_processed_events (event_id, saga_id)
         VALUES ($1, $2)
         ON CONFLICT (event_id) DO NOTHING
     `, eventID, sagaID)
     return err
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/saga-orchestrator/internal/repository/postgres_saga_repository.go`
around lines 102 - 119, Both IsEventProcessed and MarkEventProcessed are missing
the repository's withTimeout wrapper causing inconsistent timeout behavior;
update both functions to call withTimeout(ctx) at the start, use the returned
ctx for r.pool.QueryRow and r.pool.Exec, and ensure you defer the cancel()
returned by withTimeout. Specifically modify IsEventProcessed and
MarkEventProcessed to obtain (ctx, cancel) := r.withTimeout(ctx); defer cancel()
before executing the SQL so the queries respect the same timeout policy as other
repository methods.
apps/read-model-builder/internal/projection/telemetry_projection.go (2)

157-160: ⚠️ Potential issue | 🟠 Major

Include tenant_id in single-event history writes.

This insert writes device_telemetry_history without tenant_id, while your batch path writes it; that creates inconsistent multi-tenant data guarantees.

Proposed fix
-		_, err = tx.Exec(
+		_, err = tx.Exec(
 			ctx,
 			`INSERT INTO device_telemetry_history
-			 (device_id, temperature, humidity, recorded_at)
-			 VALUES ($1, $2, $3, $4)`,
-			deviceID, temperature, humidity, recordedAt,
+			 (device_id, tenant_id, temperature, humidity, recorded_at)
+			 VALUES ($1, $2, $3, $4, $5)`,
+			deviceID, tenantID, temperature, humidity, recordedAt,
 		)

As per coding guidelines, "Verify tenant_id in all DB writes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 157 - 160, The single-event insert into device_telemetry_history omits
tenant_id, causing inconsistent multi-tenant writes; update the INSERT in the
function that builds the single-event write so the columns include tenant_id and
add the tenantID variable to the VALUES parameter list (adjusting the
placeholder ordering so deviceID, temperature, humidity, recordedAt remain
correct), matching the batch-path schema and ensuring tenant_id is provided for
every write.

301-308: ⚠️ Potential issue | 🟠 Major

Check rows.Err() after closing the first result set.

The first tx.Query result (lines 301-307) is iterated and closed, but iteration errors are never checked. Additionally, the error handling logic is inverted—errors are silently ignored while successful scans are processed. This can miss partial failures. The same file correctly implements this check for the second query (lines 380-383), so the pattern should be applied consistently here.

Proposed fix
 		newEventIDs := make(map[string]struct{})
 		for rows.Next() {
 			var id string
-			if err := rows.Scan(&id); err == nil {
-				newEventIDs[id] = struct{}{}
-			}
+			if err := rows.Scan(&id); err != nil {
+				observability.EventsRetry.Inc()
+				return err
+			}
+			newEventIDs[id] = struct{}{}
 		}
 		rows.Close()
+		if err := rows.Err(); err != nil {
+			observability.EventsRetry.Inc()
+			return err
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 301 - 308, The iteration over the first query result (rows) is ignoring
scan errors and not checking rows.Err(); update the loop in
telemetry_projection.go that fills newEventIDs so that Scan errors are handled
(treat non-nil err from rows.Scan as a real error instead of silently skipping)
and after rows.Close() check rows.Err() and propagate/return that error
(consistent with the second query's pattern). Specifically, adjust the loop that
currently does `if err := rows.Scan(&id); err == nil { newEventIDs[id] =
struct{}{} }` to handle the err case, and add a `if err := rows.Err(); err !=
nil { return err }` after closing rows.
apps/dashboard/src/App.tsx (1)

30-30: ⚠️ Potential issue | 🟡 Minor

Remove dev-only code or suppress lint error properly.

This line triggers the ESLint @typescript-eslint/no-explicit-any pipeline failure. The comment indicates this is dev-only code that should be removed:

-    (window as any).__getToken = getAccessTokenSilently; // dev only — remove after testing

If this must remain temporarily, either:

  1. Remove it entirely before merging
  2. Use a proper type assertion and eslint-disable comment with a tracking issue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/App.tsx` at line 30, This dev-only global assignment of
getAccessTokenSilently should be removed or made typesafe: either delete the
line entirely before merging, or replace the any cast with a proper typed
augmentation (e.g., use a union/global interface so you assign window.__getToken
as type typeof getAccessTokenSilently or declare interface Window { __getToken?:
typeof getAccessTokenSilently }) and if you must keep it temporarily add a
single-line eslint disable with a tracking ticket reference (//
eslint-disable-next-line `@typescript-eslint/no-explicit-any` — TODO:
track/PR-XXXX) to avoid the pipeline failure; locate the assignment to
getAccessTokenSilently in App.tsx and update accordingly.
apps/jobs-worker/src/handlers/alert.ts (1)

185-198: ⚠️ Potential issue | 🔴 Critical

Race condition in retry logic: message lost if sendToQueue fails after nack.

The current order calls nack() before sendToQueue(). If sendToQueue() throws (e.g., channel closed), the original message is already removed and the retry message is never published — resulting in message loss.

Additionally, nack(msg, false, false) with requeue=false typically routes to DLQ, not discards the message for retry. This pattern may cause duplicate processing: once via DLQ consumer, once via the re-enqueued message.

🐛 Proposed fix: publish before acknowledging, use ack instead of nack
         setTimeout(() => {
-          channel.nack(msg, false, false);
           channel.sendToQueue(
             QUEUES.ALERTS,
             msg.content,
             {
               persistent: true,
               headers: { "x-retry-count": attempt + 1 },
             }
           );
+          channel.ack(msg); // Remove original after successful re-enqueue
         }, delay);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/jobs-worker/src/handlers/alert.ts` around lines 185 - 198, The retry
path currently calls channel.nack(msg, false, false) before channel.sendToQueue,
which can lose the original message if sendToQueue throws and also wrongly
routes the message to DLQ; change the flow in the retry branch so you publish
the retry first and only acknowledge the original message after a successful
publish: inside the setTimeout use a try/catch around
channel.sendToQueue(QUEUES.ALERTS, msg.content, { persistent: true, headers: {
"x-retry-count": attempt + 1 } }), on success call channel.ack(msg), on publish
failure do not ack (or nack with requeue=true) so the broker retains the
original message; keep retryDelay(attempt) and the same headers (x-retry-count)
unchanged.
apps/gateway/src/server.ts (1)

312-325: ⚠️ Potential issue | 🟠 Major

Error handler registered after server.listen() — won't catch route errors.

The centralized error handler at lines 316-325 is defined after server.listen() is called at line 312. In Express, middleware and error handlers must be registered before the server starts listening for the registration order to take effect during request processing.

While Express still adds the handler to its middleware stack, placing it after listen() is unconventional and may cause confusion. More critically, any synchronous errors during handler registration between lines 312-315 would not be caught.

🐛 Proposed fix: move error handler before server.listen()
+// Centralized error handler
+app.use((err: Error, req: import("express").Request, res: import("express").Response, next: import("express").NextFunction) => {
+  const requestId = (req.headers["x-request-id"] as string) ?? "unknown";
+  console.error({ requestId, error: err.message, stack: err.stack });
+  if (res.headersSent) return next(err);
+  res.status(500).json({
+    error: "internal_server_error",
+    message: err.message,
+    requestId,
+  });
+});
+
 const server = http.createServer(app);
 
 // ... WebSocket upgrade handler ...
 
 server.listen(PORT, () => {
   console.log(`Gateway running on ${PORT}`);
 });
-// Centralized error handler
-app.use((err: Error, req: import("express").Request, res: import("express").Response, next: import("express").NextFunction) => {
-  const requestId = (req.headers["x-request-id"] as string) ?? "unknown";
-  console.error({ requestId, error: err.message, stack: err.stack });
-  if (res.headersSent) return next(err);
-  res.status(500).json({
-    error: "internal_server_error",
-    message: err.message,
-    requestId,
-  });
-});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/server.ts` around lines 312 - 325, The centralized error
handler currently registered via app.use((err: Error, req, res, next) => { ...
}) is added after server.listen(PORT, ...) which is unconventional and can miss
errors during startup/registration; move the app.use error-handling middleware
(the function that reads requestId, logs error and stack, checks
res.headersSent, and returns the 500 JSON with error/requestId) so it is
registered before calling server.listen(PORT, ...), ensuring it remains the last
middleware in the chain and preserves the res.headersSent guard and logging of
requestId and err.stack.
.github/workflows/cd.yml (1)

1-16: ⚠️ Potential issue | 🟠 Major

Serialize master deployments.

Multiple pushes to master can run this workflow in parallel, and whichever run finishes last wins—even if it is deploying an older SHA. Add a concurrency group so deploys queue instead of racing.

Suggested fix
 name: CD
+
+concurrency:
+  group: cd-${{ github.ref }}
+  cancel-in-progress: false
 
 on:
   push:
     branches: ["master"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cd.yml around lines 1 - 16, The CD workflow is vulnerable
to parallel master deploys; add a top-level concurrency block to serialize runs
(so pushes to "master" queue rather than race). Modify the workflow that begins
with "name: CD" to include a concurrency key (e.g., concurrency: group: cd-${{
github.ref }} and cancel-in-progress: false) so runs for the "master" branch are
queued; this will ensure the "CD" workflow for branch "master" (and the job
"ci") do not run concurrently and older runs cannot win.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8f9f6d35-756b-4aa0-8efd-fa422287fab9

📥 Commits

Reviewing files that changed from the base of the PR and between 997d60a and 684c6c9.

⛔ Files ignored due to path filters (4)
  • apps/bff/package-lock.json is excluded by !**/package-lock.json
  • apps/gateway/package-lock.json is excluded by !**/package-lock.json
  • apps/jobs-worker/package-lock.json is excluded by !**/package-lock.json
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (84)
  • .github/workflows/cd.yml
  • .github/workflows/chaos.yml
  • .github/workflows/ci.yml
  • .github/workflows/e2e.yml
  • .github/workflows/perf.yml
  • .github/workflows/terraform.yml
  • Makefile
  • apps/bff/package.json
  • apps/bff/src/datasources/redis.ts
  • apps/bff/src/lib/circuitBreaker.ts
  • apps/bff/src/resolvers.ts
  • apps/bff/src/server.ts
  • apps/dashboard/src/App.tsx
  • apps/dashboard/src/features/alerts/AlertRulesPage.tsx
  • apps/dashboard/src/features/audit/AuditLogPage.tsx
  • apps/dashboard/src/features/billing/BillingPage.tsx
  • apps/dashboard/src/features/devices/components/BulkImportModal.tsx
  • apps/dashboard/src/features/devices/components/DevicesPage.tsx
  • apps/dashboard/src/features/devices/components/RegisterDeviceModal.tsx
  • apps/dashboard/src/features/devices/hooks/useRegisterDevice.ts
  • apps/dashboard/src/features/onboarding/OnboardingPage.tsx
  • apps/dashboard/src/features/sso/SSOPage.tsx
  • apps/gateway/package.json
  • apps/gateway/src/lib/audit.ts
  • apps/gateway/src/lib/auth0Management.ts
  • apps/gateway/src/middleware/apiKey.ts
  • apps/gateway/src/middleware/csrf.ts
  • apps/gateway/src/middleware/rateLimiting.ts
  • apps/gateway/src/middleware/securityHeaders.ts
  • apps/gateway/src/routes/alertRules.ts
  • apps/gateway/src/routes/auditLog.ts
  • apps/gateway/src/routes/billing.ts
  • apps/gateway/src/routes/devicesImport.ts
  • apps/gateway/src/routes/sso.ts
  • apps/gateway/src/routes/tenants.ts
  • apps/gateway/src/server.ts
  • apps/gateway/src/services/device.ts
  • apps/gateway/src/services/stripe.ts
  • apps/gateway/tsconfig.json
  • apps/jobs-worker/package.json
  • apps/jobs-worker/src/connection.ts
  • apps/jobs-worker/src/db.ts
  • apps/jobs-worker/src/handlers/alert.ts
  • apps/jobs-worker/src/handlers/digest.ts
  • apps/jobs-worker/src/handlers/webhook.ts
  • apps/jobs-worker/src/index.ts
  • apps/jobs-worker/src/queues.ts
  • apps/read-model-builder/cmd/main.go
  • apps/read-model-builder/internal/consumer/message_handler.go
  • apps/read-model-builder/internal/projection/device_projection.go
  • apps/read-model-builder/internal/projection/telemetry_projection.go
  • apps/saga-orchestrator/internal/domain/saga.go
  • apps/saga-orchestrator/internal/health/checker.go
  • apps/saga-orchestrator/internal/orchestrator/provision_saga.go
  • apps/saga-orchestrator/internal/recovery/recovery_worker.go
  • apps/saga-orchestrator/internal/repository/postgres_saga_repository.go
  • apps/search-indexer/main.py
  • apps/telemetry-service/migrations/000003_add_tenants_billing.down.sql
  • apps/telemetry-service/migrations/000003_add_tenants_billing.up.sql
  • apps/telemetry-service/migrations/000004_sso_alert_rules_bulk.down.sql
  • apps/telemetry-service/migrations/000004_sso_alert_rules_bulk.up.sql
  • go.mod
  • infra/kafka/mirrormaker2-connector.json
  • infra/terraform/backend.tf
  • infra/terraform/environments/dr/main.tf
  • infra/terraform/environments/dr/providers.tf
  • infra/terraform/environments/prod/main.tf
  • infra/terraform/environments/prod/providers.tf
  • infra/terraform/environments/prod/variables.tf
  • infra/terraform/modules/aurora-global/main.tf
  • infra/terraform/modules/elasticache-global/main.tf
  • libs/health/health.go
  • scripts/load-tests/performance-budget.js
  • tests/chaos/README.md
  • tests/chaos/kafka-consumer-pause.sh
  • tests/chaos/network-partition.yaml
  • tests/chaos/pod-kill.yaml
  • tests/chaos/projection-lag.sh
  • tests/chaos/redis-outage.sh
  • tests/chaos/run-all.sh
  • tests/e2e/auth.spec.ts
  • tests/e2e/billing.spec.ts
  • tests/e2e/devices.spec.ts
  • tests/e2e/playwright.config.ts

Comment thread apps/bff/src/datasources/redis.ts
* OPEN → Postgres unhealthy, requests fail fast
* HALF_OPEN → testing recovery, one request allowed
*/
import { cache } from "../datasources/redis";

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the cache export from redis datasource
fd -t f "redis.ts" apps/bff/src/datasources --exec cat {}

Repository: pahuldeepp/GrainGuard-

Length of output: 2569


🏁 Script executed:

#!/bin/bash
# Find circuitBreaker.ts and check its contents
fd -t f "circuitBreaker.ts" apps/bff/src --exec cat {} \;

Repository: pahuldeepp/GrainGuard-

Length of output: 5104


Cache export signatures are correct, but cache keys lack the required tenantId.

The cache export from ../datasources/redis correctly provides get<T>(key: string): Promise<T | null> and set(key: string, value: any, ttlSeconds: number): Promise<void> with compatible signatures for this usage.

However, the cache keys used (cb:${this.name}) do not include tenantId, violating the coding guideline requirement that cache keys must include tenantId for RLS (Row-Level Security). Update cache key patterns to include tenant isolation:

// Instead of: `cb:${this.name}`
// Use: `tenant:${tenantId}:cb:${this.name}`

Pass tenantId through the CircuitBreaker constructor or context to ensure proper multi-tenant data isolation.
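
A minimal sketch of the tenant-scoped key construction this comment asks for. The class and cache shapes here are illustrative stand-ins, not the repo's actual circuitBreaker.ts or redis datasource API:

```typescript
// Hypothetical sketch: tenant-scoped cache keys for a circuit breaker.
// CacheLike stands in for the real redis datasource; all names are
// assumptions, not the actual implementation.
interface CacheLike {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

class TenantScopedBreaker {
  constructor(
    private readonly name: string,
    private readonly tenantId: string, // passed via constructor, per the suggestion
  ) {}

  // Key includes tenantId so breaker state is isolated per tenant.
  cacheKey(): string {
    return `tenant:${this.tenantId}:cb:${this.name}`;
  }

  persistState(cache: CacheLike, state: "CLOSED" | "OPEN" | "HALF_OPEN"): void {
    cache.set(this.cacheKey(), state);
  }
}
```

Whether per-tenant breakers are actually wanted here is a design question (see the later comment about shared infrastructure); this only shows the key shape.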

🧰 Tools
🪛 GitHub Actions: CI

[error] Command failed: 'npx tsc --noEmit'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 12, The cache keys used in the
circuit breaker (e.g., when calling cache.get/cache.set with `cb:${this.name}`)
must include tenant isolation; update the CircuitBreaker class to accept and
store a tenantId (pass it via the CircuitBreaker constructor or a context prop)
and change all cache key constructions to include it (e.g.,
`tenant:${tenantId}:cb:${this.name}`) before calling the imported cache.get and
cache.set so keys are tenant-scoped for RLS.

Comment thread apps/bff/src/lib/circuitBreaker.ts Outdated
// Failed during test — reopen immediately
this.state = "OPEN";
this.successCount = 0;
this.syncToRedis();

⚠️ Potential issue | 🟡 Minor

Floating promise: syncToRedis() is not awaited.

Same issue as in onSuccess() — the async call in onFailure() creates an unhandled promise.

🛠️ Proposed fix
-      this.syncToRedis();
+      this.syncToRedis().catch(() => { /* best-effort */ });
🧰 Tools
🪛 GitHub Actions: CI

[error] Command failed: 'npx tsc --noEmit'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 133, In onFailure() the async
call this.syncToRedis() is invoked without awaiting, creating a floating
promise; update onFailure() in the circuit breaker (the method named onFailure)
to await this.syncToRedis() (or return its promise) and handle/rethrow errors as
appropriate so the async operation is not unobserved—mirror the fix used for
onSuccess() to ensure syncToRedis() completion is properly awaited and any
errors are surfaced via the circuit breaker’s error handling.

Comment on lines +104 to +111
const bffOk = gqlRes.status === 200;
check(gqlRes, { "bff graphql 200": () => bffOk });
if (!bffOk) {
errorRate.add(1);
totalErrors.add(1);
} else {
errorRate.add(0);
}

🧹 Nitpick | 🔵 Trivial

Consider checking GraphQL response body for errors.

GraphQL endpoints typically return HTTP 200 even when there are query errors in the response body. The current check only validates HTTP status, which may miss application-level failures.

♻️ Suggested improvement
     const bffOk = gqlRes.status === 200;
-    check(gqlRes, { "bff graphql 200": () => bffOk });
-    if (!bffOk) {
+    const body = JSON.parse(gqlRes.body || "{}");
+    const hasGraphQLErrors = body.errors && body.errors.length > 0;
+    const bffSuccess = bffOk && !hasGraphQLErrors;
+    check(gqlRes, { 
+      "bff graphql 200": () => bffOk,
+      "bff graphql no errors": () => !hasGraphQLErrors,
+    });
+    if (!bffSuccess) {
       errorRate.add(1);
       totalErrors.add(1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/load-tests/performance-budget.js` around lines 104 - 111, The current
validation only checks HTTP status via bffOk and misses GraphQL-level errors;
update the logic around gqlRes, bffOk, and the check(...) call to also parse the
GraphQL response body (e.g., gqlRes.body or await gqlRes.json()) and treat the
presence of a non-empty errors array or an errors field as a failure; when such
errors are detected increment errorRate.add(1) and totalErrors.add(1) (and set
the check to fail), otherwise add errorRate.add(0) and mark the check as passed.
Ensure parsing is safe (handle already-parsed bodies) and keep the existing
metrics updates (errorRate/totalErrors) in the new error-path.

Comment thread tests/chaos/kafka-consumer-pause.sh
Comment thread tests/e2e/playwright.config.ts
pahuldeepp and others added 6 commits March 25, 2026 12:16
The e2e workflow requires Auth0 secrets (E2E_AUTH0_DOMAIN, E2E_TEST_USERNAME,
etc.) which aren't configured yet. Added a secrets-check gate so the workflow
passes with a skip message instead of failing on every PR push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ating

- New planEnforcement.ts middleware with Redis-cached tenant plan lookups
- Device quota enforced on POST /devices, POST /ingest, POST /devices/bulk
- Alert rule quota enforced on POST /alert-rules
- SSO (SAML/OIDC) gated behind Professional+ plans
- Audit log CSV export gated behind Professional+ plans
- Bulk import gated behind Starter+ plans
- Subscription status checks with 7-day grace period for past_due
- Plan cache invalidated immediately on Stripe webhook events

Plan limits: free=5dev, starter=10dev, professional=100dev, enterprise=unlimited

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
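
The plan limits listed above reduce to a lookup table plus a quota predicate. A hedged sketch; the real planEnforcement.ts middleware (Redis caching, subscription status, grace periods) is more involved, and these names are illustrative:

```typescript
// Illustrative sketch of the device-quota rule from the commit message:
// free=5, starter=10, professional=100, enterprise=unlimited.
type Plan = "free" | "starter" | "professional" | "enterprise";

const DEVICE_LIMITS: Record<Plan, number> = {
  free: 5,
  starter: 10,
  professional: 100,
  enterprise: Infinity, // unlimited
};

// Returns true when the tenant is already at or over its device cap.
function deviceQuotaExceeded(plan: Plan, currentCount: number): boolean {
  return currentCount >= DEVICE_LIMITS[plan];
}
```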
Replace SendGrid stub with Resend SDK. Supports all 4 email types
(alert, welcome, usage_warning, invoice) with HTML templates.
Falls back to console logging when RESEND_API_KEY is not set.
Permanent 4xx errors go straight to DLQ, 5xx errors retry with
jittered backoff up to 3 attempts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
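
The retry policy described above (permanent 4xx straight to DLQ, 5xx retried with jittered backoff up to 3 attempts) can be sketched as a pure decision function. Function names and the backoff shape are assumptions, not the worker's actual code:

```typescript
// Hypothetical sketch of the email retry policy: 4xx -> DLQ immediately,
// 5xx -> retry with jittered exponential backoff, give up after maxAttempts.
type RetryDecision = "dlq" | "retry" | "give-up";

function decide(statusCode: number, attempt: number, maxAttempts = 3): RetryDecision {
  if (statusCode >= 400 && statusCode < 500) return "dlq"; // permanent client error
  if (attempt >= maxAttempts) return "give-up";
  return "retry";
}

// Jittered exponential backoff: base * 2^attempt plus up to base of jitter.
function backoffMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt + Math.floor(Math.random() * baseMs);
}
```

Keeping the decision separate from the send call makes the DLQ/retry split trivially unit-testable.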
- New tests/e2e/fixtures/mockAuth.ts injects fake JWT into localStorage
  and mocks all API calls (GraphQL, REST, Auth0 endpoints) via page.route()
- All authenticated tests now run without E2E_TEST_USERNAME/PASSWORD
- e2e.yml no longer requires Auth0 test user secrets — runs on every PR
- No real backend or Auth0 tenant needed for E2E tests to pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
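
A minimal sketch of the kind of URL-matching predicate a fixture like mockAuth.ts might hand to Playwright's page.route() to decide which requests to stub. The patterns below are assumptions for illustration, not the fixture's actual list:

```typescript
// Hypothetical route-interception predicate: stub GraphQL, REST API, and
// Auth0 endpoints; let everything else (static assets, app bundle) through.
const MOCKED_PATTERNS: RegExp[] = [/\/graphql$/, /\/api\//, /auth0\.com/];

function shouldMock(url: string): boolean {
  return MOCKED_PATTERNS.some((re) => re.test(url));
}
```

In the real fixture this predicate would gate `page.route(url => shouldMock(url), handler)` so authenticated specs run with no backend or Auth0 tenant.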
Master branch had significant improvements across dashboard, gateway routes,
BFF, and jobs-worker. Keeping our additions (plan enforcement middleware,
Resend email, mock E2E auth, Redis cluster) alongside master's newer code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lution

- Fix Stripe API version to 2026-02-25.clover
- Fix audit event types (device.created, webhook_endpoint.created, etc.)
- Create auth0-mgmt.ts re-export alias for teamMembers compatibility
- Fix inviteToOrg call to use opts object shape
- Fix teamMembers role → roles, catch variable references
- Remove userAgent field not in AuditEvent type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 43

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
apps/gateway/src/services/device.ts (1)

39-45: ⚠️ Potential issue | 🟠 Major

Silent fallback to insecure credentials may mask production misconfigurations.

When /certs/ca.crt is missing, the code silently falls back to insecure credentials. In production, this could lead to unencrypted traffic without any indication of misconfiguration.

Consider either:

  1. Throwing an error in production environments when certificates are missing
  2. Logging a warning when falling back to insecure credentials
🛡️ Proposed fix to add environment-aware handling
-const credentials = !fs.existsSync("/certs/ca.crt")
-  ? grpc.credentials.createInsecure()
-  : grpc.credentials.createSsl(
+const isProduction = process.env.NODE_ENV === "production";
+const certsExist = fs.existsSync("/certs/ca.crt");
+
+if (isProduction && !certsExist) {
+  throw new Error("mTLS certificates required in production but /certs/ca.crt not found");
+}
+
+const credentials = !certsExist
+  ? grpc.credentials.createInsecure()
+  : grpc.credentials.createSsl(
       fs.readFileSync("/certs/ca.crt"),
       fs.readFileSync("/certs/gateway-client.key"),
       fs.readFileSync("/certs/gateway-client.crt")
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/services/device.ts` around lines 39 - 45, The current
credentials selection silently falls back to grpc.credentials.createInsecure()
when fs.existsSync("/certs/ca.crt") is false; change this to be
environment-aware by checking process.env.NODE_ENV (or another env var you use
for production). If in production and the certs are missing, throw a descriptive
error (e.g., "Missing TLS certs: /certs/ca.crt") instead of returning insecure
credentials; otherwise (non-production) call grpc.credentials.createInsecure()
but emit a clear warning via your logger. Update the credentials assignment
logic around fs.existsSync, grpc.credentials.createInsecure, and
grpc.credentials.createSsl so it either throws in production or logs a warning
then falls back in non-production.
apps/bff/src/lib/circuitBreaker.ts (1)

98-98: 🧹 Nitpick | 🔵 Trivial

Address ESLint warning: avoid any type.

Define a proper error type or use unknown with type guards.

Suggested fix
-    } catch (err: any) {
+    } catch (err: unknown) {
       // Only count real infrastructure failures (connection loss, timeout).
       // Bad SQL (invalid UUID, syntax errors) are app bugs, not Postgres outages.
+      const errorCode = (err as { code?: string })?.code;
+      const errorMessage = (err as { message?: string })?.message;
       const isInfraFailure =
-        err?.code === "ECONNREFUSED" ||
-        err?.code === "ENOTFOUND" ||
-        err?.code === "ETIMEDOUT" ||
-        err?.code === "ECONNRESET" ||
-        err?.message?.includes("Connection terminated") ||
-        err?.message?.includes("connect ECONNREFUSED");
+        errorCode === "ECONNREFUSED" ||
+        errorCode === "ENOTFOUND" ||
+        errorCode === "ETIMEDOUT" ||
+        errorCode === "ECONNRESET" ||
+        errorMessage?.includes("Connection terminated") ||
+        errorMessage?.includes("connect ECONNREFUSED");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 98, The catch block in
circuitBreaker.ts currently types the error as `any` (catch (err: any)); change
it to `unknown` and add a narrow type guard before using error properties:
update the catch parameter in the CircuitBreaker-related function (the catch in
the circuitBreaker.ts file) to `err: unknown`, then check e.g. `if (err
instanceof Error)` or test for `typeof err === 'object' && err !== null &&
'message' in err` before accessing message/stack; adjust any logging or
rethrowing to use the guarded value to satisfy ESLint.
apps/read-model-builder/internal/projection/telemetry_projection.go (2)

183-188: ⚠️ Potential issue | 🟡 Minor

Dead code: duplicate redisClient == nil check.

The second nil check (lines 186-188) will never execute because the first check (lines 183-185) already returns if redisClient is nil.

Proposed fix
 		if redisClient == nil {
 			return nil
 		}
-		if redisClient == nil {
-			return nil
-		}
 		pipe := redisClient.Pipeline()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 183 - 188, There is a duplicated nil-check for redisClient that
immediately returns nil twice; remove the redundant second "if redisClient ==
nil { return nil }" so only one early return remains, keeping the surrounding
logic intact (look for the redisClient variable and the function/method
containing those consecutive checks in telemetry_projection.go and delete the
duplicate check).

300-307: ⚠️ Potential issue | 🟠 Major

Missing rows.Err() check after rows.Close().

Per coding guidelines, you must check rows.Err() after closing rows to catch iteration errors. The check at line 380 is correct, but line 307 is missing this check.

Proposed fix
 		rows.Close()
+		if err := rows.Err(); err != nil {
+			observability.EventsRetry.Inc()
+			return err
+		}

 		newEvents := make([]parsedEvent, 0, len(events))

As per coding guidelines: "Check pgx rows.Err() after rows.Close()."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 300 - 307, The loop that builds newEventIDs iterates over rows but never
checks rows.Err() after calling rows.Close(); add a rowsErr := rows.Err() check
immediately after rows.Close() and handle/return/log the error the same way as
the existing check at line 380 (i.e., propagate or log the error consistent with
the function's error handling), referencing the same variables (rows,
newEventIDs) so any iteration/scan errors are caught.
apps/bff/src/datasources/redis.ts (2)

47-47: 🧹 Nitpick | 🔵 Trivial

Address ESLint warning: avoid any type for value parameter.

Use unknown or a generic constraint instead of any.

Suggested fix
-  async set(key: string, value: any, ttlSeconds: number): Promise<void> {
+  async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/redis.ts` at line 47, The set method currently
declares the value parameter as any — change it to a safe type by using either
unknown or a generic type parameter: replace async set(key: string, value: any,
ttlSeconds: number): Promise<void> with either async set(key: string, value:
unknown, ttlSeconds: number): Promise<void> or async set<T>(key: string, value:
T, ttlSeconds: number): Promise<void>; then update the function body to perform
explicit serialization/type-guards or casts where necessary and adjust call
sites to provide the generic type or to handle unknown (e.g., JSON.stringify for
storage or validate before use) so the method no longer accepts implicit any.

44-44: 🧹 Nitpick | 🔵 Trivial

Address ESLint warning: avoid any type.

Replace any with a more specific type to satisfy the linter.

Suggested fix
-    return results.map((r: any) => r ? JSON.parse(r) : null);
+    return results.map((r: string | null) => r ? JSON.parse(r) : null);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/redis.ts` at line 44, The map callback currently
uses the any type; replace (r: any) with a specific Redis response type such as
(r: string | null | undefined) and, if the surrounding function can return typed
objects, make the function generic (e.g., <T>) and cast JSON.parse(r) to T (or T
| null) so the line becomes results.map((r: string | null | undefined) => r ?
JSON.parse(r) as T : null), updating the function signature/return type
accordingly to remove any usage of any.
apps/gateway/src/lib/audit.ts (1)

26-35: ⚠️ Potential issue | 🟡 Minor

Remove duplicate AuditEventType member.

Line 28 repeats "device.registered" which already exists at Line 34. This is redundant and should be defined once.

Proposed cleanup
   | "device.created"
   | "device.creation_failed"
-  | "device.registered"
   | "webhook_endpoint.created"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/lib/audit.ts` around lines 26 - 35, The union type
AuditEventType contains a duplicated member "device.registered"; remove the
redundant entry so that "device.registered" appears only once in the
AuditEventType union (locate the union declaration in
apps/gateway/src/lib/audit.ts and remove the earlier/repeated
"device.registered" line), ensuring the list still includes all unique event
strings like "device.created", "device.creation_failed",
"webhook_endpoint.created", "api_key.created", "api_key.revoked", and
"device.deleted".
apps/gateway/src/server.ts (1)

288-294: ⚠️ Potential issue | 🟠 Major

Do not persist raw exception text in audit metadata.

Line 293 currently stores error: String(err). That can leak internal details or sensitive values into audit_logs. Store a stable error code/classification instead.

Proposed hardening
       logAuditEvent({
         eventType: "device.creation_failed",
         actorId: req.user?.sub || "unknown",
         tenantId: req.user?.tenantId || "00000000-0000-0000-0000-000000000000",
         resourceType: "device",
-        meta: { serialNumber: req.body?.serialNumber, error: String(err) },
+        meta: {
+          serialNumber: req.body?.serialNumber,
+          errorCode: "device_create_failed"
+        },
         ipAddress: req.ip,
       });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/server.ts` around lines 288 - 294, The audit call at
logAuditEvent for eventType "device.creation_failed" currently persists the raw
exception via meta: { error: String(err) } — replace this with a stable error
classification/code instead of the full error text to avoid leaking internal
details; e.g., derive a short errorCode from the thrown error (use err.name or
map errors in a new helper like classifyError(err) or set errorCode =
"validation_error" | "db_error" | "unknown_error") and store meta: {
serialNumber: req.body?.serialNumber, errorCode } (keep any sensitive details
out of meta and do not persist String(err)).
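
One possible shape for the classifyError helper this prompt suggests. The categories and matching rules are illustrative assumptions; the real mapping should reflect the errors the device-creation path actually throws:

```typescript
// Hypothetical classifyError: map a thrown value to a stable error code so
// raw exception text never lands in audit_logs.
function classifyError(err: unknown): string {
  if (err instanceof RangeError || err instanceof TypeError) return "validation_error";
  const code = (err as { code?: string })?.code;
  if (code === "ECONNREFUSED" || code === "ETIMEDOUT") return "db_error";
  return "unknown_error";
}
```

The audit call then stores `meta: { serialNumber, errorCode: classifyError(err) }` rather than `String(err)`.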
apps/gateway/src/routes/devicesImport.ts (1)

117-134: ⚠️ Potential issue | 🟠 Major

Quota check doesn't account for incoming device count.

The checkDeviceQuota(tenantId) call only validates whether the current device count already exceeds the limit (line 52 in planEnforcement.ts: if (currentCount >= maxDevices)). It doesn't verify that adding the total number of devices from the CSV would stay within the quota, so a bulk import can start successfully but fail midway once the limit is reached, resulting in a partial import and wasted resources.

Example: If currentCount = 8, maxDevices = 10, and total = 5 incoming devices, the check passes. But the import will fail after adding only 2 devices, leaving 3 devices unimported.

Pass the total count to the quota check and validate currentCount + total <= maxDevices before starting the import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/devicesImport.ts` around lines 117 - 134, The quota
check in the devices import route must account for the incoming CSV device
count: update the call to checkDeviceQuota in devicesImport.ts to pass the
parsed total number of devices (the variable named total from the CSV parsing)
and update the planEnforcement.checkDeviceQuota signature/logic to accept an
optional incomingCount/total parameter and validate currentCount + incomingCount
<= maxDevices (instead of just currentCount >= maxDevices); keep the same return
shape (allowed, message, currentCount, maxDevices, plan) and ensure the
devicesImport.ts code treats allowed=false as before and aborts before starting
the import.
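
The fixed check reduces to a single predicate. A sketch under the assumption that the quota lookup exposes currentCount and maxDevices (the function name is illustrative):

```typescript
// Hypothetical up-front bulk-import quota check: the whole batch must fit,
// otherwise reject before any devices are created.
function bulkImportAllowed(
  currentCount: number,
  incomingCount: number,
  maxDevices: number,
): boolean {
  return currentCount + incomingCount <= maxDevices;
}
```

With the example from the comment (currentCount = 8, maxDevices = 10, 5 incoming), this correctly rejects the import before it starts instead of failing after 2 devices.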
♻️ Duplicate comments (9)
apps/jobs-worker/src/connection.ts (1)

23-28: ⚠️ Potential issue | 🟠 Major

Reset stale RabbitMQ state on connection failure/close.

At Line 23 and Line 26, handlers only log and keep stale conn/ch references. After a broker drop, getChannel() can return a zombie channel and workers may fail until process restart. Clear both references and trigger your existing reconnect/health-failure path.

Proposed fix
       conn.on("error", (err: Error) => {
         console.error("[rabbitmq] connection error:", err.message);
+        ch = null;
+        conn = null;
+        // TODO: trigger reconnect or fail health check for restart
       });
       conn.on("close", () => {
         console.warn("[rabbitmq] connection closed unexpectedly");
+        ch = null;
+        conn = null;
+        // TODO: trigger reconnect or fail health check for restart
       });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/jobs-worker/src/connection.ts` around lines 23 - 28, The connection
error/close handlers currently only log and leave stale globals, so update the
conn.on("error", ...) and conn.on("close", ...) callbacks to clear the
module-level connection and channel references (e.g., set conn = null and ch =
null or the equivalents used in this file) and then invoke the existing
reconnect/health-failure path (the same flow used when a reconnect is needed or
when health check fails) so getChannel() cannot return a zombie channel;
specifically modify the handlers around conn and ch in connection.ts and call
the module's reconnection trigger or health-failure function immediately after
clearing the references.
apps/bff/src/lib/circuitBreaker.ts (4)

158-158: ⚠️ Potential issue | 🟡 Minor

Floating promise: syncToRedis() is not awaited.

Same issue — add .catch() for explicit error handling.

Proposed fix
-      this.syncToRedis();
+      this.syncToRedis().catch(() => { /* best-effort sync */ });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 158, Calling this.syncToRedis()
creates a floating promise; update the call site to either await
this.syncToRedis() (make the caller async if necessary) or attach a .catch(...)
to handle errors explicitly so failures are not swallowed — locate the call to
syncToRedis() and replace the bare invocation with an awaited call or add a
.catch that logs/handles errors using the existing logger in the surrounding
class/method.

143-143: ⚠️ Potential issue | 🟡 Minor

Floating promise: syncToRedis() is not awaited.

Same issue — add .catch() to suppress unhandled rejection warnings.

Proposed fix
-      this.syncToRedis();
+      this.syncToRedis().catch(() => { /* best-effort sync */ });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 143, The call to syncToRedis() is
a floating promise and must be handled to avoid unhandled rejections; update the
code where this.syncToRedis() is invoked (in the circuit breaker class/method)
to either await the call if the enclosing function is async or append a
.catch(...) to handle errors (log with the existing logger or swallow as
appropriate), ensuring any rejection from syncToRedis() is caught.

122-122: ⚠️ Potential issue | 🟡 Minor

Floating promise: syncToRedis() is not awaited.

Add .catch() to handle the unobserved promise explicitly.

Proposed fix
-        this.syncToRedis();
+        this.syncToRedis().catch(() => { /* best-effort sync */ });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` at line 122, The call to syncToRedis() is
a floating/unawaited promise; change the call site (where syncToRedis() is
invoked) to explicitly handle the promise by appending .catch(...) to capture
and log/handle errors (or await it if the caller is async and should wait);
reference the syncToRedis() invocation and ensure the handler uses the module
logger or processLogger to record the error so the promise rejection is not
unobserved.

40-67: ⚠️ Potential issue | 🟠 Major

Cache keys lack tenantId — evaluate if tenant isolation is needed here.

The cache keys (cb:${this.name}) don't include tenantId. While circuit breakers typically protect infrastructure shared across tenants (and shared state may be intentional), this violates the coding guideline requiring tenantId in cache keys.

If the circuit breaker is intentionally tenant-agnostic (protecting shared Postgres), document this exception. Otherwise, consider whether tenant-scoped circuit breakers are needed for isolation.

As per coding guidelines: "Verify cache keys include tenantId."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` around lines 40 - 67, The cache keys in
syncFromRedis and syncToRedis (used as `cb:${this.name}`) omit tenantId; either
make these keys tenant-scoped or explicitly document the tenant-agnostic intent.
Fix: if circuit breaker state must be isolated per tenant, include tenantId in
the key (e.g., `cb:${tenantId}:${this.name}`) where tenantId is obtained from
the class/context, updating both syncFromRedis and syncToRedis to use the same
composite key; otherwise add a clear code comment and/or README entry near the
CircuitBreaker class constructor (or the methods syncFromRedis/syncToRedis)
stating that the breaker is intentionally global/shared across tenants and why.
Ensure the unique symbols mentioned (syncFromRedis, syncToRedis, this.name /
constructor) are updated so tests and cache usage remain consistent.
apps/read-model-builder/cmd/main.go (1)

87-95: ⚠️ Potential issue | 🟠 Major

Filter empty Redis node entries before client creation.

The loop appends entries without checking for empty strings after trimming. Malformed REDIS_CLUSTER_NODES values (e.g., "host1:6379,,host2:6379") will include invalid empty addresses.

Proposed fix
 	if clusterNodes != "" {
 		for _, a := range strings.Split(clusterNodes, ",") {
-			addrs = append(addrs, strings.TrimSpace(a))
+			a = strings.TrimSpace(a)
+			if a != "" {
+				addrs = append(addrs, a)
+			}
 		}
+	}
+	if len(addrs) > 0 {
 		log.Info().Int("nodes", len(addrs)).Msg("Redis cluster mode")
 	} else {
 		addrs = []string{getenv("REDIS_ADDR", "redis:6379")}
 		log.Info().Str("addr", addrs[0]).Msg("Redis single-node mode")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/cmd/main.go` around lines 87 - 95, clusterNodes is
split and appended into addrs without filtering empty entries; update the loop
that processes clusterNodes (the strings.Split(clusterNodes, ",") iteration that
appends to addrs) to trim each entry and skip/apply continue for any resulting
empty string so only non-empty addresses are added, then log the count using
len(addrs) to reflect the filtered list; ensure the single-node branch (using
getenv("REDIS_ADDR", "redis:6379")) remains unchanged.
apps/bff/src/datasources/redis.ts (2)

39-45: ⚠️ Potential issue | 🟠 Major

getMany() will fail with CROSSSLOT errors in Redis Cluster mode.

Using multi().exec() across keys that hash to different cluster slots will throw a CROSSSLOT error. Replace with parallel GET calls or use hash tags to ensure keys hash to the same slot.

Safe fix using Promise.all
   async getMany<T>(keys: string[]): Promise<(T | null)[]> {
     if (keys.length === 0) return [];
-    const pipeline = client.multi();
-    keys.forEach(key => pipeline.get(key));
-    const results = await pipeline.exec();
-    return results.map((r: any) => r ? JSON.parse(r) : null);
+    const results = await Promise.all(keys.map((key) => client.get(key)));
+    return results.map((r) => (r ? JSON.parse(r) : null)) as (T | null)[];
   },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/redis.ts` around lines 39 - 45, getMany currently
uses client.multi() and pipeline.exec() (client.multi(), pipeline.get(),
pipeline.exec()) which will trigger CROSSSLOT errors in Redis Cluster when keys
span multiple slots; change getMany to perform parallel GETs instead (e.g., map
keys to client.get(key) and await Promise.all) then JSON.parse each non-null
result, preserving the early return for empty keys and the return type
Promise<(T|null)[]>.

8-14: ⚠️ Potential issue | 🔴 Critical

Fix createCluster API: use rootNodes instead of nodes.

The createCluster function in node-redis v4 expects rootNodes, not nodes. This causes a TypeScript compilation error (TS2353).

Suggested fix
-    return createCluster({ nodes });
+    return createCluster({
+      rootNodes: nodes.map(({ host, port }) => ({
+        url: `redis://${host}:${port}`,
+      })),
+    });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/redis.ts` around lines 8 - 14, The TypeScript error
comes from calling createCluster({ nodes }) — node-redis v4 expects rootNodes,
not nodes; update the call to createCluster({ rootNodes: nodes }) where nodes is
the array derived from REDIS_CLUSTER_NODES (keep the host and numeric port
parsing logic), so createCluster receives the correct property and resolves
TS2353 for the createCluster call.
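
Combining the two cluster fixes above (skip empty entries in REDIS_CLUSTER_NODES, pass rootNodes as url objects), the parsing step can be sketched as below. The "host:port,host:port" env-var format mirrors the review discussion; this is a hedged illustration, not the file's actual code:

```typescript
// Hypothetical parser for REDIS_CLUSTER_NODES ("host1:6379,host2:6379"):
// trims whitespace, drops empty entries (e.g. from a trailing or doubled
// comma), and yields the url-object shape node-redis v4 rootNodes expects.
function parseClusterNodes(raw: string): { url: string }[] {
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((addr) => ({ url: `redis://${addr}` }));
}
```

The result would feed `createCluster({ rootNodes: parseClusterNodes(env) })`, avoiding both the TS2353 error and the empty-address bug.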
apps/saga-orchestrator/internal/orchestrator/provision_saga.go (1)

132-135: ⚠️ Potential issue | 🟡 Minor

Fix the command name in this error path.

This branch marshals quota.allocate_device, but the returned error says tenant.detach_device, which will mislabel logs and alerts.

✏️ Suggested fix
-	cmdBytes, detachErr := json.Marshal(cmd)
-	if detachErr != nil {
-		return fmt.Errorf("marshal tenant.detach_device command: %w", detachErr)
+	cmdBytes, cmdErr := json.Marshal(cmd)
+	if cmdErr != nil {
+		return fmt.Errorf("marshal quota.allocate_device command: %w", cmdErr)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/saga-orchestrator/internal/orchestrator/provision_saga.go` around lines
132 - 135, The error message for the JSON marshal branch incorrectly references
"tenant.detach_device" while the code is marshaling the quota.allocate_device
command; update the fmt.Errorf call (the one returning fmt.Errorf("marshal
tenant.detach_device command: %w", detachErr)) to accurately name the command
being marshaled (e.g., "marshal quota.allocate_device command: %w") or derive
the command name from the cmd object; locate the marshal call using the
variables cmdBytes and detachErr in provision_saga.go to apply the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/chaos.yml:
- Around line 101-107: The upload step "Upload chaos logs" can fail when
tests/chaos/results/ doesn't exist; update that step (the one using
actions/upload-artifact@v4 named "Upload chaos logs") to either add the
parameter if-no-files-found: ignore to the with: block or add a short preceding
step that ensures the directory exists (e.g., run: mkdir -p
tests/chaos/results/) so the artifact upload never errors out when no files are
present.
- Around line 45-48: The workflow uses unquoted $HOME in shell commands (mkdir
-p $HOME/.kube, redirection to > $HOME/.kube/config, chmod 600
$HOME/.kube/config) which can cause word-splitting; update those occurrences to
use quoted expansions (e.g., "$HOME/.kube" and "$HOME/.kube/config") so mkdir,
the echo/base64 redirection, and chmod operate on the intended single path even
if $HOME contains spaces or special chars.
- Around line 97-99: The "Run — network-partition" workflow step is missing the
NAMESPACE environment variable; update that step to pass NAMESPACE into the
job's environment (matching how the "pod-kill" step does it) by adding an env
entry that sets NAMESPACE from the workflow input (e.g., use the
github.event.inputs namespace value) so Chaos Toolkit can read the target
namespace.
- Around line 70-72: The "Run — pod-kill" step is missing an explicit NAMESPACE
environment variable (and the same applies to the "network-partition" step);
update those steps to add an env block that passes the workflow's namespace
input into the step (e.g., env: NAMESPACE: ${{ github.event.inputs.namespace }},
or whichever input/var interpolation is used elsewhere in the workflow) so the
pod-kill.yaml experiment can read NAMESPACE at runtime.

In @.github/workflows/perf.yml:
- Around line 82-87: The workflow currently runs k6 with
BFF_URL=http://localhost:4000 but never starts the BFF service, and the test file
scripts/load-tests/performance-budget.js defaults to port 8086; add steps
mirroring the gateway startup (install dependencies, build/start the BFF in the
background, and wait for readiness) before the "Run performance budget" step,
ensure the BFF process listens on port 8086 (or update BFF_URL to match the
port used in performance-budget.js), and export the correct BFF_URL (e.g.,
http://localhost:8086) so k6 can reach the running BFF during tests.
- Around line 17-37: The postgres and redis service blocks in the workflow lack
required ports mappings, so add a ports key under the postgres service mapping
host 5432 to container 5432 and under the redis service mapping host 6379 to
container 6379 (i.e., add ports: - "5432:5432" to the postgres block and ports:
- "6379:6379" to the redis block) so the runner can reach localhost:5432 and
localhost:6379 from the job.

In @.github/workflows/terraform.yml:
- Around line 83-117: The current Terraform "apply" job (job name: apply) is
configured with environment: dev and only applies on push to master, so
production is never applied; either add an inline comment above the apply job
explaining the intended prod deployment/approval process, or add a new separate
apply job (e.g., apply_prod) that mirrors the existing apply steps but sets
environment: prod, a stricter approval gate (GitHub environment protection), and
the correct working-directory/TF vars for prod; ensure both jobs use the same
configure-aws-credentials, setup-terraform, terraform init and terraform apply
steps and that conditions (if: github.ref == 'refs/heads/master') and secrets
(TF_VAR_db_password or prod secret) are adjusted appropriately.
- Around line 13-15: The YAML env block has an extra space after the colon for
the AWS_REGION key causing formatting lint errors; update the env mapping (keys
TF_VERSION and AWS_REGION) to use a single colon-space separator (e.g.,
AWS_REGION: "us-east-1") so there are no extra spaces after the colon and the
YAML is properly formatted.
- Around line 23-25: The CI matrix under strategy.matrix.env currently lists
only [dev, prod] but the repo adds infra/terraform/environments/dr/, so update
the matrix to include "dr" (i.e., change strategy.matrix.env to [dev, dr, prod])
or, if intentional, add a brief comment/documentation in the workflow explaining
why DR is excluded and/or add conditional logic to skip DR; update the
workflow's strategy.matrix.env entry accordingly to ensure DR gets planned or
document the exclusion.
- Around line 44-47: Add Terraform plugin caching to the workflow: insert a
cache step that preserves the Terraform plugin directory between runs and set
the TF_PLUGIN_CACHE_DIR environment variable so the hashicorp/setup-terraform
step uses the cache; reference the existing step name "Set up Terraform" and the
action "hashicorp/setup-terraform@v3" when locating where to add the cache and
env var, and ensure the cache key uses TF version and operating system to avoid
conflicts.

In `@apps/gateway/src/middleware/planEnforcement.ts`:
- Around line 185-208: The middleware uses non-null assertion req.user! (e.g.,
in enforceDeviceQuota) which can throw if authMiddleware wasn't run; change to
defensively check req.user?.tenantId like requireActiveSubscription does: if
missing, return res.status(401).json({ error: "unauthenticated" }) (or call next
with an auth error) and avoid proceeding. Replace all occurrences of req.user!
in enforceDeviceQuota, enforceBulkDeviceQuota, enforceAlertRuleQuota, and
requireFeature with an optional check and early unauthorized response before
using tenantId or feature info.

In `@apps/gateway/src/routes/billing.ts`:
- Line 51: Replace the unsafe any cast on (sub as any).current_period_end with a
proper type assertion or type guard: treat sub as a Stripe.Subscription extended
with the optional current_period_end property (e.g., use a type like
Stripe.Subscription & { current_period_end?: number }) or perform a runtime
check before reading the field; alternatively update the Stripe SDK types if
they are out-of-date so current_period_end is available on the
Stripe.Subscription type—make this change where currentPeriodEnd is assigned
from sub.

In `@apps/gateway/src/routes/teamMembers.ts`:
- Around line 79-92: The handler currently queries tenants and only calls
inviteToOrg when rows[0]?.auth0_org_id exists but still returns invited: true
even when the org id is missing; change the logic so invited is only true when
an Auth0 invite was actually attempted and succeeded. Specifically, in the
pool.query.then(...) block (where inviteToOrg is called), return a boolean or
resolve a value indicating success from inviteToOrg and set the response's
invited flag from that result; if rows[0]?.auth0_org_id is falsy, explicitly set
invited: false (or return false) and avoid reporting invited: true. Ensure you
handle promise rejection from inviteToOrg (the existing .catch) so failures
produce invited: false and only a successful inviteToOrg result yields invited:
true.

In `@apps/saga-orchestrator/internal/repository/postgres_saga_repository.go`:
- Around line 17-23: PostgresSagaRepository.IsEventProcessed and
PostgresSagaRepository.MarkEventProcessed are calling pgxpool methods with the
raw context and can block indefinitely; update both methods to call
withTimeout(ctx) and use the returned context (and defer the cancel func) when
invoking pool.Query/QueryRow/Exec so the same dbQueryTimeout from withTimeout is
applied; locate the calls inside the methods (references:
PostgresSagaRepository.IsEventProcessed,
PostgresSagaRepository.MarkEventProcessed) and replace the direct use of ctx
with the ctx returned by withTimeout, ensuring you defer cancel() immediately
after obtaining it.

In `@apps/search-indexer/main.py`:
- Around line 74-90: The telemetry upsert unconditionally sets status to
"active", causing a race with index_device which sets "registered"; change the
self.es.update call to preserve a higher-priority status by using an
Elasticsearch scripted update/upsert: in the update invoked by self.es.update
(for DEVICE_INDEX) add a script that only sets ctx._source.status =
params.status when the existing ctx._source.status is not "registered" (or is
missing), and pass params.status="active" from the telemetry path; this ensures
index_device's "registered" wins if it has already been set while still
upserting telemetry fields (temperature, humidity, recorded_at).

In `@go.mod`:
- Line 24: Update the gRPC dependency to patch the authorization bypass: change
the google.golang.org/grpc module version from v1.79.2 to v1.79.3 in go.mod,
then run go get google.golang.org/grpc@v1.79.3 (or go mod tidy) to update go.sum
and vendor files; ensure any CI build uses the updated module cache so the
project imports the patched google.golang.org/grpc v1.79.3.
- Line 3: Update the Go toolchain version in go.mod by replacing the current "go
1.25.0" directive with "go 1.26.1" (edit the go.mod entry containing the literal
go 1.25.0), then run go mod tidy and rebuild/tests to ensure compatibility and
update any CI/job configurations that pin the older toolchain to use Go 1.26.1.
- Around line 21-23: Update the otlptracegrpc module version to match the other
OpenTelemetry modules: change the go.mod entry for
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc" from v1.41.0
to v1.42.0 so it aligns with "go.opentelemetry.io/otel" and
"go.opentelemetry.io/otel/sdk" and avoid version mismatch issues.
- Line 9: Add a brief rationale to the PR description explaining why the go.mod
versions were downgraded: state the specific reasons for choosing
github.com/golang-migrate/migrate/v4 v4.18.3 and
github.com/testcontainers/testcontainers-go v0.37.0 (e.g., compatibility with
our Go toolchain or other transitive deps, observed runtime/test failures,
performance/regression issues, or CI constraints), reference the exact modules
named (github.com/golang-migrate/migrate/v4 and
github.com/testcontainers/testcontainers-go) and mention any investigation
performed (logs, failing tests, or dependency conflicts) so reviewers can
understand and reproduce the decision.

In `@infra/kafka/mirrormaker2-connector.json`:
- Around line 22-28: The connector currently re-uses
${KAFKA_USERNAME}/${KAFKA_PASSWORD} for both source and target in the
"source.cluster.sasl.jaas.config" and "target.cluster.sasl.jaas.config" fields;
change these to use distinct environment variables (e.g.,
${SOURCE_KAFKA_USERNAME}, ${SOURCE_KAFKA_PASSWORD} and ${TARGET_KAFKA_USERNAME},
${TARGET_KAFKA_PASSWORD}) and update any deployment/secret manifests to
provision two separate secrets/credentials and map them to those new env names
so each cluster has its own credentials.
- Around line 30-37: Add explicit producer and consumer tuning entries to the
connector JSON to improve cross-region reliability: include producer settings
such as "producer.retries" (e.g., high value), "producer.acks" ("all"),
"producer.max.in.flight.requests.per.connection" (1 or 5 depending on
idempotence), "producer.request.timeout.ms", "producer.retry.backoff.ms", and
"producer.delivery.timeout.ms", and consumer settings like
"consumer.request.timeout.ms" and "consumer.max.poll.interval.ms"; place them
alongside the existing keys (e.g., near "checkpoints.topic.replication.factor"
and "emit.heartbeats.interval.seconds") so the mirror-maker connector will use
these tuned producer/consumer properties during cross-region replication.

In `@infra/terraform/backend.tf`:
- Around line 4-6: Update the bootstrap comments around the existing "aws s3api
create-bucket" example (bucket name "grainguard-terraform-state") to include a
follow-up "aws s3api put-public-access-block" command that sets BlockPublicAcls,
IgnorePublicAcls, BlockPublicPolicy and RestrictPublicBuckets to true for that
bucket, so the state bucket is blocked from public access; include the bucket
identifier "grainguard-terraform-state" in the command example and a short note
that this is required for Terraform state security.

In `@infra/terraform/environments/dr/main.tf`:
- Line 15: Extract the hardcoded CIDR into a Terraform local by adding a locals
block (e.g., locals { vpc_cidr = "10.2.0.0/16" }) and replace the four
occurrences of the literal "10.2.0.0/16" with local.vpc_cidr in this file; also
update module calls or variables for aurora_dr, redis_dr, and msk_dr to consume
the local (pass local.vpc_cidr to any module input like vpc_cidr or change
module variable defaults) so all references use the single local.vpc_cidr value.
- Around line 5-9: Move the inline variable declarations out of main.tf into a
new variables.tf to match prod structure: create a variables.tf containing the
variable blocks for aurora_global_cluster_id, redis_global_datastore_id,
db_password (preserve sensitive = true), project (preserve default =
"grainguard"), and aws_region (preserve default = "us-west-2"), then remove
those variable blocks from main.tf so main.tf only contains resources and module
calls; ensure variable names and attributes remain identical to avoid breaking
references.
- Around line 67-69: The output name dr_redis_primary_endpoint is misleading for
a DR (secondary) cluster; update the Terraform output to either rename it to
dr_redis_endpoint (or dr_redis_writer_endpoint) and/or add a descriptive text
field that clarifies it is the cluster's write (primary) endpoint for the DR
cluster; locate the output definition referencing
module.redis_dr.primary_endpoint and change the output name and/or add
description = "DR Redis cluster write/primary endpoint (used for failover)" so
the intent is explicit.

In `@infra/terraform/environments/prod/main.tf`:
- Line 8: Extract the hardcoded CIDR into a Terraform local and replace the
literal occurrences: create a local named vpc_cidr (e.g., local.vpc_cidr) and
use that value instead of the string "10.1.0.0/16" in the VPC definition and in
the module blocks referencing aurora, redis, and msk; update each module input
(the existing vpc_cidr parameter) to reference local.vpc_cidr so future changes
are made in one place.
- Around line 57-60: Add descriptive "description" attributes to the Terraform
outputs so external consumers understand their purpose: update the outputs
aurora_global_cluster_id, redis_global_datastore_id, and msk_bootstrap_brokers
to include a brief description string (e.g., describing that
aurora_global_cluster_id is the Global Cluster ID for cross-region Aurora,
redis_global_datastore_id is the Global Datastore ID for Redis replication, and
msk_bootstrap_brokers is the bootstrap broker list for the MSK cluster) using
the Terraform output "description" field.

In `@infra/terraform/environments/prod/variables.tf`:
- Around line 1-3: Add descriptive text and validation to the Terraform
variables: add a description for variable "project" explaining its purpose, a
description for "aws_region" plus a validation block that restricts values to
the allowed AWS regions (e.g., a list of accepted region strings) and provides a
helpful error message, and a description for the sensitive variable
"db_password"; reference the variables by name ("project", "aws_region",
"db_password") and implement the validation under "aws_region" using Terraform's
validation/condition/message fields.

In `@infra/terraform/modules/aurora-global/main.tf`:
- Around line 107-108: The module enables Enhanced Monitoring by setting
monitoring_interval = 60 but does not provide a monitoring role ARN; add a new
variable monitoring_role_arn (type string, default ""), and update the RDS
instance resource to pass monitoring_role_arn = var.monitoring_role_arn != "" ?
var.monitoring_role_arn : null so AWS receives the required IAM role for
Enhanced Monitoring; alternatively set monitoring_interval = 0 to disable
Enhanced Monitoring if you don't want to supply a role.

In `@infra/terraform/modules/elasticache-global/main.tf`:
- Line 14: The variable block for node_type uses invalid HCL semicolon
separators; update the variable "node_type" declaration so its attributes use
newline-separated assignments (remove the semicolon between type and default)
within the variable "node_type" block to conform to HCL syntax, e.g. place type
= string and default = "cache.r6g.large" on separate lines inside the variable
"node_type" block (note that type takes an unquoted type expression such as
string, not a quoted "string").
- Line 90: The output "global_datastore_id" uses a redundant nested ternary:
replace the conditional expression so it returns var.global_datastore_id when
var.is_secondary is true, otherwise return the id from
aws_elasticache_global_replication_group.this[0].id directly; update the
expression referencing output "global_datastore_id", var.is_secondary,
var.global_datastore_id and aws_elasticache_global_replication_group.this to
remove the length(... ) > 0 check.

In `@Makefile`:
- Around line 72-73: The Makefile target test-e2e is invoking Playwright in the
wrong directory (it runs `cd apps/dashboard && npx playwright test`) so it
misses the new comprehensive tests; update the target to run Playwright from the
tests/e2e directory instead by changing the command to `cd tests/e2e && npx
playwright test` (keep the target name test-e2e and the use of `npx playwright
test` so CI/local behavior matches the GitHub Actions e2e workflow).
- Around line 149-157: The health Makefile target masks curl failures because
the pipeline exit status is from jq; wrap each piped probe in a shell that
enables pipefail so curl errors propagate — replace lines like `@curl -sf
http://localhost:3000/health | jq . || echo "Gateway: DOWN"` with a command that
runs under bash with pipefail, e.g. `@bash -c 'set -o pipefail; curl -sf
http://localhost:3000/health | jq .' || echo "Gateway: DOWN"`, and do the same
for the readiness, BFF (http://localhost:4000/health) and Ingest
(http://localhost:3001/health) checks in the health target so the || echo
fallback runs on transport failures.

In `@scripts/load-tests/performance-budget.js`:
- Around line 65-68: COMMON_HEADERS currently always includes Authorization:
`Bearer ${JWT}`, which yields "Bearer " when JWT is empty; change the header
construction so Authorization is only set when JWT is non-empty (e.g.,
conditionally add Authorization to COMMON_HEADERS when JWT truthy) to avoid
sending a malformed "Bearer " value; update the code that references
COMMON_HEADERS to work with the conditional header (symbol: COMMON_HEADERS,
variable: JWT).

In `@tests/chaos/network-partition.yaml`:
- Around line 22-24: The test currently only checks rollout health via the
existing probe named "telemetry-service-healthy" and sleeps during post-actions,
but does not assert Kafka consumer lag; add explicit lag measurements and
assertions both before the partition and after recovery by introducing probes
(e.g., "kafka-consumer-lag-before" and "kafka-consumer-lag-after") that query
the consumer lag metric for the telemetry topic/consumer group and fail if lag
exceeds the acceptable threshold, and wire these probes into the probes list and
post-actions so the workflow records and enforces low lag pre-fault and
successful recovery post-fault (reuse the existing "telemetry-service-healthy"
probe for rollout checks but ensure the new lag probes run at the appropriate
stages).
- Around line 60-64: Remove the unsupported "--stdin" argument from the kubectl
apply arguments: in the args sequence that currently contains "apply", "-f",
"-", and "--stdin", delete the "--stdin" entry and keep "apply", "-f", "-" so
kubectl reads the manifest from stdin; leave the existing stdin: | block intact
so the manifest is still piped into kubectl.

In `@tests/chaos/projection-lag.sh`:
- Around line 78-81: The test currently only warns when the alert_fired flag is
false (alert never fired) which permits false-positive CI; update the logic
around alert_fired to fail the test (exit non-zero) when alerts do not fire
unless an explicit toggle is set: add a STRICT_ALERT_CHECK (or similar)
environment variable check and if STRICT_ALERT_CHECK is true (or set to "1")
then call a hard-fail (exit 1) when alert_fired is false, otherwise keep the
existing warn "ProjectionLagHigh alert did NOT fire..." behavior; reference the
alert_fired variable and the warn function to locate the code and implement the
conditional exit.
- Around line 57-87: The script unconditionally scales read-model-builder to 0
then back to 1, risking leaving the deployment down if the script exits early;
capture the current replica count before scaling (e.g., query kubectl get
deployment/read-model-builder -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
into a variable like original_replicas, set a trap on EXIT (or ERR) that always
restores that count using kubectl scale and waits with kubectl rollout status,
and update the final restore logic to use original_replicas instead of a
hard-coded 1; reference the existing kubectl scale/deployment/read-model-builder
commands, the NAMESPACE variable, and the later kubectl rollout status call when
adding the trap and restore logic.

In `@tests/chaos/redis-outage.sh`:
- Around line 77-79: The panic_count log window currently uses
--since="${OUTAGE_SECONDS}s" in the kubectl logs call (captured into the
panic_count variable) but that flag is evaluated at the time the command runs
(mid-outage), so it actually looks further back than the outage start; update
the nearby comment to explicitly state this behavior and the intent (i.e., we
intentionally allow a larger lookback to capture early panic messages) or, if
you prefer strict outage-only logs, change the approach to record a timestamp at
outage start and use --since-time instead; reference the panic_count assignment,
the kubectl logs invocation, and the OUTAGE_SECONDS/NAMESPACE variables when
making the comment or code change.
- Around line 84-85: The sleep call can receive a negative value when
OUTAGE_SECONDS < 15; change the block around the log/sleep to compute a
non-negative wait value (e.g., remaining=$(( OUTAGE_SECONDS > 15 ?
OUTAGE_SECONDS - 15 : 0 ))), then call sleep "$remaining" (or skip sleeping if
remaining is 0) and update the log message to reflect the actual waited seconds;
modify the lines referencing OUTAGE_SECONDS and sleep to use this safe remaining
variable.
- Around line 51-55: The script scales Redis to 0 but has no cleanup on
interruption; capture the current replica count into a variable (e.g.,
PREV_REPLICAS) using kubectl get for "$REDIS_DEPLOY" in "$NAMESPACE" before
scaling, add a restore function (e.g., restore_redis) that scales the deployment
back to PREV_REPLICAS (or 1 if capture fails) and logs the action, and register
a trap on EXIT (trap 'restore_redis' EXIT) so Redis is restored whether the
script completes or is interrupted; reference the REDIS_DEPLOY and NAMESPACE
variables and the kubectl scale calls to locate where to add the capture,
restore function, and trap.

In `@tests/e2e/devices.spec.ts`:
- Around line 69-73: The test "Refresh button triggers refetch" only checks
visibility and clickability of refreshBtn and doesn't verify a refetch occurred;
update the test to await a network request/response triggered by the click
(e.g., using page.waitForResponse or page.waitForRequest matching the devices
API endpoint) or assert a post-refresh UI change (for example a changed
timestamp, updated device list item, or a refreshed request counter) after
clicking refreshBtn so the test actually validates the refetch side-effect.
- Around line 30-34: The test "modal closes on backdrop click" uses a hard-coded
page.mouse.click(10, 10); replace this brittle coordinate click with a
bounds-based or semantic backdrop click: obtain the dialog locator from
page.getByRole("dialog") and compute its bounding box to click a point outside
those bounds (or, if available, target a backdrop overlay locator such as
page.locator('.backdrop') or page.getByTestId('backdrop') and click it); update
the test body in the "modal closes on backdrop click" test to perform the
outside-click via the dialog's bounding box or the backdrop locator instead of
fixed coordinates so the dismissal is robust to layout changes.

---

Outside diff comments:
In `@apps/bff/src/datasources/redis.ts`:
- Line 47: The set method currently declares the value parameter as any — change
it to a safe type by using either unknown or a generic type parameter: replace
async set(key: string, value: any, ttlSeconds: number): Promise<void> with
either async set(key: string, value: unknown, ttlSeconds: number): Promise<void>
or async set<T>(key: string, value: T, ttlSeconds: number): Promise<void>; then
update the function body to perform explicit serialization/type-guards or casts
where necessary and adjust call sites to provide the generic type or to handle
unknown (e.g., JSON.stringify for storage or validate before use) so the method
no longer accepts implicit any.
- Line 44: The map callback currently uses the any type; replace (r: any) with a
specific Redis response type such as (r: string | null | undefined) and, if the
surrounding function can return typed objects, make the function generic (e.g.,
<T>) and cast JSON.parse(r) to T (or T | null) so the line becomes
results.map((r: string | null | undefined) => r ? JSON.parse(r) as T : null),
updating the function signature/return type accordingly to remove any usage of
any.

In `@apps/bff/src/lib/circuitBreaker.ts`:
- Line 98: The catch block in circuitBreaker.ts currently types the error as
`any` (catch (err: any)); change it to `unknown` and add a narrow type guard
before using error properties: update the catch parameter in the
CircuitBreaker-related function (the catch in the circuitBreaker.ts file) to
`err: unknown`, then check e.g. `if (err instanceof Error)` or test for `typeof
err === 'object' && err !== null && 'message' in err` before accessing
message/stack; adjust any logging or rethrowing to use the guarded value to
satisfy ESLint.

In `@apps/gateway/src/lib/audit.ts`:
- Around line 26-35: The union type AuditEventType contains a duplicated member
"device.registered"; remove the redundant entry so that "device.registered"
appears only once in the AuditEventType union (locate the union declaration in
apps/gateway/src/lib/audit.ts and remove the earlier/repeated
"device.registered" line), ensuring the list still includes all unique event
strings like "device.created", "device.creation_failed",
"webhook_endpoint.created", "api_key.created", "api_key.revoked", and
"device.deleted".

In `@apps/gateway/src/routes/devicesImport.ts`:
- Around line 117-134: The quota check in the devices import route must account
for the incoming CSV device count: update the call to checkDeviceQuota in
devicesImport.ts to pass the parsed total number of devices (the variable named
total from the CSV parsing) and update the planEnforcement.checkDeviceQuota
signature/logic to accept an optional incomingCount/total parameter and validate
currentCount + incomingCount <= maxDevices (instead of just currentCount >=
maxDevices); keep the same return shape (allowed, message, currentCount,
maxDevices, plan) and ensure the devicesImport.ts code treats allowed=false as
before and aborts before starting the import.

In `@apps/gateway/src/server.ts`:
- Around line 288-294: The audit call at logAuditEvent for eventType
"device.creation_failed" currently persists the raw exception via meta: { error:
String(err) } — replace this with a stable error classification/code instead of
the full error text to avoid leaking internal details; e.g., derive a short
errorCode from the thrown error (use err.name or map errors in a new helper like
classifyError(err) or set errorCode = "validation_error" | "db_error" |
"unknown_error") and store meta: { serialNumber: req.body?.serialNumber,
errorCode } (keep any sensitive details out of meta and do not persist
String(err)).

In `@apps/gateway/src/services/device.ts`:
- Around line 39-45: The current credentials selection silently falls back to
grpc.credentials.createInsecure() when fs.existsSync("/certs/ca.crt") is false;
change this to be environment-aware by checking process.env.NODE_ENV (or another
env var you use for production). If in production and the certs are missing,
throw a descriptive error (e.g., "Missing TLS certs: /certs/ca.crt") instead of
returning insecure credentials; otherwise (non-production) call
grpc.credentials.createInsecure() but emit a clear warning via your logger.
Update the credentials assignment logic around fs.existsSync,
grpc.credentials.createInsecure, and grpc.credentials.createSsl so it either
throws in production or logs a warning then falls back in non-production.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go`:
- Around line 183-188: There is a duplicated nil-check for redisClient that
immediately returns nil twice; remove the redundant second "if redisClient ==
nil { return nil }" so only one early return remains, keeping the surrounding
logic intact (look for the redisClient variable and the function/method
containing those consecutive checks in telemetry_projection.go and delete the
duplicate check).
- Around line 300-307: The loop that builds newEventIDs iterates over rows but
never checks rows.Err() after calling rows.Close(); add a rowsErr := rows.Err()
check immediately after rows.Close() and handle/return/log the error the same
way as the existing check at line 380 (i.e., propagate or log the error
consistent with the function's error handling), referencing the same variables
(rows, newEventIDs) so any iteration/scan errors are caught.

---

Duplicate comments:
In `@apps/bff/src/datasources/redis.ts`:
- Around line 39-45: getMany currently uses client.multi() and pipeline.exec()
(client.multi(), pipeline.get(), pipeline.exec()) which will trigger CROSSSLOT
errors in Redis Cluster when keys span multiple slots; change getMany to perform
parallel GETs instead (e.g., map keys to client.get(key) and await Promise.all)
then JSON.parse each non-null result, preserving the early return for empty keys
and the return type Promise<(T|null)[]>.
- Around line 8-14: The TypeScript error comes from calling createCluster({
nodes }) — node-redis v4 expects rootNodes, not nodes; update the call to
createCluster({ rootNodes: nodes }) where nodes is the array derived from
REDIS_CLUSTER_NODES (keep the host and numeric port parsing logic), so
createCluster receives the correct property and resolves TS2353 for the
createCluster call.

In `@apps/bff/src/lib/circuitBreaker.ts`:
- Line 158: Calling this.syncToRedis() creates a floating promise; update the
call site to either await this.syncToRedis() (make the caller async if
necessary) or attach a .catch(...) to handle errors explicitly so failures are
not swallowed — locate the call to syncToRedis() and replace the bare invocation
with an awaited call or add a .catch that logs/handles errors using the existing
logger in the surrounding class/method.
- Line 143: The call to syncToRedis() is a floating promise and must be handled
to avoid unhandled rejections; update the code where this.syncToRedis() is
invoked (in the circuit breaker class/method) to either await the call if the
enclosing function is async or append a .catch(...) to handle errors (log with
the existing logger or swallow as appropriate), ensuring any rejection from
syncToRedis() is caught.
- Line 122: The call to syncToRedis() is a floating/unawaited promise; change
the call site (where syncToRedis() is invoked) to explicitly handle the promise
by appending .catch(...) to capture and log/handle errors (or await it if the
caller is async and should wait); reference the syncToRedis() invocation and
ensure the handler uses the module logger or processLogger to record the error
so the promise rejection is not unobserved.
- Around line 40-67: The cache keys in syncFromRedis and syncToRedis (used as
`cb:${this.name}`) omit tenantId; either make these keys tenant-scoped or
explicitly document the tenant-agnostic intent. Fix: if circuit breaker state
must be isolated per tenant, include tenantId in the key (e.g.,
`cb:${tenantId}:${this.name}`) where tenantId is obtained from the
class/context, updating both syncFromRedis and syncToRedis to use the same
composite key; otherwise add a clear code comment and/or README entry near the
CircuitBreaker class constructor (or the methods syncFromRedis/syncToRedis)
stating that the breaker is intentionally global/shared across tenants and why.
Ensure the unique symbols mentioned (syncFromRedis, syncToRedis, this.name /
constructor) are updated so tests and cache usage remain consistent.
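
The two circuitBreaker.ts points above can be sketched together as follows. All names here (`breakerKey`, `BreakerLogger`, `recordFailure`, the failing `syncToRedis`) are hypothetical stand-ins, not the actual class internals:

```typescript
// Tenant-scoped cache key, so two tenants never share breaker state.
function breakerKey(tenantId: string, name: string): string {
  return `cb:${tenantId}:${name}`;
}

interface BreakerLogger {
  error(msg: string, err: unknown): void;
}

class CircuitBreakerSketch {
  constructor(private name: string, private logger: BreakerLogger) {}

  private async syncToRedis(): Promise<void> {
    // Stand-in for the real Redis write; fails here to show the .catch path.
    throw new Error("redis unavailable");
  }

  recordFailure(): void {
    // Before: this.syncToRedis();  // floating promise, rejection unobserved
    // After: handle the rejection explicitly with the class logger.
    this.syncToRedis().catch((err) =>
      this.logger.error(`circuit breaker ${this.name}: state sync failed`, err)
    );
  }
}
```

If the breaker is intentionally tenant-agnostic, the alternative is to keep `cb:${name}` but document that choice next to the constructor, as the comment suggests.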

In `@apps/jobs-worker/src/connection.ts`:
- Around line 23-28: The connection error/close handlers currently only log and
leave stale globals, so update the conn.on("error", ...) and conn.on("close",
...) callbacks to clear the module-level connection and channel references
(e.g., set conn = null and ch = null or the equivalents used in this file) and
then invoke the existing reconnect/health-failure path (the same flow used when
a reconnect is needed or when health check fails) so getChannel() cannot return
a zombie channel; specifically modify the handlers around conn and ch in
connection.ts and call the module's reconnection trigger or health-failure
function immediately after clearing the references.

In `@apps/read-model-builder/cmd/main.go`:
- Around line 87-95: clusterNodes is split and appended into addrs without
filtering empty entries; update the loop that processes clusterNodes (the
strings.Split(clusterNodes, ",") iteration that appends to addrs) to trim each
entry and skip/apply continue for any resulting empty string so only non-empty
addresses are added, then log the count using len(addrs) to reflect the filtered
list; ensure the single-node branch (using getenv("REDIS_ADDR", "redis:6379"))
remains unchanged.

In `@apps/saga-orchestrator/internal/orchestrator/provision_saga.go`:
- Around line 132-135: The error message for the JSON marshal branch incorrectly
references "tenant.detach_device" while the code is marshaling the
quota.allocate_device command; update the fmt.Errorf call (the one returning
fmt.Errorf("marshal tenant.detach_device command: %w", detachErr)) to accurately
name the command being marshaled (e.g., "marshal quota.allocate_device command:
%w") or derive the command name from the cmd object; locate the marshal call
using the variables cmdBytes and detachErr in provision_saga.go to apply the
fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 565b0f33-9530-43e0-957e-72a6b2408ff5

📥 Commits

Reviewing files that changed from the base of the PR and between 997d60a and 9131353.

⛔ Files ignored due to path filters (4)
  • apps/bff/package-lock.json is excluded by !**/package-lock.json
  • apps/gateway/package-lock.json is excluded by !**/package-lock.json
  • apps/jobs-worker/package-lock.json is excluded by !**/package-lock.json
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (55)
  • .github/workflows/chaos.yml
  • .github/workflows/e2e.yml
  • .github/workflows/perf.yml
  • .github/workflows/terraform.yml
  • Makefile
  • apps/bff/package.json
  • apps/bff/src/datasources/redis.ts
  • apps/bff/src/lib/circuitBreaker.ts
  • apps/gateway/package.json
  • apps/gateway/src/lib/audit.ts
  • apps/gateway/src/lib/auth0-mgmt.ts
  • apps/gateway/src/middleware/planEnforcement.ts
  • apps/gateway/src/routes/billing.ts
  • apps/gateway/src/routes/devicesImport.ts
  • apps/gateway/src/routes/teamMembers.ts
  • apps/gateway/src/server.ts
  • apps/gateway/src/services/device.ts
  • apps/jobs-worker/src/connection.ts
  • apps/read-model-builder/cmd/main.go
  • apps/read-model-builder/internal/consumer/message_handler.go
  • apps/read-model-builder/internal/projection/device_projection.go
  • apps/read-model-builder/internal/projection/telemetry_projection.go
  • apps/saga-orchestrator/internal/domain/saga.go
  • apps/saga-orchestrator/internal/health/checker.go
  • apps/saga-orchestrator/internal/orchestrator/provision_saga.go
  • apps/saga-orchestrator/internal/repository/postgres_saga_repository.go
  • apps/search-indexer/main.py
  • apps/telemetry-service/migrations/000003_add_tenants_billing.down.sql
  • apps/telemetry-service/migrations/000003_add_tenants_billing.up.sql
  • apps/telemetry-service/migrations/000004_sso_alert_rules_bulk.down.sql
  • apps/telemetry-service/migrations/000004_sso_alert_rules_bulk.up.sql
  • go.mod
  • infra/kafka/mirrormaker2-connector.json
  • infra/terraform/backend.tf
  • infra/terraform/environments/dr/main.tf
  • infra/terraform/environments/dr/providers.tf
  • infra/terraform/environments/prod/main.tf
  • infra/terraform/environments/prod/providers.tf
  • infra/terraform/environments/prod/variables.tf
  • infra/terraform/modules/aurora-global/main.tf
  • infra/terraform/modules/elasticache-global/main.tf
  • libs/health/health.go
  • scripts/load-tests/performance-budget.js
  • tests/chaos/README.md
  • tests/chaos/kafka-consumer-pause.sh
  • tests/chaos/network-partition.yaml
  • tests/chaos/pod-kill.yaml
  • tests/chaos/projection-lag.sh
  • tests/chaos/redis-outage.sh
  • tests/chaos/run-all.sh
  • tests/e2e/auth.spec.ts
  • tests/e2e/billing.spec.ts
  • tests/e2e/devices.spec.ts
  • tests/e2e/fixtures/mockAuth.ts
  • tests/e2e/playwright.config.ts

Comment thread .github/workflows/chaos.yml Outdated
Comment thread .github/workflows/chaos.yml
Comment thread .github/workflows/perf.yml
Comment thread .github/workflows/terraform.yml Outdated
Comment on lines +23 to +25
strategy:
matrix:
env: [dev, prod]

⚠️ Potential issue | 🟡 Minor

DR environment missing from plan matrix.

The matrix plans [dev, prod] but the PR also introduces infra/terraform/environments/dr/. Consider adding dr to the matrix or document why DR is excluded from CI.

🔧 Add DR to matrix
     strategy:
       matrix:
-        env: [dev, prod]
+        env: [dev, prod, dr]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     strategy:
       matrix:
-        env: [dev, prod]
+        env: [dev, prod, dr]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/terraform.yml around lines 23 - 25, The CI matrix under
strategy.matrix.env currently lists only [dev, prod] but the repo adds
infra/terraform/environments/dr/, so update the matrix to include "dr" (i.e.,
change strategy.matrix.env to [dev, dr, prod]) or, if intentional, add a brief
comment/documentation in the workflow explaining why DR is excluded and/or add
conditional logic to skip DR; update the workflow's strategy.matrix.env entry
accordingly to ensure DR gets planned or document the exclusion.

Comment thread tests/chaos/projection-lag.sh
Comment on lines +77 to +79
panic_count=$(kubectl logs -n "$NAMESPACE" deploy/saga-orchestrator \
--since="${OUTAGE_SECONDS}s" 2>/dev/null \
| grep -c "panic\|FATAL\|unhandled" || true)

🧹 Nitpick | 🔵 Trivial

Log window may miss early outage panics.

The --since="${OUTAGE_SECONDS}s" flag is evaluated mid-outage (after ~10s of sleep and fallback checks), so it looks back further than the actual outage duration at that point. This is likely acceptable for catching panics, but the comment could clarify the intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/redis-outage.sh` around lines 77 - 79, The panic_count log window
currently uses --since="${OUTAGE_SECONDS}s" in the kubectl logs call (captured
into the panic_count variable) but that flag is evaluated at the time the
command runs (mid-outage), so it actually looks further back than the outage
start; update the nearby comment to explicitly state this behavior and the
intent (i.e., we intentionally allow a larger lookback to capture early panic
messages) or, if you prefer strict outage-only logs, change the approach to
record a timestamp at outage start and use --since-time instead; reference the
panic_count assignment, the kubectl logs invocation, and the
OUTAGE_SECONDS/NAMESPACE variables when making the comment or code change.

Comment thread tests/chaos/redis-outage.sh Outdated
Comment thread tests/e2e/devices.spec.ts
Comment thread tests/e2e/devices.spec.ts

@coderabbitai coderabbitai Bot left a comment


Review continued from previous batch...

Comment thread .github/workflows/chaos.yml
Comment thread .github/workflows/chaos.yml
Comment thread tests/chaos/redis-outage.sh

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 20

♻️ Duplicate comments (7)
go.mod (2)

21-23: ⚠️ Potential issue | 🟠 Major

Align OTel module versions to the same release (v1.42.0).

go.opentelemetry.io/otel and otel/sdk are at v1.42.0, but otlptracegrpc is at v1.41.0. This version skew can cause subtle compatibility issues in the tracing wiring (libs/observability/tracing.go, lines 15-49).

Proposed fix
 	go.opentelemetry.io/otel v1.42.0
-	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.41.0
+	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.42.0
 	go.opentelemetry.io/otel/sdk v1.42.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` around lines 21 - 23, Update the OTel exporter module to match the
other OpenTelemetry modules: change the dependency version for
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc from v1.41.0 to
v1.42.0 in go.mod so all three modules (go.opentelemetry.io/otel,
go.opentelemetry.io/otel/sdk, and
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc) are aligned;
run go mod tidy afterwards and verify tracing initialization code in
libs/observability/tracing.go still builds and imports the otlptracegrpc package
without version conflicts.

24-24: ⚠️ Potential issue | 🔴 Critical

google.golang.org/grpc v1.79.2 is flagged with a critical auth bypass advisory.

Please upgrade to a patched gRPC release immediately to avoid shipping known vulnerable transport/auth behavior.

What is the first patched version for GHSA-p77j-4mvh-x3m3 (GO-2026-4762) in google.golang.org/grpc, and what upgrade version range is currently recommended?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 24, The go.mod currently pins a vulnerable
google.golang.org/grpc version (google.golang.org/grpc v1.79.2); update the
dependency to a patched release immediately—the first patched release is
v1.79.3, but you should upgrade to a non-vulnerable release series (recommended:
>= v1.80.0) and ensure your go.mod/go.sum are updated and tests re-run to verify
compatibility.
.github/workflows/terraform.yml (1)

23-25: ⚠️ Potential issue | 🟠 Major

DR environment excluded from CI despite prod dependency.

The matrix only includes [dev, prod], but the prod environment explicitly depends on DR infrastructure for Aurora Global Database and ElastiCache Global Datastore replication (see infra/terraform/environments/prod/main.tf). Excluding DR from CI validation means:

  • DR infrastructure drift won't be detected
  • Changes to prod's global database configuration won't be validated against DR
  • DR failover capability could be compromised without warning

Add dr to the matrix or document why DR is intentionally excluded and how DR infrastructure is validated.

🔧 Add DR to matrix
     strategy:
       matrix:
-        env: [dev, prod]
+        env: [dev, prod, dr]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/terraform.yml around lines 23 - 25, The CI matrix
currently defines strategy.matrix.env with only [dev, prod], which omits the DR
environment required by prod's dependent global resources; update the matrix
definition (strategy → matrix → env) to include "dr" so CI validates DR
alongside prod, or alternatively add a clear comment and documentation
explaining why "dr" is intentionally excluded and where/how DR infra is
validated; ensure the change references the strategy.matrix.env entry so
reviewers can find it easily.
apps/read-model-builder/cmd/main.go (1)

85-103: ⚠️ Potential issue | 🟠 Major

Sanitize REDIS_CLUSTER_NODES before using NewUniversalClient().

Blank entries are still appended, and a value with a trailing comma such as REDIS_CLUSTER_NODES=redis-0:6379, leaves only one real address after trimming. In go-redis/v9, NewUniversalClient uses a standalone client when Addrs has a single entry, so this branch can log “Redis cluster mode” without actually creating a cluster-aware client. Filter empty nodes and either require multiple valid addresses here or switch the cluster branch to redis.NewClusterClient.

In github.com/redis/go-redis/v9, how does redis.NewUniversalClient choose between Client and ClusterClient when UniversalOptions.Addrs has one vs multiple addresses?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/cmd/main.go` around lines 85 - 103, The code
currently builds addrs from getenv("REDIS_CLUSTER_NODES") but doesn't filter out
empty/whitespace entries, so a trailing comma can leave a single valid address
and erroneously log "Redis cluster mode" while NewUniversalClient will create a
standalone client; update the logic that builds addrs (used by clusterNodes,
addrs, and redis.NewUniversalClient) to trim and skip empty strings when
splitting, then: if you truly want cluster mode only when multiple valid
addresses exist, require len(addrs) > 1 to log cluster mode and call
redis.NewClusterClient (or keep NewUniversalClient but only in multi-node case),
otherwise fall back to a single-node path (use the single addr or
redis.NewClient) and log single-node mode; ensure you reference clusterNodes,
addrs, redis.NewUniversalClient and consider switching to redis.NewClusterClient
when multiple valid addresses are present.
apps/gateway/src/routes/teamMembers.ts (1)

80-93: ⚠️ Potential issue | 🟠 Major

Return the actual invite outcome.

If the tenant has no auth0_org_id, or inviteToOrg() rejects, this branch still falls through to the existing 201 response with invited: true. That tells admins an email was sent when none was.

Suggested fix
-    pool.query("SELECT auth0_org_id, name FROM tenants WHERE id = $1", [tenantId])
-      .then((result: { rows: Array<{ auth0_org_id?: string; name?: string }> }) => {
-        const rows = result.rows;
-        if (!rows[0]?.auth0_org_id) return;
-        return inviteToOrg({
-          orgId:       rows[0].auth0_org_id,
-          email:       email.trim().toLowerCase(),
-          role:        memberRole,
-          inviterName: rows[0].name || "GrainGuard",
-        });
-      })
-      .catch((_e: unknown) =>
-        console.error("[team] inviteToOrg failed (non-fatal):", _e)
-      );
+    let invited = false;
+    try {
+      const result = await pool.query(
+        "SELECT auth0_org_id, name FROM tenants WHERE id = $1",
+        [tenantId]
+      );
+      const tenant = result.rows[0];
+      if (tenant?.auth0_org_id) {
+        await inviteToOrg({
+          orgId: tenant.auth0_org_id,
+          email: email.trim().toLowerCase(),
+          role: memberRole,
+          inviterName: tenant.name || "GrainGuard",
+        });
+        invited = true;
+      }
+    } catch (_e: unknown) {
+      console.error("[team] inviteToOrg failed (non-fatal):", _e);
+    }
@@
-      invited:     true,
+      invited,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/teamMembers.ts` around lines 80 - 93, The current
pool.query promise chain (in the teamMembers.ts block that calls inviteToOrg
with tenantId, email, memberRole) swallows failure and missing auth0_org_id
cases and always allows the route to return 201 invited:true; change this to
return the actual invite outcome: when rows[0]?.auth0_org_id is missing return
an explicit failure outcome (e.g., invited: false), and when calling inviteToOrg
await/propagate its result instead of just calling it in a then; also catch
inviteToOrg rejections and return invited: false (or an object indicating error)
so the route handler can use that returned value to set the proper 201/4xx
response and invited flag. Ensure the change is made in the promise chain that
currently invokes inviteToOrg (referencing pool.query, rows[0].auth0_org_id, and
inviteToOrg) so the caller receives and uses the real outcome.
apps/bff/src/lib/circuitBreaker.ts (1)

43-45: ⚠️ Potential issue | 🟠 Major

Tenant-scope this Redis key or move it out of the shared cache namespace.

cb:${this.name} is still global. In this repo, BFF cache keys must carry tenantId; otherwise different tenants can read/write the same Redis entry. If this breaker is intentionally global, store it outside the shared app cache namespace instead of reusing cache.

As per coding guidelines, Verify cache keys include tenantId.

Also applies to: 59-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/lib/circuitBreaker.ts` around lines 43 - 45, The Redis key
`cb:${this.name}` used in cache.get (and the corresponding cache.set usage
around lines 59-63) is currently global; update CircuitBreaker code to include
the tenant identifier in the key (e.g., `cb:${tenantId}:${this.name}`) or, if
the breaker must be truly global, move it off the shared `cache` and use the
dedicated global cache instance instead. Locate the usages of cache.get and
cache.set that reference `cb:${this.name}` and modify them to read the tenantId
from the request/context available to the class (or switch to the app-level
cache), ensuring every cache key includes tenantId when using the shared cache.
tests/chaos/projection-lag.sh (1)

58-60: ⚠️ Potential issue | 🟠 Major

Restore the original replica count on exit.

This still hard-codes 0 -> 1 and has no EXIT trap. If the script is interrupted, read-model-builder stays down, and a deployment that started with a different replica count is restored incorrectly.

Also applies to: 89-91

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/projection-lag.sh` around lines 58 - 60, The script currently
hard-codes scaling read-model-builder to 0 and back to 1 without an EXIT trap;
capture the current replica count first (e.g., query
deployment/read-model-builder replicas via kubectl to a variable), install a
trap on EXIT (and on INT/TERM) to restore that captured replica count, then use
that captured value when scaling back; apply the same pattern to the second
scaling block around lines 89-91 so both pauses restore the original replica
count even if the script is interrupted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/perf.yml:
- Around line 82-102: The BFF fails at module load because the env block in the
"Start BFF in background" step is missing required JWT configuration; add
JWKS_URL, JWT_ISSUER, and JWT_AUDIENCE environment variables to that step so
apps/bff's startup code (server.ts) can initialize; set JWKS_URL to a well-known
JWKS endpoint derived from AUTH0_DOMAIN (e.g.
https://placeholder.auth0.com/.well-known/jwks.json), set JWT_ISSUER to the
auth0 issuer URL (e.g. https://placeholder.auth0.com/), and set JWT_AUDIENCE to
match AUTH0_AUDIENCE (placeholder) or appropriate audience.
- Around line 114-119: The perf step never exercises secured endpoints because
scripts/load-tests/performance-budget.js skips /devices/:id/latest and /graphql
when JWT is empty; update the "Run performance budget" job so it provides a
non-empty JWT env var (e.g., set JWT via an action that obtains a test/service
token or inject a static test token) when invoking k6, ensuring the script sees
process.env.JWT and will exercise the protected routes; reference the Run
performance budget step and scripts/load-tests/performance-budget.js when making
the change.

In @.github/workflows/terraform.yml:
- Around line 1-11: Add a concurrency stanza to the Terraform GitHub Actions
jobs so plans for the same environment are queued instead of running
concurrently; under the Terraform job(s) in the workflow (the jobs block that
runs your plan/apply) add a concurrency key with a group that scopes to the
target environment/branch (for example use an expression like terraform-${{
github.ref }} or terraform-${{ github.head_ref }} to uniquely identify the
environment) and set cancel-in-progress: false to ensure jobs are queued rather
than cancelled.
- Around line 114-117: The "Terraform Apply" step should re-run a plan to detect
drift before applying; change the step to first run `terraform plan -input=false
-out=tfplan` (or `terraform plan -detailed-exitcode` if you want to detect and
fail on changes) and then run `terraform apply -input=false tfplan` instead of
applying directly, so the workflow uses a freshly generated plan and can detect
unexpected state drift before making changes; update the step named "Terraform
Apply" to perform these two commands and handle non-zero detailed-exitcode if
you want the job to fail when drift is detected.

In `@apps/bff/src/datasources/redis.ts`:
- Around line 46-47: getMany() currently fires client.get for every key via
Promise.all (see getMany and its use from resolvers.ts with deviceIds), which
can overwhelm Redis; change getMany to limit concurrency by batching or using a
concurrency limiter (e.g., process keys in chunks of a defined BATCH_SIZE or use
a p-limit-like pattern) so only a fixed number of client.get calls run
concurrently, preserve input ordering when assembling results, and continue to
JSON.parse non-null values and return null for misses; add a configurable
constant (e.g., BATCH_SIZE or MAX_CONCURRENCY) and ensure tests still expect the
same return shape.
- Around line 9-17: When building rootNodes from REDIS_CLUSTER_NODES before
calling createCluster(), reject blank or malformed entries: split on commas,
filter out entries that are empty after trim, then for each segment parse host
and port and validate that host is non-empty and port parses to a finite integer
within 1–65535 (or use default 6379 only if port is absent). If validation fails
for any node, surface a clear error (throw or log + exit) instead of producing a
node with empty host or NaN port; then map the validated entries into the same
socket shape used by rootNodes. Ensure the check is applied where
REDIS_CLUSTER_NODES and rootNodes are defined so createCluster() only receives
valid nodes.
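
The batching suggestion above might be sketched like this — a chunked `getMany` that keeps at most `BATCH_SIZE` GETs in flight while preserving input ordering and the `(T | null)[]` return shape. `RedisLike` and the `BATCH_SIZE` value are illustrative assumptions, not the actual module API:

```typescript
const BATCH_SIZE = 100; // hypothetical cap on concurrent GETs

interface RedisLike {
  get(key: string): Promise<string | null>;
}

async function getManyBatched<T>(
  client: RedisLike,
  keys: string[]
): Promise<(T | null)[]> {
  if (keys.length === 0) return [];
  const out: (T | null)[] = [];
  // Process keys chunk by chunk so only BATCH_SIZE GETs run at once.
  for (let i = 0; i < keys.length; i += BATCH_SIZE) {
    const chunk = keys.slice(i, i + BATCH_SIZE);
    const raw = await Promise.all(chunk.map((k) => client.get(k)));
    out.push(...raw.map((v) => (v === null ? null : (JSON.parse(v) as T))));
  }
  return out; // same ordering and return shape as the unbatched version
}
```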

In `@apps/bff/src/lib/circuitBreaker.ts`:
- Around line 46-50: syncFromRedis currently only adopts remote state when
shared.failureCount > this.failureCount which prevents pods from converging
(e.g. a CLOSED write with failureCount 0 is ignored by OPEN pods). Change the
comparison to use a monotonic version/timestamp field (add and persist a
shared.lastUpdated/version) and in syncFromRedis compare shared.lastUpdated >
this.lastUpdated; when true, copy shared.state, shared.failureCount,
shared.lastFailureTime and set this.lastUpdated to shared.lastUpdated so all
pods converge deterministically.
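
A sketch of that version-based convergence: adopt the shared snapshot whenever its `lastUpdated` is newer, regardless of whether `failureCount` grew, so a CLOSED write with `failureCount` 0 also wins. The field and class names here are illustrative:

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface SharedSnapshot {
  state: BreakerState;
  failureCount: number;
  lastFailureTime: number;
  lastUpdated: number; // monotonic version / timestamp persisted to Redis
}

class SyncedBreaker {
  state: BreakerState = "OPEN";
  failureCount = 5;
  lastFailureTime = 0;
  lastUpdated = 0;

  syncFromSnapshot(shared: SharedSnapshot): void {
    // Newer snapshot wins unconditionally; stale snapshots are ignored.
    if (shared.lastUpdated > this.lastUpdated) {
      this.state = shared.state;
      this.failureCount = shared.failureCount;
      this.lastFailureTime = shared.lastFailureTime;
      this.lastUpdated = shared.lastUpdated;
    }
  }
}
```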

In `@apps/gateway/src/lib/audit.ts`:
- Around line 26-29: The AuditEventType union in apps/gateway/src/lib/audit.ts
contains duplicate string literals ("device.created", "device.creation_failed",
"device.registered", "webhook_endpoint.created") that are already declared again
under the // Devices and // Webhooks sections; remove these top-level duplicates
so each audit event literal appears only once (leave the canonical declarations
in the // Devices and // Webhooks blocks), and verify the symbol AuditEventType
still compiles after the cleanup.

In `@apps/gateway/src/server.ts`:
- Around line 191-199: The audit write is currently fire-and-forget via
logAuditEvent, risking lost entries if the process exits; change the callers
(the device creation path where logAuditEvent is invoked and the other
occurrences around the device route) to await the promise returned by
logAuditEvent (ensure the enclosing handler function is async) so the request
handler waits for writeAuditLog to complete; do the same for the other instance
referenced (the block at 203-210) to ensure all audit writes are awaited and any
internal errors remain swallowed by writeAuditLog.
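
In sketch form, the awaited audit write looks like this. The handler shape and `logAuditEvent` signature are hypothetical; the point is only that the 201 is not sent before the audit entry is persisted:

```typescript
type AuditEvent = { type: string; deviceId: string };

async function handleDeviceCreate(
  logAuditEvent: (e: AuditEvent) => Promise<void>,
  deviceId: string
): Promise<{ status: number }> {
  // Before: logAuditEvent({ ... });  // fire-and-forget, can be lost on exit
  // After: await it; logAuditEvent is assumed to swallow internal errors.
  await logAuditEvent({ type: "device.created", deviceId });
  return { status: 201 };
}
```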

In `@apps/telemetry-service/migrations/000009_sso_alert_rules_bulk.down.sql`:
- Around line 6-9: The tenants SSO columns (auth0_org_id, sso_connection_id,
sso_connection_type) are declared in both migrations 000007_saas_columns and
000009_sso_alert_rules_bulk, causing ambiguous ownership—remove the column ADDs
and their DROPs from the 000007 migration files (e.g.,
000007_saas_columns.up.sql and its down.sql) so these columns are only
created/dropped in 000009_sso_alert_rules_bulk.up.sql/down.sql (ensure the
auth0_org_id index and its removal remain in 000009 and update any
comments/commit messages to reflect single ownership).

In `@apps/telemetry-service/migrations/000009_sso_alert_rules_bulk.up.sql`:
- Around line 4-7: This migration duplicates SSO column additions already
created in migration 000007_saas_columns.up.sql; remove the three ALTER TABLE
ADD COLUMN IF NOT EXISTS statements for auth0_org_id, sso_connection_id, and
sso_connection_type from 000009_sso_alert_rules_bulk.up.sql so the columns
remain owned by 000007; after removing, ensure ALTER TABLE tenants in 000009
only contains the non-SSO changes (or any other intended column additions) and
verify no subsequent migration logic or tests rely on 000009 to create those SSO
columns.
- Around line 43-45: Add non-negative CHECK constraints to the count columns to
enforce data integrity: update the table definition (or an ALTER TABLE in this
migration) to add CHECK constraints that ensure total_rows >= 0, success_rows >=
0 and error_rows >= 0 (you can add three separate constraints or a single
combined CHECK). Reference the column names total_rows, success_rows and
error_rows (and the surrounding CREATE TABLE/ALTER TABLE statement in this
migration) and ensure constraint names are unique and descriptive so the DB will
reject any negative counts at insert/update time.

In `@scripts/load-tests/performance-budget.js`:
- Around line 73-75: The /health probe failures aren't counted because
error_rate and total_errors are only updated inside the JWT-protected branches;
update the same counters when handling healthRes so failures are included. In
the block around healthRes (variable healthRes, check call that asserts r.status
=== 200, and gatewayP95.add) add logic to increment total_errors and update
error_rate when healthRes.status !== 200 (mirror the JWT-branch error handling),
and ensure any tags/metrics used for error aggregation are the same as in the
JWT branches so CI runs without JWT still contribute to the error budget.

In `@tests/chaos/kafka-consumer-pause.sh`:
- Around line 18-19: The script only lists CONSUMERS=("read-model-builder"
"cdc-transformer") so it misses the sibling consumer group
"read-model-builder-devices" started by the read-model-builder binary; update
the CONSUMERS array to include "read-model-builder-devices" and ensure any
checks/loops that use CONSUMERS and ORIGINAL_REPLICAS operate over both entries
so the pause/unpause verification waits for both read-model-builder and
read-model-builder-devices to catch up.
- Around line 29-35: The current_lag() function is summing the wrong awk field
(using $NF which is CLIENT-ID) and thus returns incorrect lag; update the awk
invocation inside current_lag() to sum column 5 (the LAG field) instead of $NF
so the function computes total lag correctly (replace the awk expression that
uses $NF with one that uses $5).

In `@tests/chaos/projection-lag.sh`:
- Around line 19-21: Increase the alert timing windows to avoid false failures:
update ALERT_WINDOW and RECOVERY_WINDOW (and any hardcoded checks in the same
file range ~lines 64-85) to values that account for pod shutdown, rollout
completion, and lag accumulation before the `ProjectionLagHigh` alert's `for:
2m` expires; specifically enlarge ALERT_WINDOW from 120s to a value >= (shutdown
+ rollout + 2m + one evaluation interval) and extend RECOVERY_WINDOW similarly,
and ensure STRICT_ALERT_CHECK logic still gates strict verification but uses
these larger windows so CI doesn't fail prematurely.
- Around line 39-42: The alert_firing() helper currently returns true for any
ProjectionLagHigh alert; update it to scope to the read-model-builder instance
by verifying the alert also contains the label for this experiment (e.g., the
job/instance/service label used in the Prometheus rule). Modify alert_firing to
fetch ${PROMETHEUS_URL}/api/v1/alerts and check both
'"alertname":"ProjectionLagHigh"' and the experiment-specific label (for example
'"job":"read-model-builder"' or the actual label name used in slo-rules.yaml) —
use a combined check (jq or two greps with &&) so only ProjectionLagHigh alerts
for read-model-builder make the function return success.
- Around line 31-36: The current_lag() function is summing the wrong awk field
(using $NF which is CLIENT-ID); update the awk expression in current_lag() to
sum the LAG column (column 6) instead: replace the awk script with one that uses
$6 (e.g., 'NR>1 { sum += $6 } END { print sum+0 }') so the function returns the
actual consumer lag used by the pre-check and recovery assertions.

In `@tests/chaos/redis-outage.sh`:
- Around line 117-122: The warm-up probe measures elapsed_ms but never asserts
it meets the 30s SLO; after calling http_check "Cache warm-up probe" and
computing elapsed_ms, add an assertion that elapsed_ms is <= 30000 (30,000 ms)
and fail the script if it exceeds the SLO (e.g., call an existing fail function
or exit 1). Update the block around t_start, http_check, t_end, elapsed_ms, and
log to perform the check and produce a clear error message including elapsed_ms
when the SLO is violated.
- Around line 27-41: The http_check function currently treats any HTTP 200 as
success but discards the response body; update http_check (the function using
GATEWAY_URL, GRAPHQL_QUERY and TEST_JWT) to capture the response body and HTTP
status, then parse the JSON to assert GraphQL-level success (e.g., ensure no
"errors" field and that "data" is present/non-null) before reporting success; if
JSON parsing fails or errors exist, log the body and return failure so the
script won't falsely report a successful GraphQL query.

---

Duplicate comments:
In @.github/workflows/terraform.yml:
- Around line 23-25: The CI matrix currently defines strategy.matrix.env with
only [dev, prod], which omits the DR environment required by prod's dependent
global resources; update the matrix definition (strategy → matrix → env) to
include "dr" so CI validates DR alongside prod, or alternatively add a clear
comment and documentation explaining why "dr" is intentionally excluded and
where/how DR infra is validated; ensure the change references the
strategy.matrix.env entry so reviewers can find it easily.

In `@apps/bff/src/lib/circuitBreaker.ts`:
- Around line 43-45: The Redis key `cb:${this.name}` used in cache.get (and the
corresponding cache.set usage around lines 59-63) is currently global; update
CircuitBreaker code to include the tenant identifier in the key (e.g.,
`cb:${tenantId}:${this.name}`) or, if the breaker must be truly global, move it
off the shared `cache` and use the dedicated global cache instance instead.
Locate the usages of cache.get and cache.set that reference `cb:${this.name}`
and modify them to read the tenantId from the request/context available to the
class (or switch to the app-level cache), ensuring every cache key includes
tenantId when using the shared cache.

In `@apps/gateway/src/routes/teamMembers.ts`:
- Around line 80-93: The current pool.query promise chain (in the teamMembers.ts
block that calls inviteToOrg with tenantId, email, memberRole) swallows failure
and missing auth0_org_id cases and always allows the route to return 201
invited:true; change this to return the actual invite outcome: when
rows[0]?.auth0_org_id is missing return an explicit failure outcome (e.g.,
invited: false), and when calling inviteToOrg await/propagate its result instead
of just calling it in a then; also catch inviteToOrg rejections and return
invited: false (or an object indicating error) so the route handler can use that
returned value to set the proper 201/4xx response and invited flag. Ensure the
change is made in the promise chain that currently invokes inviteToOrg
(referencing pool.query, rows[0].auth0_org_id, and inviteToOrg) so the caller
receives and uses the real outcome.

In `@apps/read-model-builder/cmd/main.go`:
- Around line 85-103: The code currently builds addrs from
getenv("REDIS_CLUSTER_NODES") but doesn't filter out empty/whitespace entries,
so a trailing comma can leave a single valid address and erroneously log "Redis
cluster mode" while NewUniversalClient will create a standalone client; update
the logic that builds addrs (used by clusterNodes, addrs, and
redis.NewUniversalClient) to trim and skip empty strings when splitting, then:
if you truly want cluster mode only when multiple valid addresses exist, require
len(addrs) > 1 to log cluster mode and call redis.NewClusterClient (or keep
NewUniversalClient but only in multi-node case), otherwise fall back to a
single-node path (use the single addr or redis.NewClient) and log single-node
mode; ensure you reference clusterNodes, addrs, redis.NewUniversalClient and
consider switching to redis.NewClusterClient when multiple valid addresses are
present.

In `@go.mod`:
- Around line 21-23: Update the OTel exporter module to match the other
OpenTelemetry modules: change the dependency version for
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc from v1.41.0 to
v1.42.0 in go.mod so all three modules (go.opentelemetry.io/otel,
go.opentelemetry.io/otel/sdk, and
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc) are aligned;
run go mod tidy afterwards and verify tracing initialization code in
libs/observability/tracing.go still builds and imports the otlptracegrpc package
without version conflicts.
- Line 24: The go.mod currently pins a vulnerable google.golang.org/grpc version
(google.golang.org/grpc v1.79.2); update the dependency to a patched release
immediately—the first patched release is v1.79.3, but you should upgrade to a
non-vulnerable release series (recommended: >= v1.80.0) and ensure your
go.mod/go.sum are updated and tests re-run to verify compatibility.

In `@tests/chaos/projection-lag.sh`:
- Around line 58-60: The script currently hard-codes scaling read-model-builder
to 0 and back to 1 without an EXIT trap; capture the current replica count first
(e.g., query deployment/read-model-builder replicas via kubectl to a variable),
install a trap on EXIT (and on INT/TERM) to restore that captured replica count,
then use that captured value when scaling back; apply the same pattern to the
second scaling block around lines 89-91 so both pauses restore the original
replica count even if the script is interrupted.
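The capture-and-restore pattern described above can be sketched as below. This is illustration only: `kubectl` is mocked so the pattern runs standalone, the `-n "$NAMESPACE"` flag is omitted for brevity, and `restore_replicas` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: capture the current replica count, then restore it on EXIT/INT/TERM.
# kubectl is mocked here so the control flow is runnable; remove the mock and
# add -n "$NAMESPACE" in the real chaos script.
kubectl() {
  case "$1 $2" in
    "get deploy/read-model-builder") echo 1 ;;   # pretend 1 replica today
    *) echo "kubectl $*" >&2 ;;                  # echo other calls for visibility
  esac
}

ORIGINAL_REPLICAS=$(kubectl get deploy/read-model-builder -o 'jsonpath={.spec.replicas}')

restore_replicas() {
  kubectl scale deploy/read-model-builder --replicas="${ORIGINAL_REPLICAS:-1}"
}
trap restore_replicas EXIT INT TERM

kubectl scale deploy/read-model-builder --replicas=0   # pause the consumer
# ... experiment body; the trap restores the captured count even on Ctrl-C ...
```

The same trap covers both scaling blocks, so a second explicit scale-up is no longer the only restore path.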
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0734efe7-193a-4f1b-92eb-ef32aed874a5

📥 Commits

Reviewing files that changed from the base of the PR and between 9131353 and 8a97319.

⛔ Files ignored due to path filters (3)
  • apps/gateway/package-lock.json is excluded by !**/package-lock.json
  • apps/jobs-worker/package-lock.json is excluded by !**/package-lock.json
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (22)
  • .github/workflows/chaos.yml
  • .github/workflows/perf.yml
  • .github/workflows/terraform.yml
  • apps/bff/src/datasources/redis.ts
  • apps/bff/src/lib/circuitBreaker.ts
  • apps/gateway/package.json
  • apps/gateway/src/lib/audit.ts
  • apps/gateway/src/routes/teamMembers.ts
  • apps/gateway/src/server.ts
  • apps/read-model-builder/cmd/main.go
  • apps/telemetry-service/migrations/000007_saas_columns.down.sql
  • apps/telemetry-service/migrations/000007_saas_columns.up.sql
  • apps/telemetry-service/migrations/000009_sso_alert_rules_bulk.down.sql
  • apps/telemetry-service/migrations/000009_sso_alert_rules_bulk.up.sql
  • apps/workflow-alerts/src/index.ts
  • go.mod
  • scripts/load-tests/performance-budget.js
  • tests/chaos/kafka-consumer-pause.sh
  • tests/chaos/projection-lag.sh
  • tests/chaos/redis-outage.sh
  • tests/e2e/devices.spec.ts
  • tests/e2e/playwright.config.ts

Comment on lines +82 to +102
      - name: Start BFF in background
        run: npx ts-node src/server.ts &
        working-directory: apps/bff
        env:
          PORT: 4000
          POSTGRES_HOST: localhost
          POSTGRES_PORT: 5432
          POSTGRES_USER: grainguard
          POSTGRES_PASSWORD: grainguard
          POSTGRES_DB: grainguard
          REDIS_HOST: localhost
          REDIS_PORT: 6379
          ELASTICSEARCH_URL: http://localhost:9200
          CASSANDRA_HOST: localhost
          CASSANDRA_PORT: 9042
          AUTH0_DOMAIN: placeholder.auth0.com
          AUTH0_AUDIENCE: placeholder
          AUTH0_ORG_CLAIM: org_id
          ALLOWED_ORIGINS: http://localhost:5173
          JWT_SECRET: dev-secret


⚠️ Potential issue | 🔴 Critical

BFF won't boot with this env block.

apps/bff/src/server.ts throws during module load unless JWKS_URL, JWT_ISSUER, and JWT_AUDIENCE are set. This step only provides JWT_SECRET/AUTH0_AUDIENCE, so the "Wait for BFF" step never sees a live /graphql endpoint.

Suggested fix
       - name: Start BFF in background
         run: npx ts-node src/server.ts &
         working-directory: apps/bff
         env:
           PORT: 4000
+          JWKS_URL: ${{ secrets.PERF_JWKS_URL }}
+          JWT_ISSUER: ${{ secrets.PERF_JWT_ISSUER }}
+          JWT_AUDIENCE: ${{ secrets.PERF_JWT_AUDIENCE }}
           POSTGRES_HOST: localhost
           POSTGRES_PORT: 5432
           POSTGRES_USER: grainguard
           POSTGRES_PASSWORD: grainguard
           POSTGRES_DB: grainguard
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 82 - 102, The BFF fails at module
load because the env block in the "Start BFF in background" step is missing
required JWT configuration; add JWKS_URL, JWT_ISSUER, and JWT_AUDIENCE
environment variables to that step so apps/bff's startup code (server.ts) can
initialize; set JWKS_URL to a well-known JWKS endpoint derived from AUTH0_DOMAIN
(e.g. https://placeholder.auth0.com/.well-known/jwks.json), set JWT_ISSUER to
the auth0 issuer URL (e.g. https://placeholder.auth0.com/), and set JWT_AUDIENCE
to match AUTH0_AUDIENCE (placeholder) or appropriate audience.

Comment on lines +114 to +119
      - name: Run performance budget
        run: |
          k6 run \
            --env GATEWAY_URL=http://localhost:3000 \
            --env BFF_URL=http://localhost:4000 \
            scripts/load-tests/performance-budget.js

⚠️ Potential issue | 🟠 Major

The perf step never exercises the secured endpoints.

scripts/load-tests/performance-budget.js skips both /devices/:id/latest and /graphql when JWT is empty, and this step never passes one. Right now the workflow only measures /health, so BFF latency and auth-protected gateway regressions are invisible.

Suggested fix
       - name: Run performance budget
         run: |
           k6 run \
             --env GATEWAY_URL=http://localhost:3000 \
             --env BFF_URL=http://localhost:4000 \
+            --env JWT=${{ secrets.PERF_JWT }} \
             scripts/load-tests/performance-budget.js
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 114 - 119, The perf step never
exercises secured endpoints because scripts/load-tests/performance-budget.js
skips /devices/:id/latest and /graphql when JWT is empty; update the "Run
performance budget" job so it provides a non-empty JWT env var (e.g., set JWT
via an action that obtains a test/service token or inject a static test token)
when invoking k6, ensuring the script sees process.env.JWT and will exercise the
protected routes; reference the Run performance budget step and
scripts/load-tests/performance-budget.js when making the change.

Comment on lines +1 to +11
name: Terraform

on:
  pull_request:
    branches: [master]
    paths:
      - "infra/terraform/**"
  push:
    branches: [master]
    paths:
      - "infra/terraform/**"

⚠️ Potential issue | 🟠 Major

Add concurrency control to prevent Terraform state lock conflicts.

Multiple PRs modifying Terraform files will trigger simultaneous plan jobs for the same environment, causing "Error acquiring the state lock" failures when Terraform attempts to acquire the DynamoDB lock. This creates intermittent failures and poor developer experience.

Add a concurrency key to queue jobs per environment instead of failing them.

🔒 Add concurrency control
 name: Terraform
+
+concurrency:
+  group: terraform-${{ github.workflow }}-${{ matrix.env || 'apply-dev' }}
+  cancel-in-progress: false

 on:
   pull_request:

Note: The cancel-in-progress: false ensures queuing rather than cancellation, which is important for Terraform operations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

name: Terraform

concurrency:
  group: terraform-${{ github.workflow }}-${{ matrix.env || 'apply-dev' }}
  cancel-in-progress: false

on:
  pull_request:
    branches: [master]
    paths:
      - "infra/terraform/**"
  push:
    branches: [master]
    paths:
      - "infra/terraform/**"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/terraform.yml around lines 1 - 11, Add a concurrency
stanza to the Terraform GitHub Actions jobs so plans for the same environment
are queued instead of running concurrently; under the Terraform job(s) in the
workflow (the jobs block that runs your plan/apply) add a concurrency key with a
group that scopes to the target environment/branch (for example use an
expression like terraform-${{ github.ref }} or terraform-${{ github.head_ref }}
to uniquely identify the environment) and set cancel-in-progress: false to
ensure jobs are queued rather than cancelled.

Comment on lines +114 to +117
      - name: Terraform Apply
        env:
          TF_VAR_db_password: ${{ secrets.TF_VAR_DB_PASSWORD }}
        run: terraform apply -input=false -auto-approve

🧹 Nitpick | 🔵 Trivial

Consider re-planning before apply to detect state drift.

The apply job uses the plan generated during the PR, but state might have changed if other PRs merged in the meantime. Re-running terraform plan before apply would detect drift and prevent unexpected changes.

🔍 Add drift detection
       - name: Terraform Init
         run: terraform init -input=false
 
+      - name: Terraform Plan (verify no drift)
+        env:
+          TF_VAR_db_password: ${{ secrets.TF_VAR_DB_PASSWORD }}
+        run: terraform plan -input=false -no-color
+
       - name: Terraform Apply
         env:
           TF_VAR_db_password: ${{ secrets.TF_VAR_DB_PASSWORD }}
         run: terraform apply -input=false -auto-approve
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/terraform.yml around lines 114 - 117, The "Terraform
Apply" step should re-run a plan to detect drift before applying; change the
step to first run `terraform plan -input=false -out=tfplan` (or `terraform plan
-detailed-exitcode` if you want to detect and fail on changes) and then run
`terraform apply -input=false tfplan` instead of applying directly, so the
workflow uses a freshly generated plan and can detect unexpected state drift
before making changes; update the step named "Terraform Apply" to perform these
two commands and handle non-zero detailed-exitcode if you want the job to fail
when drift is detected.
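The plan-then-apply flow the prompt describes can be sketched as follows; `terraform` is mocked here (pretending the fresh plan found changes) so the exit-code handling is runnable without real infrastructure.

```shell
#!/usr/bin/env bash
# Sketch of drift-aware apply using -detailed-exitcode:
#   0 = no changes, 1 = error, 2 = changes present.
# The terraform function is a mock; drop it in the real workflow step.
terraform() {
  case "$1" in
    plan)  return 2 ;;            # pretend the fresh plan detected changes
    apply) echo "applied $3" ;;   # $3 is the saved plan file
  esac
}

rc=0
terraform plan -input=false -detailed-exitcode -out=tfplan || rc=$?
case "$rc" in
  0) echo "no drift; nothing to apply" ;;
  2) terraform apply -input=false tfplan ;;   # apply exactly what was planned
  *) echo "plan failed (rc=$rc)" >&2; exit "$rc" ;;
esac
```

Applying the saved `tfplan` file (rather than re-running `apply` with `-auto-approve` alone) guarantees the apply matches the freshly reviewed plan.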

Comment on lines +9 to +17
const rootNodes = REDIS_CLUSTER_NODES.split(",").map((n) => {
  const [host, port] = n.trim().split(":");
  return {
    socket: {
      host,
      port: parseInt(port || "6379"),
    },
  };
});

⚠️ Potential issue | 🟠 Major

Reject blank or malformed cluster nodes before createCluster().

A trailing comma or host:not-a-port currently produces a node with an empty host or NaN port, which turns a simple env typo into a runtime connection failure. Filter empty entries and validate the parsed port before building rootNodes.

Suggested fix
-    const rootNodes = REDIS_CLUSTER_NODES.split(",").map((n) => {
-      const [host, port] = n.trim().split(":");
-      return {
-        socket: {
-          host,
-          port: parseInt(port || "6379"),
-        },
-      };
-    });
+    const rootNodes = REDIS_CLUSTER_NODES.split(",")
+      .map((n) => n.trim())
+      .filter(Boolean)
+      .map((n) => {
+        const [host, portText] = n.split(":");
+        const port = Number(portText || "6379");
+        if (!host || !Number.isInteger(port) || port <= 0) {
+          throw new Error(`Invalid REDIS_CLUSTER_NODES entry: "${n}"`);
+        }
+        return {
+          socket: { host, port },
+        };
+      });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/redis.ts` around lines 9 - 17, When building
rootNodes from REDIS_CLUSTER_NODES before calling createCluster(), reject blank
or malformed entries: split on commas, filter out entries that are empty after
trim, then for each segment parse host and port and validate that host is
non-empty and port parses to a finite integer within 1–65535 (or use default
6379 only if port is absent). If validation fails for any node, surface a clear
error (throw or log + exit) instead of producing a node with empty host or NaN
port; then map the validated entries into the same socket shape used by
rootNodes. Ensure the check is applied where REDIS_CLUSTER_NODES and rootNodes
are defined so createCluster() only receives valid nodes.

Comment on lines +19 to +21
ALERT_WINDOW="${ALERT_WINDOW:-120}" # seconds to wait for alert to fire
RECOVERY_WINDOW="${RECOVERY_WINDOW:-300}" # seconds to wait for lag to drop
STRICT_ALERT_CHECK="${STRICT_ALERT_CHECK:-0}"

⚠️ Potential issue | 🟠 Major

The alert deadline is too short for a real firing alert.

ProjectionLagHigh has for: 2m, and this countdown starts immediately after kubectl scale --replicas=0 returns, before the rollout finishes and before lag exceeds 5000. With STRICT_ALERT_CHECK=1 in the workflow, this is prone to false CI failures. The window needs slack for shutdown, lag buildup, and at least one evaluation interval.

Also applies to: 64-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/projection-lag.sh` around lines 19 - 21, Increase the alert
timing windows to avoid false failures: update ALERT_WINDOW and RECOVERY_WINDOW
(and any hardcoded checks in the same file range ~lines 64-85) to values that
account for pod shutdown, rollout completion, and lag accumulation before the
`ProjectionLagHigh` alert's `for: 2m` expires; specifically enlarge ALERT_WINDOW
from 120s to a value >= (shutdown + rollout + 2m + one evaluation interval) and
extend RECOVERY_WINDOW similarly, and ensure STRICT_ALERT_CHECK logic still
gates strict verification but uses these larger windows so CI doesn't fail
prematurely.

Comment on lines +31 to +36
current_lag() {
  kubectl exec -n "$NAMESPACE" deploy/kafka -- \
    kafka-consumer-groups.sh \
      --bootstrap-server "$KAFKA_BOOTSTRAP" \
      --describe --group "$CONSUMER_GROUP" 2>/dev/null \
    | awk 'NR>1 && $NF~/[0-9]+/ { sum += $NF } END { print sum+0 }'

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

kubectl exec -n "${NAMESPACE:-grainguard-dev}" deploy/kafka -- \
  kafka-consumer-groups.sh \
    --bootstrap-server "${KAFKA_BOOTSTRAP:-kafka:9092}" \
    --describe --group "${CONSUMER_GROUP:-read-model-builder}" 2>/dev/null \
| awk '
  NR==1 { for (i = 1; i <= NF; i++) print i ":" $i; next }
  NR==2 { print "lag-column=" $5 ", last-field=" $NF }
'

Repository: pahuldeepp/GrainGuard-

Length of output: 50


🏁 Script executed:

cat -n tests/chaos/projection-lag.sh

Repository: pahuldeepp/GrainGuard-

Length of output: 4743


🏁 Script executed:

# Search for any Kafka consumer group command usage in the codebase
rg "kafka-consumer-groups" -A 3 -B 1

Repository: pahuldeepp/GrainGuard-

Length of output: 9797


🏁 Script executed:

# Check if there's any documentation about Kafka command output format
fd -t f -e md -e txt -e rst | xargs rg "kafka-consumer-groups|consumer.*lag" 2>/dev/null || true

Repository: pahuldeepp/GrainGuard-

Length of output: 2297


Fix column parsing in current_lag() — using $NF returns CLIENT-ID, not LAG.

kafka-consumer-groups.sh --describe output has columns: GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG (col 6), CONSUMER-ID, HOST, CLIENT-ID. The current awk uses $NF (last column, CLIENT-ID), which is empty or non-numeric. This makes current_lag() always return 0, invalidating both the pre-check and recovery assertions.

Update the awk to sum column 6: awk 'NR>1 { sum += $6 } END { print sum+0 }'

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/projection-lag.sh` around lines 31 - 36, The current_lag()
function is summing the wrong awk field (using $NF which is CLIENT-ID); update
the awk expression in current_lag() to sum the LAG column (column 6) instead:
replace the awk script with one that uses $6 (e.g., 'NR>1 { sum += $6 } END {
print sum+0 }') so the function returns the actual consumer lag used by the
pre-check and recovery assertions.
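A runnable sketch of the corrected parsing, summing field 6 (LAG): the sample table below stands in for the real `kubectl exec ... kafka-consumer-groups.sh --describe` output so the awk logic can be exercised directly.

```shell
# Sum the LAG column (field 6) of kafka-consumer-groups.sh --describe output.
sum_lag() {
  awk 'NR>1 { sum += $6 } END { print sum+0 }'
}

# Sample output with the standard column layout (LAG is column 6).
sample='GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
rmb telemetry.raw 0 100 150 50 c1 h1 cid1
rmb telemetry.raw 1 200 230 30 c2 h2 cid2'

printf '%s\n' "$sample" | sum_lag   # prints 80
```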

Comment on lines +39 to +42
alert_firing() {
  # Returns 0 (true) if ProjectionLagHigh alert is active in Alertmanager
  curl -s "${PROMETHEUS_URL}/api/v1/alerts" 2>/dev/null \
    | grep -q '"alertname":"ProjectionLagHigh"'

⚠️ Potential issue | 🟠 Major

Scope the alert check to this experiment.

This grep passes as soon as any ProjectionLagHigh alert exists in Prometheus. The rule in infra/monitoring/prometheus/slo-rules.yaml:181-193 also covers cdc-transformer, so unrelated lag there can make this script green even if read-model-builder never triggered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/projection-lag.sh` around lines 39 - 42, The alert_firing()
helper currently returns true for any ProjectionLagHigh alert; update it to
scope to the read-model-builder instance by verifying the alert also contains
the label for this experiment (e.g., the job/instance/service label used in the
Prometheus rule). Modify alert_firing to fetch ${PROMETHEUS_URL}/api/v1/alerts
and check both '"alertname":"ProjectionLagHigh"' and the experiment-specific
label (for example '"job":"read-model-builder"' or the actual label name used in
slo-rules.yaml) — use a combined check (jq or two greps with &&) so only
ProjectionLagHigh alerts for read-model-builder make the function return
success.
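The scoped check can be sketched with the two-greps-with-`&&` variant the comment mentions. The label name `job` and value `read-model-builder` are assumptions; use whatever label slo-rules.yaml actually attaches. (A per-alert `jq` filter would be stricter, since two greps on the whole body could match labels from different alerts.)

```shell
# Return 0 only if a ProjectionLagHigh alert with the read-model-builder label
# appears in the alerts payload. The payload is passed in so this is testable
# without a live Prometheus; the real script would pass
# "$(curl -s "${PROMETHEUS_URL}/api/v1/alerts")".
alert_firing_scoped() {
  local body="$1"
  printf '%s' "$body" | grep -q '"alertname":"ProjectionLagHigh"' \
    && printf '%s' "$body" | grep -q '"job":"read-model-builder"'
}
```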

Comment on lines +27 to +41
http_check() {
  local label="$1"
  local status
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST "$GATEWAY_URL/graphql" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${TEST_JWT:-dummy-jwt}" \
    -d "$GRAPHQL_QUERY" 2>/dev/null || echo "000")
  if [[ "$status" == "200" ]]; then
    log "  ✓ $label — HTTP $status"
  else
    warn "  ✗ $label — HTTP $status"
    return 1
  fi
}

⚠️ Potential issue | 🟠 Major

Don't treat bare HTTP 200 as GraphQL success.

GraphQL can return HTTP 200 with an errors payload and no usable data. Because http_check() discards the body, this script can report "DB fallback works" even when the query failed. Assert on the JSON payload as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/redis-outage.sh` around lines 27 - 41, The http_check function
currently treats any HTTP 200 as success but discards the response body; update
http_check (the function using GATEWAY_URL, GRAPHQL_QUERY and TEST_JWT) to
capture the response body and HTTP status, then parse the JSON to assert
GraphQL-level success (e.g., ensure no "errors" field and that "data" is
present/non-null) before reporting success; if JSON parsing fails or errors
exist, log the body and return failure so the script won't falsely report a
successful GraphQL query.
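The GraphQL-level assertion can be factored into a helper like the illustrative `graphql_ok` below, which `http_check` would call on the captured body after the status check. This grep-based version is a simplification — a body of `"data": null` with no `"errors"` key would still slip through; `jq -e '.data != null and (.errors | not)'` is stricter where jq is available.

```shell
# Illustrative GraphQL success check on a captured response body:
# require a "data" field and reject any "errors" field.
graphql_ok() {
  local body="$1"
  printf '%s' "$body" | grep -q '"data"' \
    && ! printf '%s' "$body" | grep -q '"errors"'
}
```

In `http_check`, the body would be captured alongside the status (e.g. `body=$(curl -s ... )` with `-w "%{http_code}"` written separately), and the function would log the body and return 1 when `graphql_ok "$body"` fails.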

Comment on lines +117 to +122
# Warm-up check: second request should be cache-hit (fast)
t_start=$(date +%s%N)
http_check "Cache warm-up probe"
t_end=$(date +%s%N)
elapsed_ms=$(( (t_end - t_start) / 1000000 ))
log " Response time after restore: ${elapsed_ms}ms"

⚠️ Potential issue | 🟠 Major

The cache warm-up SLO is only logged, never asserted.

This probe records elapsed_ms but never fails if warm-up takes longer than the advertised 30s. As written, the experiment still passes when the post-restore cache is cold minutes later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chaos/redis-outage.sh` around lines 117 - 122, The warm-up probe
measures elapsed_ms but never asserts it meets the 30s SLO; after calling
http_check "Cache warm-up probe" and computing elapsed_ms, add an assertion that
elapsed_ms is <= 30000 (30,000 ms) and fail the script if it exceeds the SLO
(e.g., call an existing fail function or exit 1). Update the block around
t_start, http_check, t_end, elapsed_ms, and log to perform the check and produce
a clear error message including elapsed_ms when the SLO is violated.
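The missing assertion can be sketched as below; `WARMUP_SLO_MS` and `assert_warmup_slo` are illustrative names, with the default set to the advertised 30s SLO.

```shell
# Fail (return 1) when the warm-up probe exceeds the SLO, with a clear
# message that includes the measured elapsed_ms.
WARMUP_SLO_MS="${WARMUP_SLO_MS:-30000}"
assert_warmup_slo() {
  local elapsed_ms="$1"
  if (( elapsed_ms > WARMUP_SLO_MS )); then
    echo "FAIL: cache warm-up took ${elapsed_ms}ms (SLO: ${WARMUP_SLO_MS}ms)" >&2
    return 1
  fi
  echo "cache warm-up within SLO (${elapsed_ms}ms)"
}
```

The script would call `assert_warmup_slo "$elapsed_ms" || exit 1` (or its existing fail helper) right after logging the response time.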

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/security.yml:
- Around line 77-86: The workflow step "Scan filesystem for secrets and
misconfigs" uses aquasecurity/trivy-action@0.28.0 with exit-code: "0", which
silences failures; update that step to fail the job on detected issues by
removing the forced "0" or setting exit-code to a non-zero value (e.g., "1") or
enabling the action's built-in fail-on-severity behavior so that the specified
severity list (CRITICAL,HIGH) will cause the job to fail; locate the step by its
name or the uses line and modify the exit-code/config to ensure secrets or
critical misconfigs block the workflow.
- Around line 52-57: The current "Ensure Trivy SARIF exists" step
unconditionally writes an empty SARIF file which masks real scan failures when
combined with exit-code: "0"; update this step to check the previous Trivy step
result and, if that step failed, generate a SARIF containing a
toolExecutionNotification/error entry (instead of an empty runs array) or fail
the job so failures are visible; specifically modify the logic that creates
"trivy-${{ matrix.service }}.sarif" to inspect the prior step outcome (or the
Trivy exit code) and write a minimal SARIF with an error notification
referencing Trivy, or exit non-zero to avoid silently uploading an empty SARIF.
- Around line 41-50: Update the Trivy action and tighten failure behavior:
change the action reference aquasecurity/trivy-action@0.28.0 to
aquasecurity/trivy-action@0.35.0 and set the workflow input exit-code from "0"
to "1" so CRITICAL/HIGH findings fail the job; after updating, rotate any CI/CD
secrets that may have been exposed by runs using the compromised versions
(0.0.1–0.34.2) to fully mitigate the supply-chain risk.
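The error-notification SARIF fallback described above can be sketched as follows; `write_error_sarif` and the message text are assumptions, and the driver block is deliberately minimal.

```shell
# When the Trivy step failed, write a SARIF carrying an error notification
# (executionSuccessful: false + toolExecutionNotifications) instead of a
# silent empty runs array.
write_error_sarif() {
  local out="$1" msg="$2"
  cat > "$out" <<EOF
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": { "name": "Trivy" } },
      "invocations": [
        {
          "executionSuccessful": false,
          "toolExecutionNotifications": [
            { "level": "error", "message": { "text": "$msg" } }
          ]
        }
      ],
      "results": []
    }
  ]
}
EOF
}

write_error_sarif trivy-example.sarif "Trivy scan step failed; see job logs"
```

The workflow step would branch on the prior step's outcome (e.g. `steps.trivy.outcome == 'failure'`) before choosing this fallback over the empty-runs one.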
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3256396b-46ae-4c41-9c3e-b69f18ecbbe6

📥 Commits

Reviewing files that changed from the base of the PR and between 8a97319 and f53b589.

📒 Files selected for processing (3)
  • .github/workflows/cd.yml
  • .github/workflows/ci.yml
  • .github/workflows/security.yml

Comment on lines 41 to 50
      - name: Run Trivy vulnerability scan
-       uses: aquasecurity/trivy-action@master
+       uses: aquasecurity/trivy-action@0.28.0
        with:
          image-ref: grainguard/${{ matrix.service }}:${{ github.sha }}
          format: sarif
          output: trivy-${{ matrix.service }}.sarif
          severity: CRITICAL,HIGH
          ignore-unfixed: true
          exit-code: "0"
          trivy-config: infra/security/trivy.yaml

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

aquasecurity trivy-action 0.28.0 release

💡 Result:

The aquasecurity/trivy-action v0.28.0 was released around late 2024. It includes updates such as bumping the bundled setup-trivy dependency to v0.2.1 (merged October 15, 2024 via PR #411). This version of Trivy Action was referenced in external repositories' dependency updates (e.g., PRs bumping from 0.27.0 to 0.28.0 in October 2024) and appeared in README examples prior to updates in early 2025 (PR #447 in January 2025 changing Trivy version from 0.28.0 to 0.29.0). No specific release notes or changelog for v0.28.0 were found on the official releases page, which highlights more recent versions up to v0.35.0 (March 2026). The repository has 74 releases total, confirming v0.28.0 exists as part of the sequence. Note: Due to a recent supply chain compromise affecting many tags (March 2026), avoid using version tags; pin to commit SHA 57a97c7e7821a5776cebc9bb87c984fa69cba8f1 or v0.35.0 instead.

Citations:


🌐 Web query:

trivy-action exit-code behavior CRITICAL HIGH vulnerabilities

💡 Result:

The trivy-action's exit-code behavior for CRITICAL and HIGH vulnerabilities works as follows: By default, Trivy exits with code 0 even if vulnerabilities are found. Set exit-code: '1' in the action inputs to exit with a non-zero code (1) when vulnerabilities at or above the specified severity threshold are detected. Combine with severity: 'CRITICAL,HIGH' to fail the workflow only on CRITICAL or HIGH vulnerabilities. Example GitHub Actions workflow step: - name: Run Trivy vulnerability scanner uses: aquasecurity/trivy-action@master with: image-ref: 'docker.io/my-organization/my-app:${{ github.sha }}' format: 'table' severity: 'CRITICAL,HIGH' exit-code: '1' ignore-unfixed: true vuln-type: 'os,library' This scans the image and fails (exit code 1) if CRITICAL or HIGH vulnerabilities are found, causing the GitHub Actions job to fail. Known issues (mostly fixed in recent versions like v0.35.0): - Older versions (e.g., v0.29.0) had bugs where severity filtering was not respected during scanning (debug logs showed all severities parsed despite config). - SARIF format historically output all vulnerabilities regardless of severity, potentially triggering exit-code incorrectly unless limit-severities-for-sarif: true was set (strips report). - Environment variable leakage between multiple trivy-action calls in the same job (fixed by PR #454, using temp file instead of GITHUB_ENV). Use the latest version (v0.35.0 as of 2026-03-20) and pin to a tag like @v0.35.0 for stability. For SARIF uploads to GitHub Code Scanning even on failure, use if: always on the upload step. Trivy CLI underlying behavior (passed via action): --exit-code 1 combined with --severity CRITICAL,HIGH exits 1 only if matching severities found; 0 otherwise.

Citations:


🌐 Web query:

aquasecurity trivy-action supply chain compromise March 2026 affected versions

💡 Result:

In March 2026, specifically on March 19, Aqua Security's trivy-action GitHub Action was compromised in a supply chain attack (GHSA-69fq-xp46-6x23, CVE-2026-33634). Attackers with compromised credentials force-pushed 76 out of 77 version tags (versions <0.35.0, specifically 0.0.1 to 0.34.2) to malicious commits containing credential-stealing malware. The attack lasted approximately 12 hours (2026-03-19 ~17:43 UTC to 2026-03-20 ~05:40 UTC). This was part of a broader incident also affecting setup-trivy (<0.2.6) and Trivy binary v0.69.4. Affected trivy-action versions: All tags except v0.35.0 (0.0.1–0.34.2). Safe/patched version: v0.35.0 (protected by GitHub's immutable releases feature). Users are affected if they pinned to affected tags or SHAs prior to remediation during the exposure window. Recommendation: Update to v0.35.0 or later, pin to full commit SHA, rotate all CI/CD secrets from affected runs.

Update trivy-action to v0.35.0 and change exit-code to "1" to prevent both undetected vulnerabilities and supply chain compromise.

The exit-code: "0" allows CRITICAL and HIGH severity vulnerabilities to be reported to the Security tab but not fail the workflow, enabling vulnerable code to merge undetected. Additionally, version 0.28.0 is affected by a March 2026 supply chain attack (GHSA-69fq-xp46-6x23) where credential-stealing malware was injected into versions 0.0.1–0.34.2.

Update to the safe version v0.35.0 and set exit-code: "1" to fail the workflow on findings. Also rotate any CI/CD secrets from workflow runs using the compromised version.

🔒 Proposed fix
      - name: Run Trivy vulnerability scan
-       uses: aquasecurity/trivy-action@0.28.0
+       uses: aquasecurity/trivy-action@0.35.0
        with:
          image-ref: grainguard/${{ matrix.service }}:${{ github.sha }}
          format: sarif
          output: trivy-${{ matrix.service }}.sarif
          severity: CRITICAL,HIGH
          ignore-unfixed: true
-         exit-code: "0"
+         exit-code: "1"
          trivy-config: infra/security/trivy.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/security.yml around lines 41 - 50, update the Trivy action
and tighten failure behavior: change the action reference
aquasecurity/trivy-action@0.28.0 to aquasecurity/trivy-action@0.35.0 and set the
workflow input exit-code from "0" to "1" so CRITICAL/HIGH findings fail the job;
after updating, rotate any CI/CD secrets that may have been exposed by runs
using the compromised versions (0.0.1–0.34.2) to fully mitigate the supply-chain
risk.

Comment on lines +52 to +57
      - name: Ensure Trivy SARIF exists
        if: always()
        run: |
          if [ ! -f "trivy-${{ matrix.service }}.sarif" ]; then
            printf '{"version":"2.1.0","runs":[]}\n' > "trivy-${{ matrix.service }}.sarif"
          fi

🧹 Nitpick | 🔵 Trivial

Defensive fallback is reasonable, but masks scan failures when combined with exit-code: "0".

The empty SARIF fallback ensures the upload step doesn't fail on missing output. This is acceptable as defensive coding, but note that it further obscures scan issues when paired with exit-code: "0".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/security.yml around lines 52 - 57, The current "Ensure
Trivy SARIF exists" step unconditionally writes an empty SARIF file which masks
real scan failures when combined with exit-code: "0"; update this step to check
the previous Trivy step result and, if that step failed, generate a SARIF
containing a toolExecutionNotification/error entry (instead of an empty runs
array) or fail the job so failures are visible; specifically modify the logic
that creates "trivy-${{ matrix.service }}.sarif" to inspect the prior step
outcome (or the Trivy exit code) and write a minimal SARIF with an error
notification referencing Trivy, or exit non-zero to avoid silently uploading an
empty SARIF.
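The fallback the prompt describes can be sketched in plain bash. The `TRIVY_OUTCOME` variable and the SARIF filename here are illustrative; in a real workflow the outcome would come from `steps.<id>.outcome` on the Trivy step:

```shell
#!/usr/bin/env bash
set -euo pipefail

TRIVY_OUTCOME="${TRIVY_OUTCOME:-failure}"   # stand-in for steps.<id>.outcome
SARIF="trivy-example.sarif"

if [ ! -f "$SARIF" ]; then
  if [ "$TRIVY_OUTCOME" = "success" ]; then
    # Scan ran cleanly but produced no file: an empty report is honest here.
    printf '{"version":"2.1.0","runs":[]}\n' > "$SARIF"
  else
    # Scan failed: record an execution-error notification so Code Scanning
    # surfaces the failure instead of showing an empty (green-looking) report.
    cat > "$SARIF" <<'EOF'
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": { "name": "Trivy" } },
      "invocations": [
        {
          "executionSuccessful": false,
          "toolExecutionNotifications": [
            { "level": "error",
              "message": { "text": "Trivy scan step failed; no results produced." } }
          ]
        }
      ],
      "results": []
    }
  ]
}
EOF
  fi
fi
```

This keeps the upload step from breaking on a missing file while still making a failed scan visible in the Security tab.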

Comment on lines 77 to 86
      - name: Scan filesystem for secrets and misconfigs
-       uses: aquasecurity/trivy-action@master
+       uses: aquasecurity/trivy-action@0.28.0
        with:
          scan-type: fs
          scan-ref: .
          format: sarif
          output: trivy-fs.sarif
          severity: CRITICAL,HIGH
          exit-code: "0"
          trivy-config: infra/security/trivy.yaml
⚠️ Potential issue | 🟠 Major

Same exit-code: "0" concern applies here.

Filesystem scans for secrets and misconfigs with exit-code: "0" means leaked secrets or critical misconfigurations won't block the workflow. This is particularly concerning for secrets scanning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/security.yml around lines 77 - 86, The workflow step "Scan
filesystem for secrets and misconfigs" uses aquasecurity/trivy-action@0.28.0
with exit-code: "0", which silences failures; update that step to fail the job
on detected issues by removing the forced "0" or setting exit-code to a non-zero
value (e.g., "1") or enabling the action's built-in fail-on-severity behavior so
that the specified severity list (CRITICAL,HIGH) will cause the job to fail;
locate the step by its name or the uses line and modify the exit-code/config to
ensure secrets or critical misconfigs block the workflow.
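A corrected version of that step would look like the sketch below. This is an assumption-laden fragment, not a verified patch: the indentation, job context, and `trivy-config` path are taken from the snippet above, and v0.35.0 is the post-incident safe tag:

```yaml
      - name: Scan filesystem for secrets and misconfigs
        uses: aquasecurity/trivy-action@0.35.0     # safe tag after the March 2026 compromise
        with:
          scan-type: fs
          scan-ref: .
          format: sarif
          output: trivy-fs.sarif
          severity: CRITICAL,HIGH
          exit-code: "1"                            # fail the job on CRITICAL/HIGH findings
          trivy-config: infra/security/trivy.yaml
```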

@pahuldeepp
Owner Author

Closing — this Claude branch is superseded by PRs #9, #10, and #11 which completed all the planned fixes.

@pahuldeepp pahuldeepp closed this Mar 28, 2026
