Skip to content

improvement(metrics): emit hosted-key metrics to CloudWatch instead of OTel#4914

Merged
TheodoreSpeaks merged 2 commits into
stagingfrom
feat/hosted-key-cloudwatch-metrics
Jun 9, 2026
Merged

improvement(metrics): emit hosted-key metrics to CloudWatch instead of OTel#4914
TheodoreSpeaks merged 2 commits into
stagingfrom
feat/hosted-key-cloudwatch-metrics

Conversation

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator

Summary

  • Replace the OTel/Prometheus backend for hosted-key metrics with CloudWatch (Sim/HostedKey namespace) — same hostedKeyMetrics facade, so no call-site changes in tools/index.ts / lib/core/telemetry.ts.
  • Fixes broken aggregation from ephemeral trigger.dev workers: CloudWatch aggregates pushed values server-side, so one-shot worker processes no longer collide/clobber the way cumulative Prometheus counters did (no per-process series collisions, no reset math).
  • Batched async PutMetricData (5s flush + flush on SIGTERM/SIGINT/beforeExit), low-cardinality dimensions (Environment, Provider, Tool, Key, Reason); no-op without AWS creds so it stays safe locally. Per-workspace/user cost stays in usage_log.

Type of Change

  • Improvement

Testing

Tested manually. Biome clean on the changed file; check:api-validation:strict passed (boundary audit passed).

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 9, 2026 12:16am

Request Review

@cursor

cursor Bot commented Jun 9, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes production observability and billing-adjacent cost/throttle signals; failures are swallowed so dashboards may gap, but request paths stay safe and AWS creds gate emission.

Overview
Replaces the OpenTelemetry meter backend for hosted-key metrics with direct CloudWatch PutMetricData under the Sim/HostedKey namespace, while keeping the same hostedKeyMetrics API so telemetry.ts and tool call sites are unchanged.

The new implementation buffers metric points in-process and flushes them asynchronously (5s interval, 1000-point threshold, and awaited drains on SIGTERM/SIGINT/beforeExit). It is a no-op when AWS_ACCESS_KEY_ID is missing (local dev). Dimensions are renamed to CloudWatch-style PascalCase (Provider, Tool, Key, Reason, plus Environment). Queue wait moves from an OTel histogram to individual QueueWaitDuration millisecond datapoints. A 10k buffer cap drops oldest points if flushing stalls; failed batches are logged and dropped so telemetry never blocks requests. flushHostedKeyMetrics() is exported for tests and explicit drains.

Reviewed by Cursor Bugbot for commit bbe5f42. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/lib/monitoring/metrics.ts
Comment thread apps/sim/lib/monitoring/metrics.ts Outdated
Comment thread apps/sim/lib/monitoring/metrics.ts Outdated
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces the OTel/Prometheus backend for hosted-key metrics with a CloudWatch push-based buffer (Sim/HostedKey namespace), keeping the hostedKeyMetrics facade unchanged so no call-site edits are needed. The motivation is sound: CloudWatch server-side aggregation eliminates the per-process counter-reset problem that broke cumulative Prometheus metrics in ephemeral trigger.dev workers.

  • The buffer/batch/flush design (5 s timer, FLUSH_THRESHOLD-triggered drain, beforeExit/SIGTERM/SIGINT exit handlers, MAX_BUFFER hard cap with oldest-drop) is well-structured; the synchronous buffer swap in flushHostedKeyMetrics prevents double-sends.
  • QueueWaitDuration was previously a histogram; individual MetricDatum pushes only give CloudWatch Sum/Avg/Min/Max, so percentile-based SLOs or alerts that relied on the histogram distribution will silently degrade to average-only data.
  • ENABLED is gated solely on AWS_ACCESS_KEY_ID being present, which returns false under IAM instance-profile or task-role auth; any future migration from static keys to role-based credentials would silently disable all metrics.

Confidence Score: 5/5

Safe to merge — the change is well-scoped, call sites are untouched, and the buffering logic correctly prevents data loss under normal operation.

The architectural switch from OTel to CloudWatch is sound and the implementation handles the core concerns (batching, exit flushing, memory bounding, no-op without creds). The two flagged items are observational concerns: one around histogram precision being lost for QueueWaitDuration, and one around the ENABLED guard not covering IAM role credentials. Neither causes incorrect behavior today given the documented deployment model with static AWS creds.

apps/sim/lib/monitoring/metrics.ts — specifically the ENABLED guard and QueueWaitDuration metric type.

Important Files Changed

Filename Overview
apps/sim/lib/monitoring/metrics.ts Replaces OTel/Prometheus counters and histograms with a CloudWatch PutMetricData push-based buffer; core buffering/batching/exit-flush logic is sound, but QueueWaitDuration loses histogram percentile fidelity and ENABLED guard silently no-ops under IAM role auth.

Sequence Diagram

sequenceDiagram
    participant Caller as Caller
    participant Enqueue as enqueue
    participant Buffer as Buffer
    participant Flush as flushHostedKeyMetrics
    participant CW as CloudWatch

    Caller->>Enqueue: recordUsed / recordFailed / etc.
    Enqueue->>Buffer: push MetricDatum
    Enqueue->>Enqueue: ensureBackground registers 5s timer and exit handlers
    alt buffer reaches FLUSH_THRESHOLD 1000
        Enqueue-->>Flush: fire-and-forget flush
    end

    Flush->>Buffer: swap pending buffer to empty
    Flush->>CW: PutMetricDataCommand batched
    alt success
        CW-->>Flush: OK
    else error
        Flush->>Flush: log warn and drop batch
    end

    Note over Enqueue,Flush: SIGTERM or SIGINT or beforeExit triggers await flush
Loading

Reviews (2): Last reviewed commit: "fix(metrics): await metric flush on shut..." | Re-trigger Greptile

Comment thread apps/sim/lib/monitoring/metrics.ts Outdated
Comment thread apps/sim/lib/monitoring/metrics.ts Outdated
Comment thread apps/sim/lib/monitoring/metrics.ts
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

Addressed the Bugbot findings in bbe5f4265:

  1. Shutdown handlers don't await flush — the SIGTERM/SIGINT/beforeExit handler is now async and awaits flushHostedKeyMetrics(), so the final batch drains before exit. This is also what makes ephemeral trigger.dev workers reliably flush on teardown.
  2. Buffer grows without cap — added a hard MAX_BUFFER (10k) cap separate from the flush threshold (1k). If flushing stalls, the oldest points are dropped to bound memory, and the dropped count is logged on the next flush.
  3. Worker flush never invoked — flushing is automatic via the timer + the now-awaited exit handlers (which call the exported flushHostedKeyMetrics()), so ephemeral workers flush on beforeExit/SIGTERM without needing an explicit per-run call. Updated the module doc to describe this accurately; the export remains for explicit/early draining (e.g. tests).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit bbe5f42. Configure here.

Comment thread apps/sim/lib/monitoring/metrics.ts
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

@TheodoreSpeaks TheodoreSpeaks merged commit 7434df9 into staging Jun 9, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the feat/hosted-key-cloudwatch-metrics branch June 9, 2026 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant