feat(metrics): emit hosted-key metrics to Grafana via OTel by TheodoreSpeaks · Pull Request #4885 · simstudioai/sim

TheodoreSpeaks · 2026-06-04T20:12:34Z

Summary

Add OTel metrics for hosted-key usage, cost, failures, throttles, and queue waits — replacing the platform.hosted_key.* spans that were silently dropped by the trace sampler
Wire a MeterProvider into the Next.js app's OTel SDK (metrics share the trace OTLP endpoint); trigger.dev already exports metrics, so both runtimes now emit them
Metrics: hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, hosted_key.queue_wait_exceeded
Per-key attribution via a key label (env var name, never the secret) so a failing/throttled key is identifiable; failure rate = failed / (failed + used)
Low-cardinality labels only (provider, tool, reason, key); per-workspace/user cost stays in usage_log
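
The failure-rate formula above, failed / (failed + used), can be illustrated with a small sketch. The `KeyCounts` shape, the `failureRate` helper, and the sample numbers are hypothetical and only stand in for per-key counter totals as they might be read back from a dashboard query; they are not part of this PR.

```typescript
// Hypothetical snapshot of two counter totals for a single `key` label value.
interface KeyCounts {
  used: number; // total of hosted_key.used
  failed: number; // total of hosted_key.failed
}

// failure rate = failed / (failed + used); defined as 0 when the key saw no traffic.
function failureRate(counts: KeyCounts): number {
  const attempts = counts.failed + counts.used;
  return attempts === 0 ? 0 : counts.failed / attempts;
}

// Made-up totals for two keys (identified by env var name, never by secret).
const busyKeyRate = failureRate({ used: 90, failed: 10 });
const idleKeyRate = failureRate({ used: 0, failed: 0 });
```

Because `used` counts only successes, the denominator is the number of completed attempts, so the ratio stays between 0 and 1.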

Type of Change

New feature

Testing

Verified end-to-end against Grafana Cloud (grafanacloud-prom): drove exa tools and confirmed used/cost_charged on success, and failed{reason=rate_limited} + upstream_rate_limited + throttled on a forced 429, failed{reason=other} on a 4xx. Type-check and biome clean; check:api-validation:strict passes.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replace the dropped platform.hosted_key.* spans with OTel counters/histograms for usage, cost, failures, throttles, and queue waits. Wire a MeterProvider into the Next.js OTel SDK (trigger.dev already exports metrics). Per-key attribution via a key label (env var name).

vercel · 2026-06-04T20:12:41Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Jun 4, 2026 9:45pm

cursor · 2026-06-04T20:12:44Z

PR Summary

Medium Risk
Touches the hosted-key execution and billing-adjacent telemetry path; changes are additive observability but mis-labeled metrics could skew operational dashboards.

Overview
Hosted-key observability moves from sampled trace events to unsampled OpenTelemetry metrics, so usage, cost, throttles, and queue behavior show up reliably in Grafana instead of disappearing when traces are sampled.

The Next.js OTel bootstrap now exports metrics on the same OTLP endpoint as traces (normalizeOtlpMetricsUrl + PeriodicExportingMetricReader at 60s). A new hostedKeyMetrics module defines low-cardinality counters/histograms (hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, etc.) with a key label for the env var name only.

PlatformEvents hosted-key helpers now call these recorders instead of trackPlatformEvent spans. executeTool records success (used + cost_charged), non-success paths (failed with rate_limited / auth / other), and updates the key label after re-acquire. Adds @opentelemetry/sdk-metrics.

Reviewed by Cursor Bugbot for commit e8dae53. Bugbot is set up for automated code reviews on this repo.

greptile-apps · 2026-06-04T20:18:52Z

Greptile Summary

This PR replaces span-based platform.hosted_key.* events (which were silently dropped by the trace sampler) with proper OTel metrics, wiring a PeriodicExportingMetricReader into the Next.js NodeSDK and emitting counters/histograms for usage, cost, failures, throttles, and queue waits.

apps/sim/lib/monitoring/metrics.ts — new file defining lazy-initialized, no-op-safe OTel instruments (hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, hosted_key.queue_wait_exceeded) with low-cardinality labels (provider, tool, reason, key env-var name).
apps/sim/instrumentation-node.ts — adds PeriodicExportingMetricReader (60 s flush) sharing the existing OTLP endpoint/headers; normalizeOtlpMetricsUrl swaps the /v1/traces suffix for /v1/metrics.
apps/sim/tools/index.ts — threads envVarName through applyHostedKeyCostToResult, adds classifyHostedKeyFailure, and records used/cost_charged on success and failed on both throwing and non-throwing failure paths; extends the retry-requeue path to update metric labels when a fresh key is acquired.
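
The lazy-initialized instrument pattern described for metrics.ts might look roughly like the sketch below. To keep it runnable without dependencies, it uses stand-in `SketchMeter` and `SketchCounter` interfaces and an in-memory `demoMeter`; these, along with `recordUsedSketch` and the sample label values, are assumptions for illustration. The real module presumably obtains its meter from @opentelemetry/api, which hands back no-op instruments when no MeterProvider is registered.

```typescript
// Stand-ins for the small slice of a metrics API this sketch needs (hypothetical names).
interface SketchCounter {
  add(value: number, attributes?: Record<string, string>): void;
}
interface SketchMeter {
  createCounter(name: string): SketchCounter;
}
interface Sample {
  name: string;
  value: number;
  attributes: Record<string, string>;
}

// In-memory meter so the example runs without a collector.
const recorded: Sample[] = [];
let createCalls = 0;
const demoMeter: SketchMeter = {
  createCounter(name: string): SketchCounter {
    createCalls += 1;
    return {
      add(value: number, attributes: Record<string, string> = {}): void {
        recorded.push({ name, value, attributes });
      },
    };
  },
};

// Module-level singleton, created on first use rather than at import time.
let usedCounter: SketchCounter | undefined;
function getUsedCounter(): SketchCounter {
  if (!usedCounter) usedCounter = demoMeter.createCounter("hosted_key.used");
  return usedCounter;
}

// Only low-cardinality labels; `key` carries the env var NAME, never the secret value.
function recordUsedSketch(provider: string, tool: string, envVarName: string): void {
  getUsedCounter().add(1, { provider, tool, key: envVarName });
}

recordUsedSketch("exa", "exa_search", "EXA_API_KEY_1");
recordUsedSketch("exa", "exa_search", "EXA_API_KEY_1");
```

Creating the instrument on first use means importing the module has no side effects, and repeated calls reuse the same counter.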

Confidence Score: 4/5

Safe to merge; the new metric plumbing is correct for all common paths and the OTel no-op fallback ensures nothing breaks without a provider configured.

The metric recording logic is sound across the normal success, direct-failure, and retry-exhaustion paths. The one concrete data-quality gap is in the requeue edge case: when retries are exhausted, a fresh key is acquired, and the requeued attempt itself fails — recordFailed fires with reason=rate_limited (from the original 429 error object) but attributed to the freshly acquired key, which never actually hit the rate limit.
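
One way to avoid this kind of misattribution, offered only as a sketch and not as the approach taken in this PR, is to classify and record each failure inside the attempt that produced it, so the reason and the key label always come from the same call. All names here (`attemptWithKey`, `recordFailedSketch`, `reasonFor`, the key names) are hypothetical, and the classifier is simplified so that only HTTP 429 maps to rate_limited.

```typescript
type FailureReason = "rate_limited" | "auth" | "other";

interface FailedSample {
  key: string;
  reason: FailureReason;
}

// Stand-in for a hosted_key.failed recorder; collects samples in memory.
const failedSamples: FailedSample[] = [];
function recordFailedSketch(key: string, reason: FailureReason): void {
  failedSamples.push({ key, reason });
}

// Simplified classifier (assumption): only HTTP 429 counts as rate_limited.
function reasonFor(error: unknown): FailureReason {
  const status = (error as { status?: number } | null)?.status;
  return status === 429 ? "rate_limited" : "other";
}

// Runs one attempt and records any failure against the key used for THIS attempt.
function attemptWithKey(key: string, call: () => void): boolean {
  try {
    call();
    return true;
  } catch (error) {
    recordFailedSketch(key, reasonFor(error));
    return false;
  }
}

// The first attempt hits a 429 on one key; the requeued attempt fails differently on a fresh key.
attemptWithKey("EXA_API_KEY_1", () => {
  throw Object.assign(new Error("Too Many Requests"), { status: 429 });
});
attemptWithKey("EXA_API_KEY_2", () => {
  throw Object.assign(new Error("Upstream error"), { status: 500 });
});
```

With this shape the fresh key is never charged with the original key's 429, because the error object and the key label never cross attempt boundaries.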

apps/sim/tools/index.ts — specifically the requeue path around line 1237 where hostedKeyForMetrics.key is mutated before the requeued call executes.

Important Files Changed

Filename	Overview
apps/sim/lib/monitoring/metrics.ts	New file; defines lazy-initialized OTel counters/histograms for hosted-key metrics using module-level singletons. Clean no-op-safe pattern; all call-sites are safe to invoke without a registered MeterProvider.
apps/sim/instrumentation-node.ts	Adds a PeriodicExportingMetricReader (60 s interval) sharing the trace OTLP endpoint/headers; normalizeOtlpMetricsUrl correctly handles traces-suffixed URLs, and the metricReader is wired into NodeSDK alongside existing spanProcessors.
apps/sim/lib/core/telemetry.ts	Platform events for throttle, rate-limit, queue-wait, queue-wait-exceeded, and unknown-model-cost are migrated from span-based trackPlatformEvent to the new hostedKeyMetrics calls; removes 5 event definitions that were silently dropped by the sampler.
apps/sim/tools/index.ts	Adds classifyHostedKeyFailure, extends RetryContext with provider, and wires recordUsed/recordFailed/recordCostCharged at success/failure sites. The requeue path mutates hostedKeyForMetrics.key to the fresh key before the requeued call runs, which can misattribute the terminal failure classification to the wrong key.
apps/sim/package.json	Adds @opentelemetry/sdk-metrics ^2.7.0 explicitly; exporter-metrics-otlp-http was already present as a dependency.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeTool called] --> B{isUsingHostedKey?}
    B -- No --> Z[execute normally, no metrics]
    B -- Yes --> C[set hostedKeyForMetrics]
    C --> D{directExecution?}

    D -- Yes --> E[run directExecution]
    E --> F{finalResult.success?}
    F -- Yes --> G[applyHostedKeyCostToResult\nrecordUsed + recordCostCharged]
    F -- No --> H[recordFailed reason=other]

    D -- No --> I[executeWithRetry]
    I --> J{throws?}
    J -- No --> K{finalResult.success?}
    K -- Yes --> G
    K -- No --> H

    J -- Yes: non-429 --> L[outer catch\nrecordFailed classifyHostedKeyFailure]
    J -- Yes: 429 retries exhausted --> M{reacquireAfterRetriesExhausted?}
    M -- requeue succeeds --> N[return result\nrecordUsed + recordCostCharged]
    M -- requeue fails or requeued call throws --> O[recordThrottled\nouter catch recordFailed reason=rate_limited\nkey=FRESH_KEY ⚠️]
    M -- no requeue --> P[recordThrottled\nouter catch recordFailed]

Reviews (3): Last reviewed commit: "fix(metrics): parse OTLP metrics URL via..."

TheodoreSpeaks · 2026-06-04T21:25:57Z

@greptile review

TheodoreSpeaks · 2026-06-04T21:42:15Z

@greptile review

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e8dae53.

cursor Bot reviewed Jun 4, 2026

Comment thread apps/sim/tools/index.ts

Comment thread apps/sim/tools/index.ts

Comment thread apps/sim/tools/index.ts

vercel Bot deployed to Preview June 4, 2026 20:16 View deployment

greptile-apps Bot reviewed Jun 4, 2026

Comment thread apps/sim/tools/index.ts Outdated

fix(metrics): correct hosted-key failure attribution

2a873e5

- Re-point used/cost/failed labels at the freshly acquired key after reacquire
- Classify quota-style 401/403 as rate_limited (mirror isRateLimitError)
- Count returned success:false runs (e.g. deep_research polling) as failed
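
The classification described in this commit could be sketched as follows. This is an illustration under assumptions, not the PR's classifyHostedKeyFailure: the `UpstreamError` shape, the `QUOTA_HINT` pattern used to detect "quota-style" auth errors, and the function name are all hypothetical, and the real isRateLimitError may use different signals.

```typescript
type FailureReason = "rate_limited" | "auth" | "other";

// Hypothetical error shape; real upstream errors may carry more fields.
interface UpstreamError {
  status?: number;
  message?: string;
}

// Assumed heuristic for quota-style 401/403 responses.
const QUOTA_HINT = /quota|rate limit|too many requests/i;

function classifyFailureSketch(error: UpstreamError): FailureReason {
  const status = error.status;
  const message = error.message ?? "";
  if (status === 429) return "rate_limited";
  if (status === 401 || status === 403) {
    // A 401/403 that mentions quota is treated as throttling, not a bad credential.
    return QUOTA_HINT.test(message) ? "rate_limited" : "auth";
  }
  return "other";
}

const reasons = [
  classifyFailureSketch({ status: 429 }),
  classifyFailureSketch({ status: 403, message: "Quota exceeded for this key" }),
  classifyFailureSketch({ status: 401, message: "Invalid API key" }),
  classifyFailureSketch({ status: 400, message: "Bad request" }),
];
```

Keeping the result to three values preserves the low-cardinality `reason` label that the PR description calls for.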

vercel Bot temporarily deployed to Preview June 4, 2026 21:26 Inactive

cursor Bot reviewed Jun 4, 2026

Comment thread apps/sim/lib/core/telemetry.ts

TheodoreSpeaks added 2 commits June 4, 2026 14:40

fix(metrics): label hosted_key.throttled with real provider on exhausted retries

72c0c06

Merge remote-tracking branch 'origin/staging' into feat/hosted-key-metrics

e56f0f7

# Conflicts:
#	apps/sim/tools/index.ts

cursor Bot reviewed Jun 4, 2026

Comment thread apps/sim/instrumentation-node.ts

vercel Bot deployed to Preview June 4, 2026 21:45 View deployment

fix(metrics): parse OTLP metrics URL via URL/pathname, not string suffix

e8dae53

Handles query strings and trailing slashes so the /v1/traces->/v1/metrics swap can't produce a malformed endpoint, matching normalizeOtlpTracesUrl.
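
A URL/pathname-based swap of this kind might look like the sketch below. It is not the PR's normalizeOtlpMetricsUrl: the function name is hypothetical, the example hosts are placeholders, and appending /v1/metrics to an endpoint that has no signal suffix is an assumption about the desired behavior.

```typescript
// Hypothetical sketch: derive an OTLP metrics endpoint from a traces (or base) endpoint.
function normalizeMetricsUrlSketch(endpoint: string): string {
  const url = new URL(endpoint);
  // Drop trailing slashes so "/v1/traces/" and "/v1/traces" are treated alike.
  let path = url.pathname.replace(/\/+$/, "");
  if (path.endsWith("/v1/traces")) {
    // Swap only the final path segment pair; the query string lives on url.search and is untouched.
    path = path.slice(0, -"/v1/traces".length) + "/v1/metrics";
  } else if (!path.endsWith("/v1/metrics")) {
    // Assumption: a bare base endpoint gets the metrics suffix appended.
    path = `${path}/v1/metrics`;
  }
  url.pathname = path;
  return url.toString();
}

const fromTraces = normalizeMetricsUrlSketch("https://otlp.example.com/otlp/v1/traces");
const fromQuery = normalizeMetricsUrlSketch("https://otlp.example.com/otlp/v1/traces/?token=abc");
const fromBase = normalizeMetricsUrlSketch("https://otlp.example.com");
```

Working on `pathname` rather than the raw string is what keeps a query string or trailing slash from ending up in the middle of the rewritten path.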

vercel Bot temporarily deployed to Preview June 4, 2026 21:45 Inactive

cursor Bot reviewed Jun 4, 2026

Comment thread apps/sim/tools/index.ts

TheodoreSpeaks merged commit 530b2c0 into staging Jun 4, 2026
14 checks passed

TheodoreSpeaks deleted the feat/hosted-key-metrics branch June 4, 2026 22:14

TheodoreSpeaks merged 5 commits into staging from feat/hosted-key-metrics