
feat(metrics): emit hosted-key metrics to Grafana via OTel #4885

Merged
TheodoreSpeaks merged 5 commits into staging from feat/hosted-key-metrics on Jun 4, 2026


@TheodoreSpeaks
Collaborator

Summary

  • Add OTel metrics for hosted-key usage, cost, failures, throttles, and queue waits — replacing the platform.hosted_key.* spans that were silently dropped by the trace sampler
  • Wire a MeterProvider into the Next.js app's OTel SDK (metrics share the trace OTLP endpoint); trigger.dev already exports metrics, so both runtimes emit
  • Metrics: hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, hosted_key.queue_wait_exceeded
  • Per-key attribution via a key label (env var name, never the secret) so a failing/throttled key is identifiable; failure rate = failed / (failed + used)
  • Low-cardinality labels only (provider, tool, reason, key); per-workspace/user cost stays in usage_log
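The failure-rate formula in the summary can be sketched as a small helper. The function name and the zero-traffic behavior are assumptions for illustration; the PR only states the ratio failed / (failed + used).

```typescript
// Hypothetical helper: the PR states the formula, not this function.
function failureRate(failed: number, used: number): number {
  const total = failed + used
  // Assumption: with no attempts in the window, report 0 instead of dividing by zero.
  return total === 0 ? 0 : failed / total
}
```

Because `used` counts only successful runs, the denominator is the total number of attempts, so the result stays between 0 and 1.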

Type of Change

  • New feature

Testing

Verified end-to-end against Grafana Cloud (grafanacloud-prom): drove exa tools and confirmed used/cost_charged on success, and failed{reason=rate_limited} + upstream_rate_limited + throttled on a forced 429, failed{reason=other} on a 4xx. Type-check and biome clean; check:api-validation:strict passes.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replace the dropped platform.hosted_key.* spans with OTel counters/histograms for usage, cost, failures, throttles, and queue waits. Wire a MeterProvider into the Next.js OTel SDK (trigger.dev already exports metrics). Per-key attribution via a key label (env var name).
@vercel

vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 4, 2026 9:45pm


@cursor

cursor Bot commented Jun 4, 2026

PR Summary

Medium Risk
Touches the hosted-key execution and billing-adjacent telemetry path; changes are additive observability but mis-labeled metrics could skew operational dashboards.

Overview
Hosted-key observability moves from sampled trace events to unsampled OpenTelemetry metrics, so usage, cost, throttles, and queue behavior show up reliably in Grafana instead of disappearing when traces are sampled.

The Next.js OTel bootstrap now exports metrics on the same OTLP endpoint as traces (normalizeOtlpMetricsUrl + PeriodicExportingMetricReader at 60s). A new hostedKeyMetrics module defines low-cardinality counters/histograms (hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, etc.) with a key label for the env var name only.

PlatformEvents hosted-key helpers now call these recorders instead of trackPlatformEvent spans. executeTool records success (used + cost_charged), non-success paths (failed with rate_limited / auth / other), and updates the key label after re-acquire. Adds @opentelemetry/sdk-metrics.

Reviewed by Cursor Bugbot for commit e8dae53. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/tools/index.ts
Comment thread apps/sim/tools/index.ts
Comment thread apps/sim/tools/index.ts
@greptile-apps
Contributor

greptile-apps Bot commented Jun 4, 2026

Greptile Summary

This PR replaces span-based platform.hosted_key.* events (which were silently dropped by the trace sampler) with proper OTel metrics, wiring a PeriodicExportingMetricReader into the Next.js NodeSDK and emitting counters/histograms for usage, cost, failures, throttles, and queue waits.

  • apps/sim/lib/monitoring/metrics.ts — new file defining lazy-initialized, no-op-safe OTel instruments (hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, hosted_key.queue_wait_exceeded) with low-cardinality labels (provider, tool, reason, key env-var name).
  • apps/sim/instrumentation-node.ts — adds PeriodicExportingMetricReader (60 s flush) sharing the existing OTLP endpoint/headers; normalizeOtlpMetricsUrl swaps the /v1/traces suffix for /v1/metrics.
  • apps/sim/tools/index.ts — threads envVarName through applyHostedKeyCostToResult, adds classifyHostedKeyFailure, and records used/cost_charged on success and failed on both throwing and non-throwing failure paths; extends the retry-requeue path to update metric labels when a fresh key is acquired.
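The lazy-initialized instrument pattern described for metrics.ts might look roughly like the sketch below. To keep it dependency-free, `Meter` and `Counter` here are local stand-ins shaped like the OpenTelemetry API types, and `createHostedKeyMetrics`, the injected `getMeter`, and the label values are all hypothetical; this is not the PR's actual module.

```typescript
// Local stand-ins shaped like the OTel API's Meter/Counter (assumption, so the
// sketch runs without installing any package).
type Attrs = Record<string, string>
interface Counter {
  add(value: number, attrs?: Attrs): void
}
interface Meter {
  createCounter(name: string): Counter
}

// Low-cardinality labels only; `key` is the env var NAME, never the secret.
type HostedKeyLabels = { provider: string; tool: string; key: string }

function createHostedKeyMetrics(getMeter: () => Meter) {
  // Instruments are created on first use rather than at import time.
  let used: Counter | undefined
  let failed: Counter | undefined

  return {
    recordUsed(labels: HostedKeyLabels): void {
      if (!used) used = getMeter().createCounter('hosted_key.used')
      used.add(1, labels)
    },
    recordFailed(labels: HostedKeyLabels, reason: string): void {
      if (!failed) failed = getMeter().createCounter('hosted_key.failed')
      failed.add(1, { ...labels, reason })
    },
  }
}
```

Deferring instrument creation means importing the module has no side effects, and each instrument is created once even when the recorders are called many times.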

Confidence Score: 4/5

Safe to merge; the new metric plumbing is correct for all common paths and the OTel no-op fallback ensures nothing breaks without a provider configured.

The metric recording logic is sound across the normal success, direct-failure, and retry-exhaustion paths. The one concrete data-quality gap is in the requeue edge case: when retries are exhausted, a fresh key is acquired, and the requeued attempt itself fails — recordFailed fires with reason=rate_limited (from the original 429 error object) but attributed to the freshly acquired key, which never actually hit the rate limit.

apps/sim/tools/index.ts — specifically the requeue path around line 1237 where hostedKeyForMetrics.key is mutated before the requeued call executes.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/monitoring/metrics.ts | New file; defines lazy-initialized OTel counters/histograms for hosted-key metrics using module-level singletons. Clean no-op-safe pattern; all call-sites are safe to invoke without a registered MeterProvider. |
| apps/sim/instrumentation-node.ts | Adds a PeriodicExportingMetricReader (60 s interval) sharing the trace OTLP endpoint/headers; normalizeOtlpMetricsUrl correctly handles traces-suffixed URLs, and the metricReader is wired into NodeSDK alongside existing spanProcessors. |
| apps/sim/lib/core/telemetry.ts | Platform events for throttle, rate-limit, queue-wait, queue-wait-exceeded, and unknown-model-cost are migrated from span-based trackPlatformEvent to the new hostedKeyMetrics calls; removes 5 event definitions that were silently dropped by the sampler. |
| apps/sim/tools/index.ts | Adds classifyHostedKeyFailure, extends RetryContext with provider, and wires recordUsed/recordFailed/recordCostCharged at success/failure sites. The requeue path mutates hostedKeyForMetrics.key to the fresh key before the requeued call runs, which can misattribute the terminal failure classification to the wrong key. |
| apps/sim/package.json | Adds @opentelemetry/sdk-metrics ^2.7.0 explicitly; exporter-metrics-otlp-http was already present as a dependency. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeTool called] --> B{isUsingHostedKey?}
    B -- No --> Z[execute normally, no metrics]
    B -- Yes --> C[set hostedKeyForMetrics]
    C --> D{directExecution?}

    D -- Yes --> E[run directExecution]
    E --> F{finalResult.success?}
    F -- Yes --> G[applyHostedKeyCostToResult\nrecordUsed + recordCostCharged]
    F -- No --> H[recordFailed reason=other]

    D -- No --> I[executeWithRetry]
    I --> J{throws?}
    J -- No --> K{finalResult.success?}
    K -- Yes --> G
    K -- No --> H

    J -- Yes: non-429 --> L[outer catch\nrecordFailed classifyHostedKeyFailure]
    J -- Yes: 429 retries exhausted --> M{reacquireAfterRetriesExhausted?}
    M -- requeue succeeds --> N[return result\nrecordUsed + recordCostCharged]
    M -- requeue fails or requeued call throws --> O[recordThrottled\nouter catch recordFailed reason=rate_limited\nkey=FRESH_KEY ⚠️]
    M -- no requeue --> P[recordThrottled\nouter catch recordFailed]

Reviews (3): Last reviewed commit: "fix(metrics): parse OTLP metrics URL via..." | Re-trigger Greptile

Comment thread apps/sim/tools/index.ts Outdated
- Re-point used/cost/failed labels at the freshly acquired key after reacquire
- Classify quota-style 401/403 as rate_limited (mirror isRateLimitError)
- Count returned success:false runs (e.g. deep_research polling) as failed
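The classification rules in this commit message could be sketched as follows. The error shape, the `QUOTA_HINT` pattern, and the function name `classifyFailure` are assumptions for illustration; the PR's real classifyHostedKeyFailure and isRateLimitError are not shown on this page and may use different signals.

```typescript
type HostedKeyFailureReason = 'rate_limited' | 'auth' | 'other'

// Assumed error shape; real upstream errors may carry different fields.
interface UpstreamError {
  status?: number
  message?: string
}

// Hypothetical heuristic for "quota-style" 401/403 responses.
const QUOTA_HINT = /quota|rate limit|too many requests/i

function classifyFailure(err: UpstreamError): HostedKeyFailureReason {
  if (err.status === 429) return 'rate_limited'
  if (err.status === 401 || err.status === 403) {
    // Some providers report exhausted quota as an auth error; count it as rate limiting.
    return QUOTA_HINT.test(err.message ?? '') ? 'rate_limited' : 'auth'
  }
  return 'other'
}
```

Keeping the reason set to three values preserves the low-cardinality constraint on the `reason` label.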
@TheodoreSpeaks
Collaborator Author

@greptile review

Comment thread apps/sim/lib/core/telemetry.ts
@TheodoreSpeaks
Collaborator Author

@greptile review

Comment thread apps/sim/instrumentation-node.ts
Handles query strings and trailing slashes so the /v1/traces->/v1/metrics
swap can't produce a malformed endpoint, matching normalizeOtlpTracesUrl.
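One way to make the /v1/traces to /v1/metrics swap robust to query strings and trailing slashes is to edit only the parsed pathname. This is a minimal sketch under assumptions: `toMetricsEndpoint` is a hypothetical name, the example hosts are placeholders, and appending /v1/metrics to a bare base URL is a guess at reasonable behavior rather than something the PR confirms.

```typescript
const TRACES_SUFFIX = '/v1/traces'
const METRICS_SUFFIX = '/v1/metrics'

// Hypothetical stand-in for the PR's normalizeOtlpMetricsUrl.
function toMetricsEndpoint(endpoint: string): string {
  const url = new URL(endpoint)
  // Drop trailing slashes so "/v1/traces/" and "/v1/traces" are treated alike.
  let path = url.pathname.replace(/\/+$/, '')
  if (path.endsWith(TRACES_SUFFIX)) {
    path = path.slice(0, path.length - TRACES_SUFFIX.length) + METRICS_SUFFIX
  } else if (!path.endsWith(METRICS_SUFFIX)) {
    // Assumption: a base URL without a signal path gets the metrics path appended.
    path = path + METRICS_SUFFIX
  }
  url.pathname = path
  // Only the pathname was modified, so the query string is preserved.
  return url.toString()
}
```

Operating on the parsed URL rather than on the raw string avoids the case where a suffix replacement misses because the string ends in a query string instead of the path.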

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit e8dae53. Configure here.

Comment thread apps/sim/tools/index.ts
@TheodoreSpeaks TheodoreSpeaks merged commit 530b2c0 into staging Jun 4, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the feat/hosted-key-metrics branch June 4, 2026 22:14