feat(metrics): emit hosted-key metrics to Grafana via OTel #4885
Replace the dropped platform.hosted_key.* spans with OTel counters/histograms for usage, cost, failures, throttles, and queue waits. Wire a MeterProvider into the Next.js OTel SDK (trigger.dev already exports metrics). Per-key attribution via a key label (env var name).
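As a rough illustration of the per-key attribution described above, the sketch below keeps in-memory counters keyed by the `key` label and derives the failure rate that the Summary defines as failed / (failed + used). `HostedKeyCounters` and the env var name `EXA_API_KEY_1` are hypothetical names chosen for this example; the PR records these values through OTel instruments, not through this class.

```typescript
// Hypothetical in-memory stand-in for the OTel counters; not the PR's code.
type CounterName = 'used' | 'failed';

class HostedKeyCounters {
  private counts = new Map<string, number>();

  // Increment a counter for one key label (the env var name, never the secret).
  add(name: CounterName, key: string, value = 1): void {
    const id = `${name}|${key}`;
    this.counts.set(id, (this.counts.get(id) ?? 0) + value);
  }

  // Read the current value of a counter for one key label.
  get(name: CounterName, key: string): number {
    return this.counts.get(`${name}|${key}`) ?? 0;
  }

  // failure rate = failed / (failed + used); 0 when the key has seen no traffic.
  failureRate(key: string): number {
    const failed = this.get('failed', key);
    const used = this.get('used', key);
    const total = failed + used;
    return total === 0 ? 0 : failed / total;
  }
}

// Example: three successful calls and one failure on a hypothetical key label.
const counters = new HostedKeyCounters();
counters.add('used', 'EXA_API_KEY_1', 3);
counters.add('failed', 'EXA_API_KEY_1');
```

Because every data point carries the key label, the same ratio can be computed per key on the dashboard side, which is what makes a single failing or throttled key identifiable.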
PR Summary
Medium Risk
Overview
The Next.js OTel bootstrap now exports metrics on the same OTLP endpoint as traces (
Reviewed by Cursor Bugbot for commit e8dae53.
Greptile Summary
This PR replaces span-based
Confidence Score: 4/5
Safe to merge; the new metric plumbing is correct for all common paths and the OTel no-op fallback ensures nothing breaks without a provider configured.
The metric recording logic is sound across the normal success, direct-failure, and retry-exhaustion paths. The one concrete data-quality gap is in the requeue edge case: when retries are exhausted, a fresh key is acquired, and the requeued attempt itself fails.
apps/sim/tools/index.ts: specifically the requeue path around line 1237 where
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[executeTool called] --> B{isUsingHostedKey?}
B -- No --> Z[execute normally, no metrics]
B -- Yes --> C[set hostedKeyForMetrics]
C --> D{directExecution?}
D -- Yes --> E[run directExecution]
E --> F{finalResult.success?}
F -- Yes --> G[applyHostedKeyCostToResult\nrecordUsed + recordCostCharged]
F -- No --> H[recordFailed reason=other]
D -- No --> I[executeWithRetry]
I --> J{throws?}
J -- No --> K{finalResult.success?}
K -- Yes --> G
K -- No --> H
J -- Yes: non-429 --> L[outer catch\nrecordFailed classifyHostedKeyFailure]
J -- Yes: 429 retries exhausted --> M{reacquireAfterRetriesExhausted?}
M -- requeue succeeds --> N[return result\nrecordUsed + recordCostCharged]
M -- requeue fails or requeued call throws --> O[recordThrottled\nouter catch recordFailed reason=rate_limited\nkey=FRESH_KEY ⚠️]
M -- no requeue --> P[recordThrottled\nouter catch recordFailed]
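The branches of the flowchart can be condensed into a small lookup from outcome to the metrics that get recorded. This is only a sketch of the diagram: the `Outcome` type and `metricsFor` function are hypothetical, and the reason attached to the "no requeue" branch is assumed to be `rate_limited` because that path is only reached after a 429.

```typescript
// Hypothetical model of the flowchart's terminal states; not the PR's code.
type Outcome =
  | { kind: 'success' }                      // finalResult.success is true
  | { kind: 'returned_failure' }             // finalResult.success is false
  | { kind: 'threw'; reason: string }        // non-429 error, already classified
  | { kind: 'retries_exhausted'; requeueSucceeded: boolean }; // 429 path

// Return the metric names (with labels) that the flowchart records per outcome.
function metricsFor(outcome: Outcome): string[] {
  if (outcome.kind === 'success') {
    return ['used', 'cost_charged'];
  }
  if (outcome.kind === 'returned_failure') {
    return ['failed{reason=other}'];
  }
  if (outcome.kind === 'threw') {
    return [`failed{reason=${outcome.reason}}`];
  }
  // 429 with retries exhausted: a successful requeue counts as normal usage.
  if (outcome.requeueSucceeded) {
    return ['used', 'cost_charged'];
  }
  // Assumption: both the failed-requeue and no-requeue branches end up as
  // throttled plus a rate_limited failure.
  return ['throttled', 'failed{reason=rate_limited}'];
}
```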
Reviews (3): Last reviewed commit: "fix(metrics): parse OTLP metrics URL via..."
- Re-point used/cost/failed labels at the freshly acquired key after reacquire
- Classify quota-style 401/403 as rate_limited (mirror isRateLimitError)
- Count returned success:false runs (e.g. deep_research polling) as failed
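The second item in this commit could look roughly like the sketch below. It is an assumption-heavy illustration: `classifyFailureSketch`, the `QUOTA_HINT` pattern, and the restriction to two reasons (`rate_limited` and `other`, the only ones named in this PR) are all hypothetical, and the actual `isRateLimitError` logic it is meant to mirror is not shown in this conversation.

```typescript
// Only the two reasons that appear in this PR's discussion are modeled here.
type FailureReason = 'rate_limited' | 'other';

// Hypothetical heuristic for "quota-style" auth errors; the real check may differ.
const QUOTA_HINT = /quota|rate limit|too many requests/i;

function classifyFailureSketch(status: number | undefined, message = ''): FailureReason {
  // A plain 429 is always a rate limit.
  if (status === 429) return 'rate_limited';
  // Some providers report exhausted quota as 401/403 instead of 429.
  if ((status === 401 || status === 403) && QUOTA_HINT.test(message)) {
    return 'rate_limited';
  }
  return 'other';
}
```

Treating quota-style 401/403 responses as rate limits keeps the `failed{reason=rate_limited}` series consistent with whatever already triggers the retry path.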
@greptile review
…trics # Conflicts: # apps/sim/tools/index.ts
@greptile review
Handles query strings and trailing slashes so the /v1/traces->/v1/metrics swap can't produce a malformed endpoint, matching normalizeOtlpTracesUrl.
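A minimal sketch of this kind of swap, using the standard `URL` parser so that the query string and trailing slashes are handled separately from the path. The function name `toOtlpMetricsUrl` and the example endpoints are hypothetical, and leaving URLs that do not end in `/v1/traces` unchanged is an assumption, not necessarily what the PR does.

```typescript
const TRACES_SUFFIX = '/v1/traces';
const METRICS_SUFFIX = '/v1/metrics';

// Derive an OTLP metrics endpoint from a traces endpoint (hypothetical helper).
function toOtlpMetricsUrl(tracesUrl: string): string {
  const url = new URL(tracesUrl);
  // Strip trailing slashes so "/v1/traces/" and "/v1/traces" are treated alike.
  const path = url.pathname.replace(/\/+$/, '');
  if (path.endsWith(TRACES_SUFFIX)) {
    // Only the path changes; url.search (the query string) is preserved.
    url.pathname = path.slice(0, -TRACES_SUFFIX.length) + METRICS_SUFFIX;
  }
  // Assumption: any other path is returned as parsed, without a swap.
  return url.toString();
}

// Example with a trailing slash and a query string:
const metricsUrl = toOtlpMetricsUrl('https://otlp.example.com/otlp/v1/traces/?token=abc');
```

Parsing first avoids the failure mode of a plain string replace, where the trailing slash or query string ends up in the middle of the rewritten path.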
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Summary
- Replace the platform.hosted_key.* spans that were silently dropped by the trace sampler
- OTel counters/histograms: hosted_key.used, hosted_key.cost_charged, hosted_key.failed, hosted_key.throttled, hosted_key.upstream_rate_limited, hosted_key.queue_wait_duration, hosted_key.queue_wait_exceeded
- key label (env var name, never the secret) so a failing/throttled key is identifiable; failure rate = failed / (failed + used)
- usage_log
Type of Change
Testing
Verified end-to-end against Grafana Cloud (grafanacloud-prom): drove exa tools and confirmed used/cost_charged on success, failed{reason=rate_limited} + upstream_rate_limited + throttled on a forced 429, and failed{reason=other} on a 4xx. Type-check and biome clean; check:api-validation:strict passes.
Checklist