feat(run-engine,webapp): always report worker queue length metrics #4029
ericallam wants to merge 3 commits into
The runqueue.workerQueue.length gauge only observed worker queues that a dequeue had registered, so a queue's depth stopped being reported once dequeues stopped (or was never reported for a queue that backed up before anything dequeued from it). A periodic observer now refreshes the observed set from the WorkerInstanceGroup records instead, so every active worker queue (and its scheduled split variant) keeps reporting its length regardless of dequeue activity. Off by default; enable per-service via an env var. Reads from the replica and skips configured cloud providers.
The GET and POST /api/v1/workers endpoints backed a CLI command group that is no longer registered, so they had no reachable consumer. Remove them.
… timeout Disable the execution workers and batch consumers in the worker queue observation test. It only needs enqueue + processMasterQueue + the observer gauge, and the extra workers add Redis connections and make engine.quit() hang on worker shutdown when the shard's Redis is under pressure, timing the test out in CI.
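The periodic refresh described in the first commit could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class name `WorkerQueueObserver`, the `loadQueues` callback, and the choice to replace the whole observed set on every refresh are assumptions made for the example.

```typescript
// Hypothetical loader: in the PR this would read WorkerInstanceGroup records
// from the read replica; here it is injected so the sketch stays self-contained.
type LoadQueues = () => Promise<string[]>;

class WorkerQueueObserver {
  private observed: Set<string> = new Set<string>();
  private timer: ReturnType<typeof setInterval> | undefined = undefined;
  private loadQueues: LoadQueues;
  private intervalMs: number;

  constructor(loadQueues: LoadQueues, intervalMs: number) {
    this.loadQueues = loadQueues;
    this.intervalMs = intervalMs;
  }

  // Reload the observed set. Replacing the set (an assumption of this sketch)
  // means queues that no longer have a record stop being observed.
  async refreshOnce(): Promise<void> {
    const queues = await this.loadQueues();
    this.observed = new Set<string>(queues);
  }

  // Start refreshing on an interval, independent of any dequeue activity.
  start(): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => {
      this.refreshOnce().catch((err) => console.error("observer refresh failed", err));
    }, this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Queues whose length a gauge callback would report, in sorted order.
  observedQueues(): string[] {
    return Array.from(this.observed).sort();
  }
}
```

A gauge callback would then iterate `observedQueues()` and read each queue's length, so a queue keeps reporting even if nothing has ever dequeued from it.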
Summary
The runqueue.workerQueue.length gauge only reported a worker queue's depth while runs were being dequeued from it. When dequeues stop, the metric goes stale or missing, so a queue that has backed up because nothing is draining it can't be alerted on. This adds a small observer that refreshes the observed set of worker queues from the WorkerInstanceGroup records on an interval, so every active worker queue (and its scheduled split variant) keeps reporting its length regardless of dequeue activity.

The observer is off by default and enabled per service via RUN_ENGINE_WORKER_QUEUE_OBSERVER_ENABLED, reads from the read replica, and skips a configurable set of cloud providers (RUN_ENGINE_WORKER_QUEUE_OBSERVER_EXCLUDED_CLOUD_PROVIDERS, default digitalocean). When enabled it is the source of truth for the observed set, so the per-dequeue registration is skipped on that instance, and it groups by worker queue so the per-instance duplicates collapse to the true depth.

Also removes the unused GET/POST /api/v1/workers endpoints. Their only consumer was a CLI command group that is no longer registered.

Verification
Verified end to end against a local stack: the gauge reports each worker queue's length with no dequeues happening, excludes the configured providers, includes hidden groups, and the removed endpoints return as if they never existed. Added a run-engine test (workerQueueObservation.test.ts).
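The filtering and grouping described in the summary could look something like the sketch below. The record shape (`workerQueue`, `cloudProvider`, `hidden`), the comma-separated format of the excluded-providers value, and the `:scheduled` suffix for the scheduled split variant are all assumptions made for illustration; the PR description does not show the real field names or key format.

```typescript
// Hypothetical shape of a WorkerInstanceGroup record; field names are assumed.
interface WorkerGroupRecord {
  workerQueue: string;
  cloudProvider?: string;
  hidden: boolean;
}

// Parse the excluded-providers setting. A comma-separated list is assumed;
// the fallback mirrors the documented default of "digitalocean".
function parseExcludedProviders(raw: string | undefined): Set<string> {
  const value = raw === undefined ? "digitalocean" : raw;
  return new Set<string>(
    value
      .split(",")
      .map((s) => s.trim().toLowerCase())
      .filter((s) => s.length > 0)
  );
}

// Assumed naming for the scheduled split variant of a worker queue.
function scheduledVariant(queue: string): string {
  return queue + ":scheduled";
}

// Build the observed set: skip excluded providers, keep hidden groups,
// and collapse per-instance duplicates by grouping on the worker queue name.
function buildObservedSet(
  records: WorkerGroupRecord[],
  excluded: Set<string>
): string[] {
  const observed = new Set<string>();
  for (const record of records) {
    const provider = record.cloudProvider;
    if (provider !== undefined && excluded.has(provider.toLowerCase())) {
      continue;
    }
    observed.add(record.workerQueue);
    observed.add(scheduledVariant(record.workerQueue));
  }
  return Array.from(observed).sort();
}
```

Using a set keyed on the worker queue name means two instances that share a queue contribute a single entry, so the gauge reads that queue's length once rather than once per instance.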