[improve](streaming-job) Add per-job metrics for streaming insert jobs #62224
Open
JNSimba wants to merge 3 commits into apache:master from
Conversation
Add per-job granularity metrics (scanned_rows, load_bytes, filtered_rows, succeed/failed_task_count, offset) with job_id and job_name labels to the /metrics endpoint, enabling Grafana monitoring at individual job level. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Member
Author
run buildall
Member
Author
/review
Contributor
Found 1 issue during review.
fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java: current_offset/end_offset are exported as raw label values even though the offset providers return JSON strings. Both metric visitors serialize labels without escaping, so a normal non-empty offset makes /metrics invalid Prometheus text and /metrics?type=json invalid JSON. Because these labels also change as the job advances, they would create a new Prometheus series identity on each offset update even if escaping were added later.
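To make the escaping concern concrete, here is a minimal standalone sketch (illustrative code, not the Doris metric visitors) of the escaping the Prometheus text exposition format requires for label values: backslash, double quote, and newline must be escaped, otherwise the quotes inside a JSON offset string terminate the label value early and corrupt the scrape output.

```java
public class LabelEscape {
    // Escapes a label value per the Prometheus text exposition format:
    // backslash first, then double quote, then newline.
    static String escapeLabelValue(String v) {
        return v.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // A JSON offset string like the ones the offset providers return.
        String offset = "{\"partition\":\"0\",\"offset\":\"123\"}";
        // Without escapeLabelValue(), the inner quotes would end the
        // label value at the first '"' and break the exposition line.
        System.out.println("example_metric{job_id=\"1001\",current_offset=\""
                + escapeLabelValue(offset) + "\"} 1");
    }
}
```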
Critical checkpoint conclusions:
- Goal of the task: Add per-job streaming-job metrics for monitoring. Partially achieved; the per-job counters/gauges are reasonable, but the offset-label design breaks exporter correctness.
- Small/clear/focused: Mostly focused, but the offset information should not be modeled as metric labels.
- Concurrency: No new locking or deadlock issue found in this patch. MetricRepo.getMetric() is synchronized, and JobManager.queryJobs() reads from a ConcurrentHashMap, so the new traversal is weakly consistent but safe enough for metrics.
- Lifecycle/static initialization: No special lifecycle or static-init issue found.
- Configuration changes: None.
- Compatibility/incompatible changes: No FE/BE protocol or storage compatibility issue found.
- Functionally parallel code paths: The bug affects both /metrics and /metrics?type=json because both visitors serialize the same labels.
- Special conditional checks: No issue beyond the unsafe assumption that arbitrary offset strings are valid metric labels.
- Test coverage: A regression test was added, but there is still no direct coverage for label escaping/export serialization. I did not run the test suite in this review.
- Observability: Per-job metrics are useful, but dynamic offset labels are not safe observability design because they break encoding and cause series churn.
- Transaction/persistence: Not applicable for this patch.
- Data writes/modifications: Not applicable for this patch.
- FE/BE variable passing: Not applicable for this patch.
- Performance: Walking insert jobs per scrape is acceptable; dynamic offset labels would create avoidable scrape/TSDB churn.
- Other issues: None beyond the finding above.
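The concurrency conclusion above rests on a documented ConcurrentHashMap property worth spelling out: its iterators are weakly consistent, so a scrape thread can walk the job map while jobs are added or removed, without a ConcurrentModificationException; the scrape may simply miss or include in-flight updates. A generic illustration (not Doris code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentScan {
    // Iterates the map while inserting into it. With ConcurrentHashMap this
    // is well-defined (weakly consistent iteration); a plain HashMap would
    // fail fast with ConcurrentModificationException.
    static int scanWhileMutating(Map<Long, String> jobs) {
        int seen = 0;
        for (Map.Entry<Long, String> e : jobs.entrySet()) {
            jobs.putIfAbsent(1000L + seen, "added-during-scan");
            seen++;
            if (seen > 100) {
                break; // safety bound: the iterator may or may not see new entries
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<Long, String> jobs = new ConcurrentHashMap<>();
        jobs.put(1L, "job_a");
        jobs.put(2L, "job_b");
        System.out.println("entries visited: " + scanWhileMutating(jobs));
    }
}
```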
Contributor
FE UT Coverage Report: Increment line coverage
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member
Author
run buildall
Offset is a JSON string that changes frequently, which would create series churn in Prometheus and break metric serialization. Remove the offset metric; offset info can be viewed via SHOW STREAMING JOBS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member
Author
run buildall
Contributor
FE UT Coverage Report: Increment line coverage
Summary
- New per-job metrics with job_id and job_name labels: streaming_job_per_job_scanned_rows, streaming_job_per_job_load_bytes, streaming_job_per_job_filtered_rows, streaming_job_per_job_succeed_task_count, streaming_job_per_job_failed_task_count
Approach
Follows the generateBackendsTabletMetrics() pattern: on each /metrics request, remove all previous per-job metrics, then re-register with current job data. This ensures values are always up-to-date and stale jobs are cleaned up automatically. Offset info is intentionally excluded from metric labels to avoid Prometheus series churn and serialization issues. Offset can be viewed via SHOW STREAMING JOBS or the jobs("type"="insert") TVF.
Test plan
- /metrics?type=json with correct job_id and job_name labels
- test_streaming_mysql_job_metrics.groovy regression test
🤖 Generated with Claude Code
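The remove-then-re-register pattern described under Approach can be sketched as follows. The class, field, and method names here are hypothetical stand-ins, not the actual MetricRepo/JobManager API: on every scrape, all previously registered per-job series are dropped and rebuilt from the live job table, so a removed job's series disappears on the next scrape without any explicit cleanup hook.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerJobMetrics {
    // Minimal stand-in for a streaming insert job.
    static final class Job {
        final long id;
        final String name;
        final long scannedRows;
        Job(long id, String name, long scannedRows) {
            this.id = id;
            this.name = name;
            this.scannedRows = scannedRows;
        }
    }

    static final Map<Long, Job> JOBS = new ConcurrentHashMap<>();
    static final List<String> REGISTRY = new ArrayList<>();

    // Called on each scrape: drop all previous per-job series, then
    // re-register one series per currently known job.
    static synchronized List<String> scrape() {
        REGISTRY.clear();
        for (Job job : JOBS.values()) {
            REGISTRY.add(String.format(
                    "streaming_job_per_job_scanned_rows{job_id=\"%d\",job_name=\"%s\"} %d",
                    job.id, job.name, job.scannedRows));
        }
        return REGISTRY;
    }

    public static void main(String[] args) {
        JOBS.put(1L, new Job(1L, "job_a", 42));
        scrape().forEach(System.out::println);
        JOBS.remove(1L); // dropped job: its series vanishes on the next scrape
        System.out.println("series after removal: " + scrape().size());
    }
}
```

Because the registry is rebuilt from scratch per request, values are always current and no stale series linger; the trade-off is a full walk of the job table on every scrape, which the review above judged acceptable.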