
[improve](streaming-job) Add per-job metrics for streaming insert jobs #62224

Open
JNSimba wants to merge 3 commits into apache:master from JNSimba:add_per_job_metrics

Conversation

@JNSimba (Member) commented Apr 8, 2026

Summary

  • Add per-job granularity metrics for streaming insert jobs with job_id and job_name labels
  • New metrics: streaming_job_per_job_scanned_rows, streaming_job_per_job_load_bytes, streaming_job_per_job_filtered_rows, streaming_job_per_job_succeed_task_count, streaming_job_per_job_failed_task_count
  • Existing global aggregated metrics remain unchanged
  • Follow-up to [Improve](StreamingJob) add more metrics to observe the streaming job #60493

Approach

Follows the generateBackendsTabletMetrics() pattern: on each /metrics request, all previously registered per-job metrics are removed and then re-registered with current job data. This keeps values up to date and cleans up metrics for stale jobs automatically.

Offset info is intentionally excluded from metric labels to avoid Prometheus series churn and serialization issues. Offsets can instead be viewed via SHOW STREAMING JOBS or the jobs("type"="insert") TVF.
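The refresh-on-scrape approach described above can be sketched as follows. This is a minimal illustration, not Doris code: the class and record names (PerJobMetricRegistry, JobSnapshot) are hypothetical, and labels are folded into the map key for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the refresh-on-scrape pattern: on each /metrics request,
// drop every previously registered per-job series, then re-register
// series built from a fresh snapshot of the current jobs. Stale jobs
// disappear automatically because their series are never re-added.
class PerJobMetricRegistry {
    // metric name + labels -> value (labels folded into the key for brevity)
    private final Map<String, Long> metrics = new ConcurrentHashMap<>();

    record JobSnapshot(long jobId, String jobName, long scannedRows, long loadBytes) {}

    synchronized void refresh(List<JobSnapshot> jobs) {
        // Remove all stale per-job series before re-registering.
        metrics.keySet().removeIf(k -> k.startsWith("streaming_job_per_job_"));
        for (JobSnapshot job : jobs) {
            String labels = "{job_id=\"" + job.jobId()
                    + "\",job_name=\"" + job.jobName() + "\"}";
            metrics.put("streaming_job_per_job_scanned_rows" + labels, job.scannedRows());
            metrics.put("streaming_job_per_job_load_bytes" + labels, job.loadBytes());
        }
    }

    Map<String, Long> snapshot() {
        return Map.copyOf(metrics);
    }
}
```

The trade-off is that each scrape walks all insert jobs, but in exchange the registry never accumulates series for deleted jobs.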

Test plan

  • Verify per-job metrics appear in /metrics?type=json with correct job_id and job_name labels
  • Verify existing global streaming job metrics still present
  • Verify FE replay is not affected
  • Run test_streaming_mysql_job_metrics.groovy regression test

🤖 Generated with Claude Code

Add per-job granularity metrics (scanned_rows, load_bytes, filtered_rows,
succeed/failed_task_count, offset) with job_id and job_name labels to the
/metrics endpoint, enabling Grafana monitoring at individual job level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with specific error information), and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was modified, and what impacts it may have.
  3. What features were added, and why.
  4. Which code was refactored, and why.
  5. Which functions were optimized, and what the difference is before and after the optimization.

@JNSimba (Member, Author) commented Apr 8, 2026

run buildall

@JNSimba (Member, Author) commented Apr 8, 2026

/review

@github-actions (Bot) left a comment

Found 1 issue during review.

  1. fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java: current_offset / end_offset are exported as raw label values even though the offset providers return JSON strings. Both metric visitors serialize labels without escaping, so a normal non-empty offset makes /metrics invalid Prometheus text and /metrics?type=json invalid JSON. Because these labels also change as the job advances, they would create a new Prometheus series identity on each offset update even if escaping were added later.
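The escaping problem flagged above can be made concrete with a short sketch. Per the Prometheus text exposition format, label values must escape backslash, double quote, and line feed; a raw JSON offset string contains double quotes, so emitting it unescaped yields an invalid series line. The escapeLabelValue helper below is hypothetical, not existing Doris code.

```java
// Why raw JSON offsets break the exporter: a label value like
// {"partition":0} contains '"', which terminates the label value early
// in Prometheus text format unless escaped. The exposition format
// requires escaping '\' as '\\', '"' as '\"', and '\n' as '\n'.
class LabelEscaping {
    static String escapeLabelValue(String v) {
        // Escape backslash first so later replacements are not double-escaped.
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String series(String metric, String labelName, String labelValue) {
        // One exposition-format sample line with a fixed value of 1.
        return metric + "{" + labelName + "=\"" + escapeLabelValue(labelValue) + "\"} 1";
    }
}
```

Note that escaping alone would not fix the second half of the finding: since offsets change as the job advances, an offset label would still mint a new Prometheus series identity on every update, which is why dropping the label is the right fix.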

Critical checkpoint conclusions:

  • Goal of the task: Add per-job streaming-job metrics for monitoring. Partially achieved; the per-job counters/gauges are reasonable, but the offset-label design breaks exporter correctness.
  • Small/clear/focused: Mostly focused, but the offset information should not be modeled as metric labels.
  • Concurrency: No new locking or deadlock issue found in this patch. MetricRepo.getMetric() is synchronized, and JobManager.queryJobs() reads from a ConcurrentHashMap, so the new traversal is weakly consistent but safe enough for metrics.
  • Lifecycle/static initialization: No special lifecycle or static-init issue found.
  • Configuration changes: None.
  • Compatibility/incompatible changes: No FE/BE protocol or storage compatibility issue found.
  • Functionally parallel code paths: The bug affects both /metrics and /metrics?type=json because both visitors serialize the same labels.
  • Special conditional checks: No issue beyond the unsafe assumption that arbitrary offset strings are valid metric labels.
  • Test coverage: A regression test was added, but there is still no direct coverage for label escaping/export serialization. I did not run the test suite in this review.
  • Observability: Per-job metrics are useful, but dynamic offset labels are not safe observability design because they break encoding and cause series churn.
  • Transaction/persistence: Not applicable for this patch.
  • Data writes/modifications: Not applicable for this patch.
  • FE/BE variable passing: Not applicable for this patch.
  • Performance: Walking insert jobs per scrape is acceptable; dynamic offset labels would create avoidable scrape/TSDB churn.
  • Other issues: None beyond the finding above.

Comment thread on fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java (outdated)
@hello-stephen (Contributor)

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report

@JNSimba (Member, Author) commented Apr 8, 2026

run buildall

Offset is a JSON string that changes frequently, which would create
series churn in Prometheus and break metric serialization. Remove
the offset metric; offset info can be viewed via SHOW STREAMING JOBS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JNSimba (Member, Author) commented Apr 8, 2026

run buildall

@hello-stephen (Contributor)

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report


2 participants