Health alerts: sum aggregation on INCREMENTAL dimensions produces wrong volume for non-1s update_every #22112

@ktsaou

Summary

The sum time-grouping aggregation in health alert lookups does not correctly compute total event volume when used on RRD_ALGORITHM_INCREMENTAL dimensions with update_every other than 1 second. The result is too small by a factor of update_every.

Root Cause

The ingestion pipeline divides the counter delta by update_every to convert volume to rate:

// src/database/rrdset-collection.c:387
new_value /= (NETDATA_DOUBLE)st->update_every;

The stored value is a rate (events/second). When the sum time-grouping aggregates these values, it simply adds them up without multiplying by the time interval each point represents:

// src/web/api/queries/sum/sum.h:32
static inline void tg_sum_add(RRDR *r, NETDATA_DOUBLE value) {
    struct tg_sum *g = (struct tg_sum *)r->time_grouping.data;
    g->sum += value;   // raw addition, no interval scaling
}
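
The interaction of these two steps can be reproduced in isolation. The following is a minimal standalone sketch, not Netdata code: store_rate, sum_add and simulate_sum are hypothetical helpers that mimic the division at ingestion and the plain addition in tg_sum_add, and double stands in for NETDATA_DOUBLE. It assumes a constant event rate and a window that is an exact multiple of update_every.

```c
/* Hypothetical stand-in for the ingestion step: a counter delta collected
 * over update_every seconds is stored as a per-second rate. */
static double store_rate(double counter_delta, int update_every) {
    return counter_delta / (double)update_every;
}

/* Hypothetical stand-in for tg_sum_add: the stored value is added as-is,
 * with no scaling by the interval the point covers. */
static void sum_add(double *sum, double value) {
    *sum += value;
}

/* Simulates window_seconds of collection at a constant events_per_second
 * and returns what a plain sum over the stored points reports.
 * update_every must be greater than zero. */
static double simulate_sum(double events_per_second, int window_seconds, int update_every) {
    double sum = 0.0;
    for (int t = 0; t + update_every <= window_seconds; t += update_every) {
        /* events counted during this collection interval */
        double counter_delta = events_per_second * (double)update_every;
        sum_add(&sum, store_rate(counter_delta, update_every));
    }
    return sum;
}
```

Calling simulate_sum(100.0, 60, 5) walks through 12 collection intervals, each storing 100 events/s, so the plain sum reports 1200 rather than the 6000 events that actually occurred.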

Example

6000 events over 1 minute, uniform rate of 100 events/s:

| update_every | Points in 1m | Each point value | sum -1m result | Actual volume |
|---|---|---|---|---|
| 1s | 60 | 100 events/s | 6000 | 6000 |
| 5s | 12 | 100 events/s | 1200 | 6000 |
| 10s | 6 | 100 events/s | 600 | 6000 |

The result equals total_events / update_every, which only equals total_events when update_every = 1.
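
The same relationship can be written out as a short derivation. The symbols are introduced here for illustration only: V is the total number of events in the window, T the window length in seconds, u the update_every in seconds, N the number of stored points, and r the stored per-second rate. It assumes a uniform rate and a window that is an exact multiple of u.

```math
\begin{aligned}
N &= \frac{T}{u}, \qquad r = \frac{V}{T} \\
\text{sum} &= \sum_{i=1}^{N} r = N\,r = \frac{T}{u}\cdot\frac{V}{T} = \frac{V}{u} \\
\text{volume} &= \sum_{i=1}^{N} r\,u = N\,r\,u = V
\end{aligned}
```

With T = 60, V = 6000 and u = 5, the plain sum gives 6000 / 5 = 1200, matching the second row of the example.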

Impact

  • Affected alerts: Any health alert using lookup: sum on an INCREMENTAL dimension with update_every != 1. For example, web_log.conf uses lookup: sum -1m unaligned to count "number of HTTP requests in the last minute" — this is correct only at the default 1s collection interval.
  • Percentage-based alerts are unaffected: Alerts that compute percentages (e.g., $this * 100 / $total) are not affected because both numerator and denominator are off by the same factor, so the ratio cancels out.
  • Severity: Low in practice because most collectors default to update_every=1, but the behavior is incorrect and surprising for users who configure longer intervals.

Suggested Fix

One of the following:

  1. Add a volume (or integral) time-grouping that multiplies each rate value by the point's time span before summing, correctly computing the area under the rate curve.

  2. Make sum interval-aware when operating on INCREMENTAL dimensions — multiply each value by the actual time delta between consecutive points.

  3. Document the limitation — clarify that sum on rate-based (INCREMENTAL) dimensions only gives correct volume at 1s collection intervals, and explain how to use calc: $this * $update_every as a workaround.

Option 1 is cleanest — a dedicated volume aggregation that always correctly integrates rate data, regardless of collection interval. This would also be useful in the API (&group=volume).
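
A rough sketch of what such a volume aggregation might look like is shown below. Everything in it is hypothetical: struct tg_volume, tg_volume_reset, tg_volume_add, tg_volume_flush and volume_of do not exist in the Netdata source, and double stands in for NETDATA_DOUBLE. Note that the existing tg_sum_add receives only the value, so a real implementation would need some way to obtain the time span of each point; here it is simply passed in as an argument.

```c
#include <stddef.h>

/* Hypothetical state for a "volume" time-grouping (illustrative names). */
struct tg_volume {
    double volume;   /* accumulated events: sum of rate * seconds */
};

static void tg_volume_reset(struct tg_volume *g) {
    g->volume = 0.0;
}

/* Adds one stored point. rate is in events/second and span_seconds is the
 * time the point covers, so the product is the event count for that point. */
static void tg_volume_add(struct tg_volume *g, double rate, double span_seconds) {
    g->volume += rate * span_seconds;
}

static double tg_volume_flush(const struct tg_volume *g) {
    return g->volume;
}

/* Convenience wrapper: integrates count points that all cover the same
 * span (the chart's update_every, in seconds). */
static double volume_of(const double *rates, size_t count, double update_every) {
    struct tg_volume g;
    tg_volume_reset(&g);
    for (size_t i = 0; i < count; i++)
        tg_volume_add(&g, rates[i], update_every);
    return tg_volume_flush(&g);
}
```

Under these assumptions, 12 points of 100 events/s at a 5s interval and 6 points of 100 events/s at a 10s interval both integrate to 6000 events, independent of the collection interval.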

How This Was Found

During review of PR #22077 (audit subsystem monitoring), Copilot flagged lookup: sum -1m of lost on an INCREMENTAL chart. Investigation traced through the ingestion and query code paths to confirm the behavior.
