Summary
The sum time-grouping aggregation in health alert lookups does not compute total event volume correctly when used on RRD_ALGORITHM_INCREMENTAL dimensions with an update_every other than 1 second. The result is too small by a factor of update_every.
Root Cause
The ingestion pipeline divides the counter delta by update_every to convert volume to rate:
// src/database/rrdset-collection.c:387
new_value /= (NETDATA_DOUBLE)st->update_every;
The stored value is a rate (events/second). When the sum time-grouping aggregates these values, it simply adds them up without multiplying by the time interval each point represents:
// src/web/api/queries/sum/sum.h:32
static inline void tg_sum_add(RRDR *r, NETDATA_DOUBLE value) {
    struct tg_sum *g = (struct tg_sum *)r->time_grouping.data;
    g->sum += value; // raw addition, no interval scaling
}
Example
6000 events over 1 minute, uniform rate of 100 events/s:
| update_every | Points in 1m | Each point value | sum -1m result | Actual volume |
| --- | --- | --- | --- | --- |
| 1s | 60 | 100 events/s | 6000 | 6000 |
| 5s | 12 | 100 events/s | 1200 | 6000 |
| 10s | 6 | 100 events/s | 600 | 6000 |
The result equals total_events / update_every, which only equals total_events when update_every = 1.
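The arithmetic in the example can be reproduced with a small standalone model. This is a sketch under stated assumptions, not Netdata code: `stored_rate` and `sum_over_window` are hypothetical names, events are assumed to arrive uniformly, and the window is assumed to be a whole multiple of update_every.

```c
#include <assert.h>

/* Hypothetical model, not Netdata code: the rate stored for one collection
 * interval, mirroring `new_value /= update_every` in the ingestion path. */
static double stored_rate(double counter_delta, int update_every) {
    assert(update_every > 0);
    return counter_delta / (double)update_every;
}

/* What a plain sum over the window yields when events arrive uniformly at
 * events_per_second and one point is stored every update_every seconds. */
static double sum_over_window(double events_per_second, int window_seconds,
                              int update_every) {
    assert(update_every > 0 && window_seconds % update_every == 0);
    int points = window_seconds / update_every;
    double delta_per_point = events_per_second * (double)update_every;
    double sum = 0.0;
    for (int i = 0; i < points; i++)
        sum += stored_rate(delta_per_point, update_every); /* no interval scaling */
    return sum;
}
```

Calling `sum_over_window(100.0, 60, 5)` models the 5s case: twelve points of 100 events/s are added without weighting, so the total is the real volume divided by update_every.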
Impact
- Affected alerts: Any health alert using lookup: sum on an INCREMENTAL dimension with update_every != 1. For example, web_log.conf uses lookup: sum -1m unaligned to count "number of HTTP requests in the last minute"; this is correct only at the default 1s collection interval.
- Percentage-based alerts are unaffected: Alerts that compute percentages (e.g., $this * 100 / $total) are not affected because both numerator and denominator are off by the same factor, so the ratio cancels out.
- Severity: Low in practice because most collectors default to update_every=1, but the behavior is incorrect and surprising for users who configure longer intervals.
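The cancellation in percentage-based alerts can be checked with a short sketch. The helper name `percent_from_sums` is hypothetical, and it assumes both dimensions are collected with the same update_every.

```c
#include <assert.h>

/* Hypothetical helper: a percentage built from two plain sums of rate points.
 * Each sum equals the real volume divided by update_every, so the shared
 * factor cancels in the ratio. */
static double percent_from_sums(double part_volume, double total_volume,
                                int update_every) {
    assert(update_every > 0 && total_volume > 0.0);
    double part_sum = part_volume / (double)update_every;   /* what sum returns */
    double total_sum = total_volume / (double)update_every; /* same scaling error */
    return part_sum * 100.0 / total_sum;
}
```

With 600 matching events out of 6000, the helper returns the same percentage for update_every values of 1, 5, and 10.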
Suggested Fix
Either:
1. Add a volume (or integral) time-grouping that multiplies each rate value by the point's time span before summing, correctly computing the area under the rate curve.
2. Make sum interval-aware when operating on INCREMENTAL dimensions: multiply each value by the actual time delta between consecutive points.
3. Document the limitation: clarify that sum on rate-based (INCREMENTAL) dimensions only gives correct volume at 1s collection intervals, and explain how to use calc: $this * $update_every as a workaround.
Option 1 is cleanest — a dedicated volume aggregation that always correctly integrates rate data, regardless of collection interval. This would also be useful in the API (&group=volume).
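A minimal sketch of what option 1 might look like, assuming a state struct modeled on the tg_sum pattern. `tg_volume_sketch`, `tg_volume_sketch_add`, and `volume_of_points` are illustrative names, not existing Netdata symbols, and how the query engine would supply each point's time span is left open here.

```c
#include <assert.h>

/* Hypothetical state for a volume time-grouping; the name and layout are
 * assumptions modeled on the tg_sum pattern, not existing Netdata code. */
struct tg_volume_sketch {
    double volume;
};

/* Add one rate point (events/second) weighted by the seconds it covers. */
static void tg_volume_sketch_add(struct tg_volume_sketch *g, double rate,
                                 double point_seconds) {
    assert(point_seconds > 0.0);
    g->volume += rate * point_seconds;
}

/* Integrate n rate points that each cover update_every seconds. */
static double volume_of_points(const double *rates, int n, int update_every) {
    struct tg_volume_sketch g = { 0.0 };
    for (int i = 0; i < n; i++)
        tg_volume_sketch_add(&g, rates[i], (double)update_every);
    return g.volume;
}
```

When every point covers the same interval, this equals the plain sum multiplied by update_every, which is what the calc workaround in option 3 computes. Weighting each point individually would also stay correct if point spans differ within the window.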
How This Was Found
During review of PR #22077 (audit subsystem monitoring), Copilot flagged lookup: sum -1m of lost on an INCREMENTAL chart. Investigation traced through the ingestion and query code paths to confirm the behavior.