Health alerts: sum aggregation on INCREMENTAL dimensions produces wrong volume for non-1s update_every #22112

## Summary

The `sum` time-grouping aggregation in health alert lookups does not compute total event volume correctly on `RRD_ALGORITHM_INCREMENTAL` dimensions when `update_every` is anything other than 1 second. The result is too small by a factor of `update_every`.

## Root Cause

The ingestion pipeline divides the counter delta by `update_every` to convert volume to rate:

```c
// src/database/rrdset-collection.c:387
new_value /= (NETDATA_DOUBLE)st->update_every;
```

The stored value is therefore a **rate** (events/second), not a per-interval event count. When the `sum` time-grouping aggregates these values, it adds them up without multiplying each one by the time interval that point represents:

```c
// src/web/api/queries/sum/sum.h:32
static inline void tg_sum_add(RRDR *r, NETDATA_DOUBLE value) {
    struct tg_sum *g = (struct tg_sum *)r->time_grouping.data;
    g->sum += value;   // raw addition, no interval scaling
}
```

## Example

6000 events over 1 minute, uniform rate of 100 events/s:

| `update_every` | Points in 1m | Each point value | `sum -1m` result | Actual volume |
|:-:|:-:|:-:|:-:|:-:|
| 1s | 60 | 100 events/s | **6000** | 6000 |
| 5s | 12 | 100 events/s | **1200** | 6000 |
| 10s | 6 | 100 events/s | **600** | 6000 |

The result equals `total_events / update_every`, which only equals `total_events` when `update_every = 1`.
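
The table can be reproduced with a small standalone sketch. It is not Netdata code: `sum_of_rates` and its parameters are hypothetical names that mimic the two steps quoted under Root Cause (divide the counter delta by `update_every` on ingestion, then add the stored values), assuming a perfectly uniform event rate and a window that is an exact multiple of the collection interval.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Netdata): returns what a plain sum of
 * stored rates would report over a window, for a counter that grows at a
 * uniform events_per_second and is collected every update_every seconds. */
static double sum_of_rates(double events_per_second, int update_every, int window_seconds) {
    size_t points = (size_t)(window_seconds / update_every);
    double sum = 0.0;

    for (size_t i = 0; i < points; i++) {
        /* counter delta observed in one collection interval */
        double delta = events_per_second * (double)update_every;
        /* ingestion step: the delta is converted to a per-second rate */
        double stored = delta / (double)update_every;
        /* query step: rates are added with no interval scaling */
        sum += stored;
    }
    return sum;
}
```

Under these assumptions, `sum_of_rates(100.0, 5, 60)` evaluates to 1200 and `sum_of_rates(100.0, 10, 60)` to 600, matching the table; multiplying either result by its `update_every` recovers the actual volume of 6000.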

## Impact

- **Affected alerts**: Any health alert using `lookup: sum` on an INCREMENTAL dimension with `update_every != 1`. For example, `web_log.conf` uses `lookup: sum -1m unaligned` to count "number of HTTP requests in the last minute"; this is correct only at the default 1s collection interval.
- **Percentage-based alerts are unaffected**: Alerts that compute percentages (e.g., `$this * 100 / $total`) are not affected because both numerator and denominator are off by the same factor, which cancels out in the ratio.
- **Severity**: Low in practice because most collectors default to `update_every=1`, but the behavior is incorrect and surprising for users who configure longer intervals.

## Suggested Fix

One of the following:

1. **Add a `volume` (or `integral`) time-grouping** that multiplies each rate value by the point's time span before summing, correctly computing the area under the rate curve.

2. **Make `sum` interval-aware** when operating on INCREMENTAL dimensions, multiplying each value by the actual time delta between consecutive points.

3. **Document the limitation**: clarify that `sum` on rate-based (INCREMENTAL) dimensions gives the correct volume only at 1s collection intervals, and explain how to use `calc: $this * $update_every` as a workaround.

Option 1 is cleanest: a dedicated `volume` aggregation would always integrate rate data correctly, regardless of collection interval. It would also be useful in the API (`&group=volume`).
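
A minimal sketch of the accumulation Option 1 describes is shown below. Everything in it is an assumption for illustration: `struct tg_volume_sketch`, `tg_volume_sketch_add`, and `volume_of_points` are hypothetical names, plain `double` stands in for `NETDATA_DOUBLE`, and the time span is passed in explicitly. The existing `tg_sum_add` receives only the value, so a real implementation would need some way to obtain each point's span from the query context.

```c
#include <stddef.h>

/* Hypothetical accumulator for a "volume" time-grouping (illustrative only). */
struct tg_volume_sketch {
    double volume; /* integrated events so far */
};

static void tg_volume_sketch_reset(struct tg_volume_sketch *g) {
    g->volume = 0.0;
}

/* rate: stored value in events/s
 * span_seconds: duration covered by this point (assumed to be known) */
static void tg_volume_sketch_add(struct tg_volume_sketch *g, double rate, double span_seconds) {
    g->volume += rate * span_seconds; /* area of one rate * time rectangle */
}

/* Integrates n equally spaced rate points into a total event count. */
static double volume_of_points(const double *rates, size_t n, double span_seconds) {
    struct tg_volume_sketch g;
    tg_volume_sketch_reset(&g);
    for (size_t i = 0; i < n; i++)
        tg_volume_sketch_add(&g, rates[i], span_seconds);
    return g.volume;
}
```

With twelve points of 100 events/s and a 5-second span, `volume_of_points` yields 6000, the actual volume from the example table, where plain `sum` reports 1200.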

## How This Was Found

During review of PR #22077 (audit subsystem monitoring), Copilot flagged `lookup: sum -1m of lost` on an INCREMENTAL chart. Investigation traced through the ingestion and query code paths to confirm the behavior.
