Skip to content

Netdata fixes part 35#22975

Open
stelfrag wants to merge 9 commits into
netdata:masterfrom
stelfrag:netdata-fixes-35
Open

Netdata fixes part 35#22975
stelfrag wants to merge 9 commits into
netdata:masterfrom
stelfrag:netdata-fixes-35

Conversation

@stelfrag

@stelfrag stelfrag commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator
Summary

Summary by cubic

Fixes concurrency and lifetime issues in RAM-backed dimensions and alert readers, and makes chart last-access timestamps atomic. Prevents use-after-free and undefined casts without changing metric or API behavior.

  • Bug Fixes

    • RAM storage: cache data, memsize, and mode in the refcounted metric handle; detach stale handles from the Judy index on RRDDIM delete/replace; free buffers on final release; publish mh->rd with acquire/release.
    • Lookups/creation: acquire dictionary items while checking flags and destroy_lock in rrddim_find() and rrdset_create_custom() to avoid races with deletion.
    • Alerts: add rrdcalc_rrdset_read_lock()/unlock and use it in analytics, host status, health JSON, notifications, info API, health variable scoring, and chart variables API to guard rc->rrdset reads.
    • Badges: reject negative or oversized time values before size_t casts; render “undefined” instead of triggering undefined behavior.
  • Refactors

    • Add atomic helpers for RRDSET.last_accessed_time_s and replace all direct reads/writes (service cleanup, queries, API endpoints, obsolete/revive paths).
    • Update collectors/queries to read/write via the cached metric-handle buffer (mh->data) and tighten internal checks/log messages.

Written for commit b627b28. Summary will update on new commits.

Review in cubic

ktsaou added 8 commits July 5, 2026 10:10
Current issue: legacy RAM/ALLOC/NONE query handles can outlive the RRDDIM lock. The query setup duplicates the storage metric handle, releases the RRDDIM, and later reads legacy sample data through that handle. Dimension deletion released the owner handle and then freed or unmapped rd->db.data, so an in-flight query could dereference freed heap memory or an unmapped RAM buffer.

Fix: cache the legacy data buffer and buffer metadata in the refcounted RAM metric handle, detach deleting handles from the UUID index, and let the final handle release free the retained buffer. Same-UUID replacement now removes stale indexed handles and retries with a fresh handle instead of reattaching an old handle to a new RRDDIM.

Functional impact: no intended change to valid collection or query results. The deletion-race boundary changes from use-after-free or unmapped read to preserving the old buffer until outstanding references release it.

Performance impact: query reads use the cached handle data pointer instead of rd->db.data. Deletion and same-UUID replacement add rare Judy-index detach/retry work; the hot collection/query path has no new allocation or locking.

Validation: git diff --check passed for src/database/ram/rrddim_mem.c, src/database/ram/rrddim_mem.h, and src/database/rrddim.c. Direct compilation of rrddim.c and rrddim_mem.c with the worker CMake compile commands passed; the full worker object target was blocked by an unrelated missing generated ACLK proto input in the isolated worktree.

Reviewer verdicts: required reviewers approved the actual revised diff with no blockers. Optional review also approved. Nonblocking follow-up questions about broader legacy collect/query synchronization and in-place same-UUID snapshot semantics were recorded separately for decision.

(cherry picked from commit 74c77fc)
Current issue:

The badge time-unit formatter treats seconds, minutes, and hours as unsigned durations, but a finite negative metric value can reach the (size_t)value casts through badge rendering and shared alert formatting. Converting a negative floating value to size_t is undefined behavior.

Solution:

Reject finite negative values in the seconds, minutes, and hours branches before the cast and reuse the existing undefined output path used for NaN and infinity.

Current vs potential:

Current source bug with data-dependent public badge API and alert-formatting triggers.

Functional impact:

Positive finite values, zero, NaN, and infinity keep their existing output. Finite negative time-unit values now render undefined instead of invoking undefined behavior.

Performance impact:

Negligible: one extra floating-point comparison in each of the three special time-unit branches.
(cherry picked from commit dc58a8c)
Current issue:

rrddim_add_custom() used rrddim_index_find(), which returns the raw dictionary_get() value, then touched rd->destroy_lock before the insert/conflict path reacquired the dictionary item. dictionary_get() releases its lookup reference before returning, so a service-thread archive can delete the RRDDIM between the lookup and spinlock access.

Current vs potential:

Current concurrency-triggered lifetime bug. The unsafe code path exists in current source; a crash or use-after-free requires a racing obsolete-dimension delete.

Solution:

Acquire the RRDDIM dictionary item while the existing RRDDIM is dereferenced for the trylock/revive step, and release the acquired item on both success and retry paths. The dictionary_set_advanced() insert/conflict path and rrddim_add_custom() return contract are unchanged.

Functional impact:

No intended behavior change. Existing revive behavior is preserved; only the lifetime of the pointer used for destroy_lock is protected.

Performance impact:

Negligible. The existing-dimension path keeps the dictionary item acquired across the lock/revive dereference window; dictionary_get() already performed an acquire/release internally.

Security impact:

Reduces a possible use-after-free and crash window.

Validation:

git diff --check passed. The focused rrddim.c compile command from compile_commands.json passed with object output redirected to /tmp. The generated Ninja object target remains blocked in this checkout by unrelated missing ACLK protobuf input.

Reviewer gate:

The worker review gate approved with no blockers from minimax and deepseek. A focused final qwen review of the cherry-picked diff also approved with no blockers.

(cherry picked from commit 9d498b0)
Metric handles keep a back-pointer to the live RRDDIM while collectors can read it and dimension teardown can clear it from different threads. The previous plain loads and stores of mh->rd created a data race around that publication point: current flows exercise both paths when RAM-backed dimensions are collected while the dimension dictionary runs release/destructor work.

Use acquire/release helper accessors for the RRDDIM pointer and route all non-helper loads and stores through them. The fix preserves the existing pointer value, lifetime model, locking, retry behavior, and fatal consistency checks; it only makes the publication edge defined for concurrent readers and writers.

Functional impact: no intended behavior change. The same RRDDIM pointer is observed and cleared at the same call sites, with the same control flow and error handling.

Performance impact: minimal. The added acquire/release operations touch one pointer on get/release/collection consistency checks and do not add locks, allocations, or extra searches.

Validation: git diff --check; focused rrddim_mem.c object build passed. Required reviewers approved the one-issue cluster.
(cherry picked from commit dfea9d3)
rrddim_find() and rrdset_create_custom() dereferenced values returned from dictionary helpers after the helper had already released the dictionary item reference. Concurrent deletion can therefore free or clear the embedded RRDDIM/RRDSET while these functions are still checking flags, touching destroy_lock, or updating last-access metadata.

This is a current source bug in exercised lookup/create paths, although it requires a concurrent delete race to manifest at runtime. The public raw-pointer return contracts are intentionally unchanged; the fix only protects dereferences performed inside these functions.

Use acquired dictionary items while reading the dimension/chart values locally, release every acquired item on the same control path, and keep the existing add-or-existing behavior for chart creation by switching the add helper to the acquired-item variant.

Functional impact: valid lookups, obsolete filtering, chart conflict handling, name-index updates, metadata updates, and returned pointer contracts are unchanged. The only intended behavior change is preventing local use of a value after the dictionary is allowed to delete it.

Performance impact: adds one dictionary item acquire/release pair to rrddim_find() and up to two acquire/release pairs to rrdset_create_custom(). This is small atomic refcount overhead on lookup/create paths and does not add a new blocking lock around the protected dereferences.

(cherry picked from commit 0d8b1e5)
Root cause: several alert summary/export readers checked RRDCALC.rrdset and then dereferenced linked chart fields while rrdcalc_unlink_from_rrdset() can clear that link under the chart alert spinlock. The same check/use race existed in the raised-alert notification summary path.

Current vs potential: current bug. These readers are exercised by the info API alert summary, host status health aggregation, health JSON export, analytics counters, and raised-alert notification summary.

Solution: add a small RRDCALC-to-RRDSET read-lock helper that snapshots rc->rrdset, takes the chart alert read lock, rechecks the link, and returns NULL when the alert is concurrently unlinked. Use it at the affected readers before touching chart fields.

Functional impact: stable linked alerts produce the same counts and JSON fields. During concurrent unlink, the reader now skips the alert instead of racing a NULL or stale rc->rrdset dereference.

Performance impact: adds one chart alert read lock/unlock around linked-alert summary reads. There is no new allocation, I/O, network access, or asymptotic traversal change.

(cherry picked from commit 60e6c74)
Current issue: the seconds, minutes, and hours badge-duration branches accepted finite positive NETDATA_DOUBLE values after only negative, NaN, and infinity checks, then cast them to size_t. C floating-to-integer conversion is undefined when the integral part is not representable by size_t, so public badge formatting and shared alert-value formatting could reach undefined behavior for oversized duration values.

Root cause: the invalid-value guard matched the older negative/NaN/Inf failure cases but did not encode the size_t conversion boundary. The three branches duplicated that incomplete predicate immediately before their size_t casts.

Solution: add a shared badge_time_value_is_undefined() predicate and use it before all three duration casts. The predicate keeps the existing negative, NaN, and infinity behavior and adds the exclusive upper bound SIZE_MAX + 1 in NETDATA_DOUBLE space, preserving exactly representable SIZE_MAX values while rejecting values whose integral part cannot fit in size_t.

Functional impact: valid finite positive duration values below the exclusive size_t bound keep the same formatting, zero still renders as now, and negative, NaN, and infinity still render as undefined. Only oversized finite duration values change from undefined C behavior to the existing undefined output used for invalid time-unit values.

Caller-contract impact: format_value_and_unit() still writes into and returns the caller-provided buffer. Badge SVG, health log, SQLite health loading/export, and ACLK alert serialization callers keep the same ownership and string-consumption contracts; they only receive the existing undefined string for a newly guarded invalid input case.

Performance impact: negligible. The change adds one floating-point comparison in the three special time-unit branches only, with no allocation, I/O, locking, loop, shared state, or non-time-unit formatting impact.

(cherry picked from commit e0c382f)
RRDSET.last_accessed_time_s is written from chart lookup, dimension lookup, API response, query-target, obsolete-flag, and dictionary-react paths, while the service cleanup path reads it when deciding whether an obsolete chart is old enough to delete. Those paths can run concurrently, so the plain C load/store pair is a current source-level data race on the timestamp field.

Add small static inline helpers for relaxed atomic load, store, and touch operations, matching the existing relaxed-atomic metadata helper style in rrdset.h. Convert every direct access to the helpers so the field is never mixed between atomic and non-atomic access.

Current vs potential: current data race in the codebase. Runtime symptoms are potential and limited to an eviction hint: stale or torn last-access values could affect obsolete chart cleanup timing.

Functional impact: values written are unchanged, cleanup compares the same timestamp against the same threshold, and dictionary ownership, chart lifetime, locks, return values, public APIs, storage data, and wire formats are unchanged.

Performance impact: one relaxed atomic load or store replaces each plain timestamp access. No locks, allocations, syscalls, additional scans, or new runtime paths are introduced.

(cherry picked from commit c3331d1)

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 19 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Re-trigger cubic

Root cause: two same-class RRDCALC.rrdset readers were still outside the read-lock helper added for alert summary paths. health_api_v1_chart_variables2json() read rc->rrdset fields while the host alert dictionary entry could still exist after rrdcalc_unlink_from_rrdset() cleared the chart link. variable_lookup_add_result_with_score() checked the current alert link and then dereferenced its rrdlabels without stabilizing the link. Both are current races in exercised API and health expression lookup paths.

Solution: use rrdcalc_rrdset_read_lock() at both remaining dereference sites, read the chart fields only while the chart alert read lock validates that rc->rrdset is still linked, and keep the existing NULL-link behavior when the helper returns NULL.

Functional impact: stable linked alerts keep the same JSON fields, values, and label scores. During concurrent unlink, the API skips the unlinked alert and expression scoring preserves the existing NULL-link score behavior instead of dereferencing a stale or NULL chart link.

Performance impact: adds one per-chart alert read lock/unlock around the existing field snapshot or label-score calculation. There is no new I/O, allocation pattern, broad traversal, or asymptotic change.

(cherry picked from commit 8888ea8)
@sonarqubecloud

sonarqubecloud Bot commented Jul 5, 2026

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread src/health/health_variable.c
@stelfrag stelfrag marked this pull request as ready for review July 5, 2026 20:17
Copilot AI review requested due to automatic review settings July 5, 2026 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants