Skip to content

Netdata fixes part 32#22972

Open
stelfrag wants to merge 6 commits into
netdata:masterfrom
stelfrag:netdata-fixes-32
Open

Netdata fixes part 32#22972
stelfrag wants to merge 6 commits into
netdata:masterfrom
stelfrag:netdata-fixes-32

Conversation

@stelfrag

@stelfrag stelfrag commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator
Summary

Summary by cubic

Fixes edge cases in time and string handling to prevent false hibernation/stale states and remove undefined behavior across health, pluginsd, facets, registry, and label parsing. Adds label matcher tests for '=' edge cases.

  • Bug Fixes
    • Health: compute deltas only when clocks move forward and compare without overflow to avoid false hibernation detection.
    • Pluginsd: clamp vnode age when now <= last_seen to prevent false stale transitions on clock rollback.
    • Facets: reject invalid facet hash strings and clear transient current-value fields on reset to avoid stale reads.
    • Registry: trim trailing spaces without forming a pointer before the string start.
    • Labels: guard pattern trimming around '=' and stop on terminal '='; add unit tests for empty-value and malformed '=' matchers.

Written for commit 5b682c6. Summary will update on new commits.

Review in cubic

ktsaou and others added 6 commits July 5, 2026 09:16
simple_pattern_trim_around_equal() currently reads before its mallocz buffer when a label matcher starts with '='. A trailing '=' also lets the '=' branch copy the NUL and advance the input pointer past the terminator, so the next loop condition reads beyond the string; depending on adjacent heap contents the leading case could also move dst before the allocation and write there.

Guard the previous-byte check with dst > store and break after processing terminal '=' so non-terminal fall-through behavior stays byte-for-byte unchanged.

Functional impact: preserves existing host/chart label matcher parsing, including malformed leading and trailing '=' semantics covered by rrdlabels tests.

Performance impact: one pointer comparison only on '=' bytes, no new allocations, and no extra pass over the input.

Security impact: removes undefined out-of-bounds reads and potential one-byte output underflow/overflow paths.

Validation: git diff --check; ASAN CMake configuration; ASAN object build for health_prototypes.c and rrdlabels.c; ASAN build-asan-min/netdata -W rrdlabelstest.

Independent source review: three required reviewers approved with no blockers.

(cherry picked from commit a2aeaeb)
(cherry picked from commit 57ac9b93f167d0d0fba32c18e71860dc4bc6cf69)
MEM-pluginsd-004 is a potential integer underflow: pluginsd_host() reconstructs vnode last_seen from the parser-local Judy array and casts (now - last_seen) to uint32_t. If CLOCK_REALTIME moves backward after the vnode timestamp is refreshed, the negative delta wraps to a huge age and can incorrectly mark an active virtual node stale.

Current vs potential: HOST localhost stale scanning is a current PLUGINSD parser path, but the faulty branch requires a wall-clock rollback or equivalent realtime skew, so this is classified as potential.

Functional impact: normal stale detection is unchanged when now is after last_seen. Future last_seen values are treated as age 0, preventing false vnode stale transitions, online flag churn, node-state updates, and streaming stop signaling caused by clock skew.

Performance impact: adds one signed time_t comparison per tracked vnode during stale scans; no allocations, locks, or additional scans.

Solution: compute seen_seconds_ago only when now is after last_seen and clamp future or equal timestamps to zero before comparing with node_stale_after_seconds.
(cherry picked from commit fd2c57d)
(cherry picked from commit c9b0e14ff648b9aae3e77d7ee7c618b73e28758e)
facets_reset_key() marked a key as FACET_KEY_VALUE_NONE and reset the hash, but left the transient current-value payload from the previous row in place. That included the borrowed raw pointer, raw length, reusable buffer length, and current FACET_VALUE pointer.

Current vs potential: current valid flows appear to gate current-value reads on FACET_KEY_VALUE_UPDATED or repopulate missing values before reading them, so this is a potential stale-state bug rather than a proven currently exercised read. Leaving stale payload fields after clearing the update flag still makes future or accidental ungated reads observe prior-row state.

Solution: reset raw, raw_len, b->len, and v at the same point that the key is marked as having no current value. This matches the existing empty and unsampled setter pattern for the raw fields while preserving ownership of the key-owned reusable buffer.

Functional impact: valid facet processing is unchanged because valid readers must not consume current-value payload after FACET_KEY_VALUE_NONE and the next value assignment repopulates these fields before use. Invalid stale reads now see null or empty reset state instead of a previous row's payload.

Performance impact: negligible. The reset helper gains four scalar stores and adds no allocation, freeing, locking, hashing, branching, or copying.

(cherry picked from commit 08d9439)
(cherry picked from commit 0088b803aaa95f873c4ac27ce8df9e0c5203e469)
Issue: check_if_resumed_from_suspension() subtracted realtime and monotonic samples as unsigned usec_t values before proving the clocks had moved forward. A backward CLOCK_REALTIME step could wrap realtime - last_realtime into a huge value and falsely trigger hibernation delay.

Current vs potential: the arithmetic bug is current in the source. Runtime impact is potential because it requires a backward realtime sample between health loop iterations.

Solution: require realtime to move forward and monotonic time to not move backward before computing deltas, then compare realtime_delta > 2 * monotonic_delta with subtraction instead of multiplication to avoid unsigned overflow.

Functional impact: normal forward-clock hibernation detection is unchanged. Backward or non-forward samples now return false instead of feeding wrapped deltas into the detector.

Performance impact: two branch predicates and two local delta variables once per health loop iteration; no allocation, locking, I/O, or lifecycle changes.

Validation: git diff --check passed; the touched health_event_loop object compiled with low-priority build; codegraph and rg found one direct caller; external reviewers glm, minimax, deepseek, and kimi approved the actual diff.
(cherry picked from commit 732a609)
(cherry picked from commit 450e9d0674055514326cae86c053649c60a5c3bf)
str_to_facets_hash() decoded eleven fixed character offsets before validating the local string contract. The current code paths already validate externally supplied facet ids before they reach this helper, so this is a potential helper-local safety bug rather than a currently exercised public-input crash path.

Keep the fix surgical by rejecting NULL and any string whose bounded length is not the canonical FACET_STRING_HASH_SIZE - 1 before the fixed-offset decode. Valid generated or validated facet ids still decode exactly as before, while invalid direct helper inputs return FACETS_HASH_ZERO, matching the existing sentinel behavior used by the caller paths.

Functional impact: none for valid facet ids and no intended behavior change for the existing request flows; invalid helper-local input is rejected earlier instead of being decoded from undersized data.

Performance impact: negligible. The added strnlen() is bounded to FACET_STRING_HASH_SIZE and runs only while parsing or registering selected facet ids, not in the per-row log scan path.

Validation: worker traced direct callers with codegraph and rg, verified request-path impact, ran git diff --check, configured CMake with low-priority scheduling, and built the libnetdata target. Reviewer gate: glm, minimax, deepseek, and kimi approved after semantic review; minimax required a same-scope rerun because the first output was incomplete.
(cherry picked from commit 722b6fc)
(cherry picked from commit f0c95f780a4feb12d3041cc4a8a7a12ae24ccd75)
registry_fix_machine_name() normalizes leading, internal, and trailing whitespace by walking a mutable string. Its old trailing-trim loop used while (--t >= s), which can form a pointer one byte before the start of the string before the relational bounds check is evaluated. For empty or all-trimmed normalized input, that is undefined behavior even when the pointer is not dereferenced.

Current vs potential bug: the undefined pointer expression is source-level current in the helper for empty normalized input. Runtime reachability is plausible through registry request or log-replay inputs, but the fix does not depend on proving a crash path because the existing loop violates the C pointer bounds contract.

Solution: keep t as the one-past-end pointer from the forward scan and trim with while (t > s) plus t[-1]. When the previous character is a space, decrement t inside the assignment and write the terminator; otherwise stop. The final length remains t - s.

Functional impact: intended behavior is unchanged. The helper still skips leading whitespace, converts internal whitespace to spaces, trims trailing spaces in place, returns the same normalized pointer, and reports the same length for valid inputs.

Performance impact: neutral. The algorithm remains a forward O(n) normalization pass plus backward O(n) trim pass, with no allocation, locking, hashing, I/O, or dictionary work.

Validation: inspected the staged diff and kept the change limited to registry_fix_machine_name(). Worker validation passed whitespace checks, a low-priority CMake configure, and a low-priority registry_internals object build. Required local reviewer quorum approved the source-level fix.
(cherry picked from commit 5ab5428)
(cherry picked from commit de24a1dd6a3d940534ef8a1e024b673433eaa45c)

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 6 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Health as Health Event Loop
    participant Pluginsd as Pluginsd Parser
    participant Facets as Facets Engine
    participant Registry as Registry Internals
    participant Labels as RRDLabels Unit Test

    Note over Health,Labels: Runtime flow for edge case fixes

    Health->>Health: check_if_resumed_from_suspension()
    Health->>Health: Get realtime and monotonic clocks
    alt Clock moving forward
        Health->>Health: Compute realtime_delta and monotonic_delta
        alt realtime_delta > monotonic_delta AND realtime_delta - monotonic_delta > monotonic_delta
            Health->>Health: Set resumed from suspension flag
        end
    else Clock rollback / stalled
        Health->>Health: Skip detection (no false hibernation)
    end

    Pluginsd->>Pluginsd: pluginsd_host() processing vnode age
    Note over Pluginsd: Clamp seen_seconds_ago to 0 when now <= last_seen
    Pluginsd->>Pluginsd: Compute (now > last_seen) ? now - last_seen : 0
    alt seen_seconds_ago >= stale_after_seconds
        Pluginsd->>Pluginsd: Clear RRDHOST_OPTION_VIRTUAL_HOST (stale)
    end

    Facets->>Facets: str_to_facets_hash() with input string
    opt Input is NULL or not exact hash length
        Facets->>Facets: Return FACETS_HASH_ZERO (reject invalid)
    end

    Facets->>Facets: facets_reset_key() for key cleanup
    Facets->>Facets: Clear current_value.raw, raw_len, b->len, v
    Facets->>Facets: Reset transient fields to avoid stale reads

    Registry->>Registry: registry_fix_machine_name() trimming trailing spaces
    loop While t > s
        alt t[-1] == ' '
            Registry->>Registry: Decrement t and set to null
        else
            Registry->>Registry: Break on non-space
        end
    end
    Note over Registry: No pointer before start of string

    Labels->>Labels: rrdlabels_unittest_check_pattern_list()
    Labels->>Labels: Test =_hostname (false - invalid pattern)
    Labels->>Labels: Test _hostname= (true - empty value OK)
    Labels-->>Labels: Assert match/no-match as expected
Loading

Re-trigger cubic

@sonarqubecloud

sonarqubecloud Bot commented Jul 5, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
D Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

@stelfrag stelfrag marked this pull request as ready for review July 5, 2026 20:16
Copilot AI review requested due to automatic review settings July 5, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants