refreshDataParts catch block missing scheduleAfter causes permanent task death on transient error #102045

@clickgapai

Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.

Retrospective finding from a historical scan of PR #76467 (merged 2025-04-11). Confirmed on current codebase — close with a note if already fixed.

Describe what's wrong

If any exception is thrown during MergeTreeData::refreshDataParts (e.g., temporary disk unavailability or a network timeout), the background refresh task stops permanently and never runs again, so the readonly table stays stale until the server is restarted.

Root cause: the catch block at MergeTreeData.cpp:2717-2720 is missing a call to refresh_parts_task->scheduleAfter(interval_milliseconds). Compare with refreshStatistics, whose catch block at lines 2759-2766 reschedules correctly.

Why we believe this is a bug: MergeTreeData::refreshDataParts (MergeTreeData.cpp:2621) is written as a function-try-block. The scheduleAfter call is at line 2715, inside the try block, so it is skipped whenever an earlier statement throws. The catch block (lines 2717-2720) only logs the error and does not reschedule the task, unlike refreshStatistics (lines 2759-2766), which reschedules in its catch block.

Affected locations:

  • src/Storages/MergeTree/MergeTreeData.cpp:2717 — catch block of refreshDataParts missing scheduleAfter

Impact: A single transient error (network hiccup, temporary disk unavailability) permanently disables background data refresh for readonly MergeTree tables. The table becomes stale until server restart.

Does it reproduce on most recent release?

Likely yes — see testability note in additional context.

How to reproduce

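No runtime reproduction is available: this is a code-level bug identified through source analysis. In principle, any exception thrown inside refreshDataParts on a readonly MergeTree table should trigger it. See the root cause and affected locations above for the specific code paths involved.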

Expected behavior

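After a failed refresh, the catch block should log the error and reschedule the task with refresh_parts_task->scheduleAfter(interval_milliseconds), so that the next refresh still runs and the table recovers once the transient error clears.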

Error message and/or stacktrace

No error message or stack trace was captured; the catch block only logs the exception. See the root cause description above.

Additional context

Open risks:

  • Any code path that throws in refreshDataParts triggers this: disk iteration, part loading, and part commit can all throw.

Suggested fix: Add refresh_parts_task->scheduleAfter(interval_milliseconds) to the catch block, matching the pattern used in refreshStatistics at lines 2762-2763.
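
The failure mode and the suggested fix can be illustrated with a small standalone sketch. This is not the ClickHouse code: `MockTask`, `refreshOnce`, `survivesTransientError`, and the `fail` / `reschedule_in_catch` flags are hypothetical names introduced for illustration, and only `scheduleAfter` and `interval_milliseconds` are taken from the report above. The sketch assumes that a task runs again only if it re-arms itself during the current run.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for a background task handle (an assumption, not
// the real scheduler): it only records whether a next run is pending.
struct MockTask
{
    bool scheduled = false;
    uint64_t delay_ms = 0;

    void scheduleAfter(uint64_t ms)
    {
        scheduled = true;
        delay_ms = ms;
    }
};

// One run of a refresh body written as a function-try-block.
// `fail` simulates a transient error; `reschedule_in_catch` switches
// between the reported behavior (false) and the suggested fix (true).
void refreshOnce(MockTask & task, uint64_t interval_milliseconds, bool fail, bool reschedule_in_catch)
try
{
    task.scheduled = false; // this run consumes the pending schedule

    if (fail)
        throw std::runtime_error("transient disk error");

    // Success path: re-arm the task. Skipped if anything above throws.
    task.scheduleAfter(interval_milliseconds);
}
catch (const std::exception & e)
{
    std::cerr << "refresh failed: " << e.what() << '\n';

    // Suggested fix: re-arm in the error path as well.
    if (reschedule_in_catch)
        task.scheduleAfter(interval_milliseconds);
}

// Returns true if the task is still scheduled after one failing run.
bool survivesTransientError(bool reschedule_in_catch)
{
    MockTask task;
    task.scheduleAfter(0); // initial activation
    refreshOnce(task, 1000, /*fail=*/true, reschedule_in_catch);
    return task.scheduled;
}
```

Under these assumptions, `survivesTransientError(false)` returns false (the task is never re-armed after the error), while `survivesTransientError(true)` returns true. Rescheduling in both the success path and the catch block keeps the task alive regardless of which statement throws.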

Analysis details: Confidence HIGH | Severity P1 | Testability: THEORETICAL

Found during automated review of PR #76467.


ClickGapAI · Confidence: HIGH · Severity: P1 · Finding: h_pr76467_001
