Skip to content

[fix](cloud) Avoid recycle the instance which has successor#62234

Merged
gavinchou merged 1 commit intoapache:masterfrom
w41ter:snapshot/fix/avoid_recycle_rollbacked_instance_directly
Apr 9, 2026
Merged

[fix](cloud) Avoid recycle the instance which has successor#62234
gavinchou merged 1 commit intoapache:masterfrom
w41ter:snapshot/fix/avoid_recycle_rollbacked_instance_directly

Conversation

@w41ter
Copy link
Copy Markdown
Contributor

@w41ter w41ter commented Apr 8, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Because the successor instance may still need the data

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Because the successor instance may still need the data
@w41ter w41ter added the cloud label Apr 8, 2026
@w41ter w41ter requested a review from gavinchou as a code owner April 8, 2026 10:54
@w41ter w41ter added the snapshot label Apr 8, 2026
@w41ter w41ter requested a review from dataroaring as a code owner April 8, 2026 10:54
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 8, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@w41ter
Copy link
Copy Markdown
Contributor Author

w41ter commented Apr 8, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 6.25% (2/32) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.13% (32287/50344)
Region Coverage 65.10% (16253/24967)
Branch Coverage 55.61% (8686/15620)

@gavinchou
Copy link
Copy Markdown
Contributor

/review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

PR approved by anyone and no changes requested.

Copy link
Copy Markdown
Contributor

@gavinchou gavinchou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 issue.

  1. High: cloud/src/recycler/recycler.cpp now skips KV deletion whenever the successor instance key still exists, but I could not find any production path that ever clears successor_instance_id on the deleted source instance after the handoff finishes. SnapshotManager::decouple_instance() only clears source_instance_id / source_snapshot_id on the successor, and the only set_successor_instance_id() usage in this repo is the test setup. That means a deleted source instance will keep returning early here for as long as its successor instance exists, so its meta/versioned KV ranges are never reclaimed. This fixes the immediate data-loss concern by turning it into a permanent recycle leak.

Critical checkpoint conclusions:

  • Goal of the task: protect metadata still needed by a rollback successor. Current code does not achieve that safely end-to-end because it blocks recycler progress on successor existence alone, and there is no test proving eventual cleanup.
  • Small/clear/focused: the patch is small, but the lifecycle condition is incomplete.
  • Concurrency: no new in-memory lock ordering issue found; this path uses txn-kv operations only.
  • Lifecycle management: this is the main problem. The deleted-source/successor lifecycle is not closed because successor_instance_id is never cleared in production code.
  • Configuration: no new config.
  • Compatibility: no FE/BE or storage-format compatibility issue introduced.
  • Parallel code paths: no equivalent recycler path appears to handle this cleanup elsewhere.
  • Conditional checks: the new guard lacks a proven terminal condition; mere successor existence is too broad.
  • Test coverage: insufficient. Existing test coverage only exercises resource-manager refresh behavior, not deleted-instance recycling or eventual cleanup.
  • Observability: logging is acceptable for diagnosing this path.
  • Transaction/persistence: yes, this changes persistent cleanup behavior, and the new condition can leave persistent KV data behind indefinitely.
  • Data write/modification safety: not safe from a cleanup-progress perspective because deleted instances can become unrecyclable forever.
  • FE/BE variable propagation: not applicable.
  • Performance: repeated recycler scans on permanently skipped deleted instances can accumulate avoidable background work.
  • Other issues: none beyond the finding above.

Recommend guarding on the actual remaining dependency, or clearing successor_instance_id when the successor no longer needs the source instance metadata, and adding a recycler test for that lifecycle.

Comment thread cloud/src/recycler/recycler.cpp
@gavinchou gavinchou merged commit 589b6a5 into apache:master Apr 9, 2026
33 of 34 checks passed
@w41ter w41ter deleted the snapshot/fix/avoid_recycle_rollbacked_instance_directly branch April 9, 2026 07:01
github-actions Bot pushed a commit that referenced this pull request Apr 9, 2026
Because the successor instance may still need the data
yiguolei pushed a commit that referenced this pull request Apr 10, 2026
#62234 (#62268)

Cherry-picked from #62234

Co-authored-by: walter <maochuan@selectdb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants