fix: prevent replicas from restoring old-timeline WAL segments #10394

IgorOhrimenko wants to merge 1 commit into cloudnative-pg:main
Conversation
❗ By default, the pull request is configured to backport to all release branches.
After a switchover or failover, the WAL archive still contains segments from previous timelines. When a replica restarts with existing PVC data, its `restore_command` can fetch these old-timeline WAL segments from the archive, causing the replica's timeline history to diverge from the current primary. This results in either:

- CrashLoopBackOff: "requested timeline N is not a child of this server's history"
- Replica stuck in "Standby (file based)" mode, unable to stream

The existing `validateTimelineHistoryFile` only checks `.history` files. This commit adds `validateWALSegmentTimeline`, which also rejects regular WAL segments whose timeline is older than the cluster's current timeline, for replicas in established clusters. The check is skipped when `CurrentPrimary` is not set (during bootstrap or PITR recovery) to allow fetching WAL from any timeline.

Closes: cloudnative-pg#4990
Signed-off-by: Igor Ohrimenko <igor.ohrimenko@travelata.ru>
Hi @IgorOhrimenko. What if I request a particular timeline for the recovery? I believe that we should let Postgres handle that process, not CNPG. What are your thoughts?
Hi @gbartolini, thanks for raising this important architectural concern.
The reproduction script in this PR shows the issue occurs consistently. We've already patched our production clusters with this fix (building custom binaries/Docker images), and it resolves the problem immediately. Should we open a PostgreSQL issue to track this?
After a switchover or failover, the WAL archive (S3/object storage) still contains segments from previous timelines. When a replica restarts with existing PVC data (e.g. after a rolling restart triggered by a parameter change), its `restore_command` can fetch old-timeline WAL segments from the archive before streaming replication reconnects to the new primary. This causes the replica's timeline history to diverge from the current primary, resulting in either:

- CrashLoopBackOff: `requested timeline N is not a child of this server's history`
- Stuck in "Standby (file based)" mode
Root cause
`validateTimelineHistoryFile()` in `walrestore/cmd.go` validates `.history` files but does not check regular WAL segments. A replica can successfully download `000000010000000000000048` (timeline 1) from the archive when the primary is already on timeline 3.

Fix
Added `validateWALSegmentTimeline()`, which rejects WAL segments whose timeline is older than the cluster's current timeline, for replica instances in established clusters. This forces PostgreSQL to fall back to streaming replication from the current primary instead of consuming stale WAL from the archive.

The check is skipped when `CurrentPrimary` is not set (during bootstrap or PITR recovery) to allow fetching WAL from any timeline.

Reproduction
Minimal reproducer using kind + MinIO: reproduce.sh

- Modify `shared_buffers` to trigger a rolling restart with switchover
- `restore_command` fetches old-timeline WAL from MinIO

In the provided test environment (kind, single-node, local MinIO, synchronous replication), the bug reproduces consistently. In multi-node production clusters the issue is intermittent, depending on the timing between switchover completion and replica reconnection.
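When reproducing, it helps to confirm which timeline a fetched segment belongs to. Regular WAL segment names are 24 hex digits, and the first 8 encode the timeline ID, so the timeline can be decoded directly in the shell (the segment name below is the one quoted in this report):

```shell
# Decode the timeline ID from a WAL segment name.
# Names are 24 hex digits: TTTTTTTT LLLLLLLL SSSSSSSS,
# where the first 8 digits are the timeline ID.
seg=000000010000000000000048
tli=$((16#${seg:0:8}))
echo "segment $seg belongs to timeline $tli"   # timeline 1
```

Comparing this value against the primary's current timeline (e.g. as printed by `pg_controldata` under "Latest checkpoint's TimeLineID") shows whether a segment in the archive predates the switchover.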
Without the fix: the replica enters CrashLoopBackOff or "Standby (file based)".

With the fix: the old-timeline WAL is rejected with a warning log, and the replica falls back to streaming replication.
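The rejection rule described above can be sketched as a standalone program. This is an illustrative sketch, not the actual CNPG code: the helper names `segmentTimeline` and `isStaleSegment` are hypothetical, and only the WAL naming convention (first 8 hex digits of a segment name encode the timeline ID) is taken as given.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// walSegmentRe matches a regular WAL segment name: 24 hex characters,
// the first 8 of which encode the timeline ID. History files
// (e.g. "00000002.history") do not match.
var walSegmentRe = regexp.MustCompile(`^([0-9A-F]{8})[0-9A-F]{16}$`)

// segmentTimeline extracts the timeline ID from a WAL segment name.
// The second return value is false when the name is not a regular
// WAL segment.
func segmentTimeline(name string) (uint32, bool) {
	m := walSegmentRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	tli, err := strconv.ParseUint(m[1], 16, 32)
	if err != nil {
		return 0, false
	}
	return uint32(tli), true
}

// isStaleSegment sketches the rejection rule: a replica in an
// established cluster refuses segments from timelines older than the
// cluster's current timeline; when no current primary is known
// (bootstrap or PITR recovery), any timeline is allowed.
func isStaleSegment(name string, currentTimeline uint32, hasCurrentPrimary bool) bool {
	if !hasCurrentPrimary {
		return false // bootstrap or PITR: allow any timeline
	}
	tli, ok := segmentTimeline(name)
	if !ok {
		return false // not a regular WAL segment; handled elsewhere
	}
	return tli < currentTimeline
}

func main() {
	fmt.Println(isStaleSegment("000000010000000000000048", 3, true)) // true: timeline 1 < 3
	fmt.Println(isStaleSegment("000000030000000000000048", 3, true)) // false: current timeline
	fmt.Println(isStaleSegment("00000002.history", 3, true))         // false: not a segment
}
```

Note that the comparison is strict (`<`), so segments on the current timeline are still served from the archive; only older-timeline segments are refused, which is what pushes PostgreSQL toward streaming from the new primary.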
Testing
- Unit tests for `validateWALSegmentTimeline` covering all branches

Closes #4990
Related: #4188, #3344