
fix: prevent replicas from restoring old-timeline WAL segments#10394

Open
IgorOhrimenko wants to merge 1 commit into cloudnative-pg:main from IgorOhrimenko:fix/validate-wal-segment-timeline

Conversation


@IgorOhrimenko IgorOhrimenko commented Mar 31, 2026

After a switchover or failover, the WAL archive (S3/object storage) still contains segments from previous timelines. When a replica restarts with existing PVC data (e.g. after a rolling restart triggered by parameter change), its restore_command can fetch old-timeline WAL segments from the archive before streaming replication reconnects to the new primary.

This causes the replica's timeline history to diverge from the current primary, resulting in either:

CrashLoopBackOff:

LOG: restored log file "00000002.history" from archive
LOG: restored log file "00000003.history" from archive
LOG: restored log file "00000001000000000000004B" from archive   ← timeline 1!
LOG: restored log file "000000010000000000000048" from archive   ← timeline 1!
LOG: entering standby mode
FATAL: requested timeline 3 is not a child of this server's history
LOG: startup process (PID 34) exited with exit code 1
LOG: database system is shut down

Stuck in "Standby (file based)" mode:

LOG: restored log file "0000000100000032000000E8" from archive   ← timeline 1!
LOG: invalid record length at 32/E80000A0: expected at least 24, got 0
LOG: fetching timeline history file for timeline 2 from primary server
FATAL: could not receive timeline history file from the primary server: ERROR:
  could not open file "pg_wal/00000002.history": No such file or directory
LOG: waiting for WAL to become available at 32/E80000B8

Root cause

validateTimelineHistoryFile() in walrestore/cmd.go validates .history files but does not check regular WAL segments. A replica can successfully download 000000010000000000000048 (timeline 1) from the archive when the primary is already on timeline 3.

Fix

Added validateWALSegmentTimeline() which rejects WAL segments whose timeline is older than the cluster's current timeline, for replica instances in established clusters. This forces PostgreSQL to fall back to streaming replication from the current primary instead of consuming stale WAL from the archive.

The check is skipped when CurrentPrimary is not set (during bootstrap or PITR recovery) to allow fetching WAL from any timeline.

Reproduction

Minimal reproducer using kind + MinIO: reproduce.sh

  1. Create a 3-instance cluster with synchronous replication and WAL archiving to MinIO
  2. Write data, take backup, write more data
  3. Change shared_buffers to trigger a rolling restart with switchover
  4. Replica restarts with existing PVC, restore_command fetches old-timeline WAL from MinIO
  5. Replica crashes or gets stuck in file-based mode

In the provided test environment (kind, single-node, local MinIO, synchronous replication), the bug reproduces consistently. In multi-node production clusters the issue is intermittent, depending on the timing between switchover completion and replica reconnection.

Without the fix: replica enters CrashLoopBackOff or Standby (file based).
With the fix: old WAL is rejected with warning log, replica uses streaming replication:

WARNING: Refusing to restore old-timeline WAL segment for replica
  walName=000000010000000000000048 walTimeline=1 clusterTimeline=3

Testing

  • Unit tests added for validateWALSegmentTimeline covering all branches
  • Verified recovery from S3 backup works correctly (PITR not broken)
  • Verified rolling restarts with switchover work without file-based or CrashLoop

Closes #4990
Related: #4188, #3344

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Mar 31, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 labels Mar 31, 2026
@github-actions (Contributor)

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label backport-requested ◀️, or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 Something isn't working label Mar 31, 2026
@IgorOhrimenko IgorOhrimenko marked this pull request as draft March 31, 2026 15:52
@IgorOhrimenko IgorOhrimenko force-pushed the fix/validate-wal-segment-timeline branch 2 times, most recently from ff838b3 to 7c37404 Compare March 31, 2026 16:17
@IgorOhrimenko IgorOhrimenko marked this pull request as ready for review March 31, 2026 16:27
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Mar 31, 2026
After a switchover or failover, the WAL archive still contains segments
from previous timelines. When a replica restarts with existing PVC data,
its restore_command can fetch these old-timeline WAL segments from the
archive, causing the replica's timeline history to diverge from the
current primary.

This results in either:
- CrashLoopBackOff: "requested timeline N is not a child of this
  server's history"
- Replica stuck in "Standby (file based)" mode, unable to stream

The existing validateTimelineHistoryFile only checks .history files.
This commit adds validateWALSegmentTimeline which also rejects regular
WAL segments whose timeline is older than the cluster's current
timeline, for replicas in established clusters.

The check is skipped when CurrentPrimary is not set (during bootstrap
or PITR recovery) to allow fetching WAL from any timeline.

Closes: cloudnative-pg#4990
Signed-off-by: Igor Ohrimenko <igor.ohrimenko@travelata.ru>
@IgorOhrimenko IgorOhrimenko force-pushed the fix/validate-wal-segment-timeline branch from a9319d5 to e3eef93 Compare April 1, 2026 09:32
@gbartolini (Contributor)

Hi @IgorOhrimenko. What if I request a particular timeline for the recovery? I believe that we should let Postgres handle that process, not CNPG. What are your thoughts?


IgorOhrimenko commented Apr 3, 2026

Hi @gbartolini, thanks for raising this important architectural concern.
You're absolutely right that timeline management should ideally be handled by PostgreSQL itself, not by CNPG. The restore_command should be a simple fetcher without business logic.
However, this specific issue has been blocking CNPG adoption in production for years. Many teams hesitate to deploy CNPG because replicas can crash-loop or get stuck in "standby (file based)" mode after switchovers. The problem is that PostgreSQL's recovery algorithm can fetch old-timeline WAL segments before reading the .history files that would tell it those segments are incompatible.
Our fix in CNPG is a pragmatic workaround:

  • It only applies to replicas in established clusters (not during bootstrap/PITR, not for primaries)
  • It returns ErrWALNotFound to force fallback to streaming replication
  • It prevents data divergence and crash loops that PostgreSQL would otherwise hit later

The reproduction script in this PR shows the issue occurs consistently. We've already patched our production clusters with this fix (built custom binaries/Docker images), and it solves the problem immediately.
Long-term, PostgreSQL should indeed handle this better—perhaps by checking timeline compatibility earlier or prioritizing streaming replication more aggressively after failovers. But getting changes into PostgreSQL core takes years. Meanwhile, operators need stable clusters today.
If we merge this fix, we could add a code comment like:

// TODO: Remove this validation when PostgreSQL handles old-timeline
// WAL rejection during replica recovery (see PostgreSQL issue #XYZ).

Should we open a PostgreSQL issue to track this?



Development

Successfully merging this pull request may close these issues.

[Bug]: Replica does not come up: checkpoint not found in timeline

4 participants