fix: compare timestamps for partitions metadata last_updated_at#3603
Open
anxkhn wants to merge 1 commit into
Open
fix: compare timestamps for partitions metadata last_updated_at#3603anxkhn wants to merge 1 commit into
anxkhn wants to merge 1 commit into
Conversation
`InspectTable._update_partitions_map_from_manifest_entry` aggregates per-partition stats over manifest entries, which are iterated in manifest order rather than chronologically. The guard that keeps the most recently committed snapshot per partition compared the stored `last_updated_snapshot_id` (a snapshot id) against `snapshot.timestamp_ms` (a commit timestamp). Snapshot ids are large positive 63-bit integers, so `id < timestamp_ms` is effectively always false and the row keeps whichever entry was visited first instead of the newest one, reporting the wrong `last_updated_at` / `last_updated_snapshot_id` for any partition touched by more than one snapshot. Compare the stored `last_updated_at` timestamp against the incoming `snapshot.timestamp_ms`, matching the Java `PartitionsTable.Partition.update` which uses `snapshotCommitTime > this.lastUpdatedAt`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The second clause compares the stored
last_updated_snapshot_id(a snapshot id)against
snapshot.timestamp_ms(a commit timestamp) - two unrelated quantities.Snapshot ids are large positive 63-bit integers (order 1e18), while millisecond
timestamps are order 1e12, so
id < timestamp_msis effectively always false. Therow therefore keeps whichever manifest entry happened to be visited first rather
than the newest snapshot, so the two
last_updated_*columns can point at an oldersnapshot.
This mirrors the Java implementation, where
PartitionsTable.Partition.updatedecides the last-updated snapshot by comparingcommit timestamps (
this.lastUpdatedAt == null || snapshotCommitTime > this.lastUpdatedAt).The fix compares the stored timestamp against the incoming timestamp:
This only affects the read-only
partitionsmetadata table output; it does notchange any table data or metadata on disk.
Are these changes tested?
Yes. A new parametrized unit test in
tests/table/test_inspect.py(
test_partitions_last_updated_uses_latest_snapshot_regardless_of_order) calls_update_partitions_map_from_manifest_entrydirectly for the same partition withan older and a newer snapshot, in both visitation orders, and asserts that
last_updated_at/last_updated_snapshot_idreflect the newer snapshot in bothcases. It fails on the previous code (the older-first order keeps
last_updated_at = 1000instead of5000) and passes with the fix.make lint(ruff, ruff-format, mypy) andpytest tests/table/test_inspect.pyboth pass. The Spark-parity coverage in
tests/integration/test_inspect_table.py::test_inspect_partitions_partitionedexercises this column against Spark but requires Docker + Spark, so the added
unit test provides a deterministic, order-independent regression check.
Are there any user-facing changes?
Yes.
Table.inspect.partitions()now reports the correctlast_updated_atandlast_updated_snapshot_idfor partitions modified by multiple snapshots. Thereare no API or schema changes.