Skip to content

fix: compare timestamps for partitions metadata last_updated_at#3603

Open
anxkhn wants to merge 1 commit into
apache:mainfrom
anxkhn:patch-11
Open

fix: compare timestamps for partitions metadata last_updated_at#3603
anxkhn wants to merge 1 commit into
apache:mainfrom
anxkhn:patch-11

Conversation

@anxkhn

@anxkhn anxkhn commented Jul 5, 2026

Copy link
Copy Markdown
# Rationale for this change

`Table.inspect.partitions()` reports the wrong `last_updated_at` and
`last_updated_snapshot_id` for any partition that has been touched by more than
one snapshot (for example repeated inserts into the same partition value).

`InspectTable._update_partitions_map_from_manifest_entry` aggregates per-partition
stats across manifest entries. Those entries are iterated in manifest order, which
is not chronological, so the aggregation has to keep the entry with the most recent
commit timestamp per partition. The overwrite guard did:

```python
if partition_row["last_updated_at"] is None or partition_row["last_updated_snapshot_id"] < snapshot.timestamp_ms:

The second clause compares the stored last_updated_snapshot_id (a snapshot id)
against snapshot.timestamp_ms (a commit timestamp) - two unrelated quantities.
Snapshot ids are large positive 63-bit integers (order 1e18), while millisecond
timestamps are order 1e12, so id < timestamp_ms is effectively always false. The
row therefore keeps whichever manifest entry happened to be visited first rather
than the newest snapshot, so the two last_updated_* columns can point at an older
snapshot.

This mirrors the Java implementation, where
PartitionsTable.Partition.update decides the last-updated snapshot by comparing
commit timestamps (this.lastUpdatedAt == null || snapshotCommitTime > this.lastUpdatedAt).

The fix compares the stored timestamp against the incoming timestamp:

if partition_row["last_updated_at"] is None or partition_row["last_updated_at"] < snapshot.timestamp_ms:

This only affects the read-only partitions metadata table output; it does not
change any table data or metadata on disk.

Are these changes tested?

Yes. A new parametrized unit test in tests/table/test_inspect.py
(test_partitions_last_updated_uses_latest_snapshot_regardless_of_order) calls
_update_partitions_map_from_manifest_entry directly for the same partition with
an older and a newer snapshot, in both visitation orders, and asserts that
last_updated_at / last_updated_snapshot_id reflect the newer snapshot in both
cases. It fails on the previous code (the older-first order keeps
last_updated_at = 1000 instead of 5000) and passes with the fix.

make lint (ruff, ruff-format, mypy) and pytest tests/table/test_inspect.py
both pass. The Spark-parity coverage in
tests/integration/test_inspect_table.py::test_inspect_partitions_partitioned
exercises this column against Spark but requires Docker + Spark, so the added
unit test provides a deterministic, order-independent regression check.

Are there any user-facing changes?

Yes. Table.inspect.partitions() now reports the correct last_updated_at and
last_updated_snapshot_id for partitions modified by multiple snapshots. There
are no API or schema changes.


## Notes for opening the PR
- Squash-merge repo; the PR title above is the commit subject. Keep the `fix:`
  prefix (history uses `fix:` / `feat:` / `docs:`).
- No tracking issue exists, so no `Closes #`. Optionally open a short tracking
  issue first (repo norm favors it) and add `Closes #<n>`; not required.
- The `changelog` label likely warrants setting for a user-visible correctness
  fix, but an external contributor cannot set labels - leave a one-line maintainer
  comment noting it if desired.
- Before opening: rebase onto current `origin/main` and re-run `make lint`
  (verified today the branch base is the tip of `origin/main`, 0 behind, but this
  repo moves fast).

## Duplicate check (fresh, at draft time)
- `gh pr list -R apache/iceberg-python` open PRs touching inspect: #3457, #3555
  (docstring-only), #3454 (pagination), #3131 (replace API), #3320 (commit retry).
  None touch the `last_updated_at` / `last_updated_snapshot_id` comparison.
- `gh pr/issue list --search "last_updated partitions" --state all`: no results.
  No competing PR, no tracking issue.

`InspectTable._update_partitions_map_from_manifest_entry` aggregates
per-partition stats over manifest entries, which are iterated in manifest
order rather than chronologically. The guard that keeps the most recently
committed snapshot per partition compared the stored `last_updated_snapshot_id`
(a snapshot id) against `snapshot.timestamp_ms` (a commit timestamp). Snapshot
ids are large positive 63-bit integers, so `id < timestamp_ms` is effectively
always false and the row keeps whichever entry was visited first instead of the
newest one, reporting the wrong `last_updated_at` / `last_updated_snapshot_id`
for any partition touched by more than one snapshot.

Compare the stored `last_updated_at` timestamp against the incoming
`snapshot.timestamp_ms`, matching the Java `PartitionsTable.Partition.update`
which uses `snapshotCommitTime > this.lastUpdatedAt`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant