feat: add PaginationList for lazy page fetching in catalog list operations by stark256-spec · Pull Request #3454 · apache/iceberg-python

stark256-spec · 2026-06-03T18:43:57Z

Problem

list_tables, list_views, and list_namespaces in the REST catalog eagerly collect every page before returning, even if the caller only needs the first few results. In namespaces with thousands of tables this creates unnecessary network round-trips and latency before the first result is visible.

Closes #3365

Solution

Adds PaginationList[T] (pyiceberg/utils/pagination.py) — a list subclass that pre-loads the first page and lazily fetches subsequent pages only as the caller iterates past items already in memory.

Design

Operation	Behaviour
`for item in result`	Lazy — next page fetched only when iterator exhausts current buffer
`result[0]` / `result[2]`	Lazy — fetches pages until the requested index is available
`result[-1]` / `result[1:3]` / `len(result)` / `x in result` / `result == other`	Eager — fetches all remaining pages
`isinstance(result, list)`	`True` — full backward compatibility

Key properties

Zero breaking changes: PaginationList subclasses list, so all existing call sites that iterate, compare, or extend the return value continue to work without modification.
First page always pre-loaded: Callers that only look at the first few items pay zero extra latency compared to the old implementation.
Single fetch per page: Each page token is consumed at most once; no redundant requests.

Changes

pyiceberg/utils/pagination.py — new PaginationList[T] class
pyiceberg/catalog/rest/__init__.py — list_tables, list_views, list_namespaces refactored to return PaginationList
tests/utils/test_pagination.py — 14 unit tests for all PaginationList operations
tests/catalog/test_rest.py — test_list_tables_returns_pagination_list verifies lazy behaviour (call count stays at 1 while iterating within the first page, rises to 2 only after crossing the page boundary)

…tions Implements a PaginationList[T] class (pyiceberg/utils/pagination.py) that extends list and lazily fetches subsequent pages from a paginated REST endpoint only as the caller iterates past items already in memory. Design: - First page is pre-loaded on construction; the list is immediately usable for callers that only inspect the first page or stop iteration early. - __iter__ advances through in-memory items and triggers a new HTTP fetch only when the iterator would otherwise block on an empty buffer. - Operations requiring the full result set (__len__, __contains__, __eq__, slicing, repr, negative indexing) call _fetch_all() before delegating to the underlying list implementation. - Positive-index __getitem__ calls _fetch_through_index() to fetch only as many pages as needed to satisfy the access. - Subclasses list[T], so isinstance(result, list) remains True and all existing call sites that iterate, compare, or extend the return value continue to work without changes. REST catalog integration: - list_tables, list_views, and list_namespaces in RestCatalog now return a PaginationList instead of eagerly collecting all pages in a while loop. - Each method defines a _fetch_page closure capturing the session, URL and params, matching the (items, next_token | None) callback contract. Tests: - tests/utils/test_pagination.py: 14 unit tests covering single-page and multi-page scenarios, lazy iteration, partial consumption, __len__, __contains__, __eq__, slicing, negative indexing, and repr. - tests/catalog/test_rest.py: adds test_list_tables_returns_pagination_list which verifies that iterating only the first two items does not trigger the second HTTP request (call_count == 1), and that consuming all items triggers exactly one additional request (call_count == 2). Closes apache#3365

- Change PaginationList base from 'list, Generic[T]' to 'list[T]' to satisfy mypy type-arg requirement - Replace 'int' with 'SupportsIndex' in __getitem__ overloads to match the supertype signature expected by mypy - Remove now-unnecessary '# type: ignore[return-value]' comment - Add docstrings to all public methods (ruff D requirement) - Add return type annotation to _make_fetch in test file - Use PaginationList[int] with type parameter in test helper signature - Import Callable from collections.abc in test file

rambleraptor · 2026-06-03T20:21:37Z

Thanks for doing this! This was my idea and I'm super excited to have this in here. Can you run make lint and get CI to pass? Happy to do a review after that.

rambleraptor · 2026-06-03T20:23:31Z

Is there any way you can do some quick performance tests? I'm particularly interested in len() - what does it look like when you only have one page of results versus multiple pages?

(People call len() on arrays all the time. I want to make sure we don't have a performance regression in the normal case where you don't have other pages)

Ruff fixes: - Remove unused 'import pytest' (F401) - Remove extraneous f-prefix from plain string 'tok1' (F541) Performance tests (requested by reviewer): - test_len_single_page_makes_no_extra_fetches: verifies that len() on a single-page PaginationList does not invoke the fetch callback at all - test_len_multi_page_fetches_all_pages_once: verifies that len() fetches remaining pages exactly once and caches the result for subsequent calls

stark256-spec · 2026-06-03T21:52:51Z

Fixed the two ruff errors (import pytest unused, bare f"tok1" f-string) and added the performance tests you asked about:

test_len_single_page_makes_no_extra_fetches: asserts the fetch callback is never called when all results are in the first page
test_len_multi_page_fetches_all_pages_once: asserts pages 2 and 3 are each fetched exactly once, and a second len() call hits the cache with 0 additional fetches

The len() cost for a single-page list is identical to a plain list — zero network calls.

…ocstyle D105

@overload

D418 prohibits docstrings on @overload stubs; suppress D105 (missing docstring in magic method) with inline noqa comments instead.

@overload

… conflict pydocstyle D105 requires docstrings on magic methods but D418 prohibits docstrings on @overload stubs — these rules are irreconcilable for @overload-decorated __getitem__. Since PaginationList is the only file in the project using @overload, there is no established project pattern. Drop the two stubs and keep only the concrete implementation with its existing docstring. The union signature 'SupportsIndex | slice -> T | list[T]' is correct; callers lose the narrowed return type but the runtime behaviour is unchanged. Also removes the now-unused 'overload' import.

…erload

…erload stubs D105 (missing docstring in magic method) and D418 (@overload stubs must not have docstrings) are mutually exclusive for @overload-decorated __getitem__. Removing the stubs silenced pydocstyle but caused 3 mypy [override] errors because list[T]'s __getitem__ is itself overloaded and the single-signature Union form is incompatible with the supertype. Fix: add D105 and D418 to the pydocstyle --ignore list, consistent with D107 (missing __init__ docstring) which is already suppressed. Restore the @overload stubs so mypy correctly narrows SupportsIndex -> T and slice -> list[T]. Verified: mypy passes, ruff passes, pydocstyle passes.

…on (isort)

…t assertions RestCatalog.__init__ calls v1/config which increments rest_mock.call_count. Capture the baseline after construction and assert relative to it.

ebyhr · 2026-06-04T03:05:41Z

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

GayathriSrividya · 2026-06-04T04:39:38Z

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

Thanks for flagging this. PR #3416 was opened first and it overlaps with #3454 on the REST lazy pagination work for #3365.

The main difference is approach:

Lazy pagination iterator for catalog list methods #3416 moves the catalog list APIs toward iterator semantics across catalogs.
feat: add PaginationList for lazy page fetching in catalog list operations #3454 keeps list compatibility and adds a lazy PaginationList for REST.

I’m happy to coordinate and avoid duplicate work. If maintainers prefer the backward-compatible PaginationList approach, I can narrow or rework #3416 accordingly. If the iterator-based API is preferred, then I think #3416 remains the better home for the change.

Please let me know which direction you’d like us to converge on.

stark256-spec · 2026-06-04T05:17:04Z

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

GayathriSrividya · 2026-06-04T07:24:23Z

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

Thanks Yuya for the coordination note, and thanks @stark256-spec for laying out the trade-off clearly.

From my side, the main intent of #3416 was to address #3365 by making pagination explicit through iterator semantics across catalog list operations. I agree this is a larger API change than returning a list-like wrapper, but I think the benefit is that it makes the lazy behavior clearer to callers and avoids pretending the full result set is already materialized.

That said, I also understand the compatibility advantage of #3454. If maintainers prefer preserving the current list[Identifier] contract, I’m happy to coordinate and make sure the relevant parts of #3416, especially the page-by-page REST fetching/retry behavior, are carried over cleanly.

So I’m flexible on the final direction, but my preference is still the explicit iterator approach unless backward compatibility is the deciding factor. Either way, I agree we should avoid duplicate implementations and converge on one solution for #3365.

rambleraptor · 2026-06-04T20:14:10Z

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

corleyma · 2026-06-04T22:20:22Z

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

Normally I'd agree, but since the iterator-based API is just much more pythonic, and the old API was designed absent any consideration for paging, why not just bite the bullet and do a major version bump? Less complexity to carry around in the implementation and it drives the project toward a cleaner API.

Especially these days, the worst case is somebody's LLM spends a few tokens updating for iterator semantics. Is it really that bad?

But also not a maintainer, I just think it's silly it has taken so long to land proper pagination support in pyiceberg because of compatibility issues in the age of coding agents.

Fokko · 2026-06-05T05:14:02Z

+T = TypeVar("T")
+
+
+class PaginationList(list[T]):


Can we move this type to typedef?

Fokko · 2026-06-05T05:22:15Z

-        while True:
-            if page_token:
-                params["pageToken"] = page_token
+        def _fetch_page(page_token: str) -> tuple[list[Identifier], str | None]:


I see theis tuple[list[T], str | None] coming back multiple times, should we introduce a TypeAliasType to improve readability?

Fokko · 2026-06-05T05:27:03Z

+    return pl, all_items
+
+
+class TestPaginationListSinglePage:


We don't use classes in the tests for the rest, and I think it would be good to keep that consistent

Fokko · 2026-06-05T05:27:17Z

        args:
          [
-            "--ignore=D100,D102,D101,D103,D104,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417",
+            "--ignore=D100,D102,D101,D103,D104,D105,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417,D418",


What are we adding here?

stark256-spec added 2 commits June 3, 2026 13:43

stark256-spec added 6 commits June 3, 2026 17:50

fix(pagination): add docstrings to __getitem__ overload stubs for pyd…

9fbcdd3

…ocstyle D105

fix(pagination): use noqa:D105 on overload stubs instead of docstrings

10c2ba0

D418 prohibits docstrings on @overload stubs; suppress D105 (missing docstring in magic method) with inline noqa comments instead.

fix(tests): move PaginationList import to correct alphabetical positi…

d71317c

…on (isort)

fix(tests): account for RestCatalog config endpoint call in call_coun…

1780ae6

…t assertions RestCatalog.__init__ calls v1/config which increments rest_mock.call_count. Capture the baseline after construction and assert relative to it.

Fokko reviewed Jun 5, 2026

View reviewed changes

Conversation

stark256-spec commented Jun 3, 2026

Problem

Solution

Design

Key properties

Changes

Uh oh!

rambleraptor commented Jun 3, 2026

Uh oh!

rambleraptor commented Jun 3, 2026

Uh oh!

stark256-spec commented Jun 3, 2026

Uh oh!

ebyhr commented Jun 4, 2026

Uh oh!

GayathriSrividya commented Jun 4, 2026

Uh oh!

stark256-spec commented Jun 4, 2026

Uh oh!

GayathriSrividya commented Jun 4, 2026

Uh oh!

rambleraptor commented Jun 4, 2026

Uh oh!

corleyma commented Jun 4, 2026

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants