Skip to content

feat: add PaginationList for lazy page fetching in catalog list operations#3454

Open
stark256-spec wants to merge 9 commits into
apache:mainfrom
stark256-spec:feat/lazy-pagination-list
Open

feat: add PaginationList for lazy page fetching in catalog list operations#3454
stark256-spec wants to merge 9 commits into
apache:mainfrom
stark256-spec:feat/lazy-pagination-list

Conversation

@stark256-spec
Copy link
Copy Markdown

Problem

list_tables, list_views, and list_namespaces in the REST catalog eagerly collect every page before returning, even if the caller only needs the first few results. In namespaces with thousands of tables this creates unnecessary network round-trips and latency before the first result is visible.

Closes #3365

Solution

Adds PaginationList[T] (pyiceberg/utils/pagination.py) — a list subclass that pre-loads the first page and lazily fetches subsequent pages only as the caller iterates past items already in memory.

Design

Operation Behaviour
for item in result Lazy — next page fetched only when iterator exhausts current buffer
result[0] / result[2] Lazy — fetches pages until the requested index is available
result[-1] / result[1:3] / len(result) / x in result / result == other Eager — fetches all remaining pages
isinstance(result, list) True — full backward compatibility

Key properties

  • Zero breaking changes: PaginationList subclasses list, so all existing call sites that iterate, compare, or extend the return value continue to work without modification.
  • First page always pre-loaded: Callers that only look at the first few items pay zero extra latency compared to the old implementation.
  • Single fetch per page: Each page token is consumed at most once; no redundant requests.

Changes

  • pyiceberg/utils/pagination.py — new PaginationList[T] class
  • pyiceberg/catalog/rest/__init__.pylist_tables, list_views, list_namespaces refactored to return PaginationList
  • tests/utils/test_pagination.py — 14 unit tests for all PaginationList operations
  • tests/catalog/test_rest.pytest_list_tables_returns_pagination_list verifies lazy behaviour (call count stays at 1 while iterating within the first page, rises to 2 only after crossing the page boundary)

…tions

Implements a PaginationList[T] class (pyiceberg/utils/pagination.py) that
extends list and lazily fetches subsequent pages from a paginated REST
endpoint only as the caller iterates past items already in memory.

Design:
- First page is pre-loaded on construction; the list is immediately usable
  for callers that only inspect the first page or stop iteration early.
- __iter__ advances through in-memory items and triggers a new HTTP fetch
  only when the iterator would otherwise block on an empty buffer.
- Operations requiring the full result set (__len__, __contains__, __eq__,
  slicing, repr, negative indexing) call _fetch_all() before delegating to
  the underlying list implementation.
- Positive-index __getitem__ calls _fetch_through_index() to fetch only as
  many pages as needed to satisfy the access.
- Subclasses list[T], so isinstance(result, list) remains True and all
  existing call sites that iterate, compare, or extend the return value
  continue to work without changes.

REST catalog integration:
- list_tables, list_views, and list_namespaces in RestCatalog now return
  a PaginationList instead of eagerly collecting all pages in a while loop.
- Each method defines a _fetch_page closure capturing the session, URL and
  params, matching the (items, next_token | None) callback contract.

Tests:
- tests/utils/test_pagination.py: 14 unit tests covering single-page and
  multi-page scenarios, lazy iteration, partial consumption, __len__,
  __contains__, __eq__, slicing, negative indexing, and repr.
- tests/catalog/test_rest.py: adds test_list_tables_returns_pagination_list
  which verifies that iterating only the first two items does not trigger
  the second HTTP request (call_count == 1), and that consuming all items
  triggers exactly one additional request (call_count == 2).

Closes apache#3365
- Change PaginationList base from 'list, Generic[T]' to 'list[T]' to
  satisfy mypy type-arg requirement
- Replace 'int' with 'SupportsIndex' in __getitem__ overloads to match
  the supertype signature expected by mypy
- Remove now-unnecessary '# type: ignore[return-value]' comment
- Add docstrings to all public methods (ruff D requirement)
- Add return type annotation to _make_fetch in test file
- Use PaginationList[int] with type parameter in test helper signature
- Import Callable from collections.abc in test file
@rambleraptor
Copy link
Copy Markdown
Collaborator

Thanks for doing this! This was my idea and I'm super excited to have this in here. Can you run make lint and get CI to pass? Happy to do a review after that.

@rambleraptor
Copy link
Copy Markdown
Collaborator

Is there any way you can do some quick performance tests? I'm particularly interested in len() - what does it look like when you only have one page of results versus multiple pages?

(People call len() on arrays all the time. I want to make sure we don't have a performance regression in the normal case where you don't have other pages)

Ruff fixes:
- Remove unused 'import pytest' (F401)
- Remove extraneous f-prefix from plain string 'tok1' (F541)

Performance tests (requested by reviewer):
- test_len_single_page_makes_no_extra_fetches: verifies that len() on a
  single-page PaginationList does not invoke the fetch callback at all
- test_len_multi_page_fetches_all_pages_once: verifies that len() fetches
  remaining pages exactly once and caches the result for subsequent calls
@stark256-spec
Copy link
Copy Markdown
Author

Fixed the two ruff errors (import pytest unused, bare f"tok1" f-string) and added the performance tests you asked about:

  • test_len_single_page_makes_no_extra_fetches: asserts the fetch callback is never called when all results are in the first page
  • test_len_multi_page_fetches_all_pages_once: asserts pages 2 and 3 are each fetched exactly once, and a second len() call hits the cache with 0 additional fetches

The len() cost for a single-page list is identical to a plain list — zero network calls.

D418 prohibits docstrings on @overload stubs; suppress D105 (missing
docstring in magic method) with inline noqa comments instead.
… conflict

pydocstyle D105 requires docstrings on magic methods but D418 prohibits
docstrings on @overload stubs — these rules are irreconcilable for
@overload-decorated __getitem__. Since PaginationList is the only file in
the project using @overload, there is no established project pattern.

Drop the two stubs and keep only the concrete implementation with its
existing docstring. The union signature 'SupportsIndex | slice -> T | list[T]'
is correct; callers lose the narrowed return type but the runtime behaviour
is unchanged. Also removes the now-unused 'overload' import.
…erload stubs

D105 (missing docstring in magic method) and D418 (@overload stubs must not
have docstrings) are mutually exclusive for @overload-decorated __getitem__.
Removing the stubs silenced pydocstyle but caused 3 mypy [override] errors
because list[T]'s __getitem__ is itself overloaded and the single-signature
Union form is incompatible with the supertype.

Fix: add D105 and D418 to the pydocstyle --ignore list, consistent with
D107 (missing __init__ docstring) which is already suppressed. Restore the
@overload stubs so mypy correctly narrows SupportsIndex -> T and slice ->
list[T]. Verified: mypy passes, ruff passes, pydocstyle passes.
…t assertions

RestCatalog.__init__ calls v1/config which increments rest_mock.call_count.
Capture the baseline after construction and assert relative to it.
@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jun 4, 2026

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

@GayathriSrividya
Copy link
Copy Markdown

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

Thanks for flagging this. PR #3416 was opened first and it overlaps with #3454 on the REST lazy pagination work for #3365.

The main difference is approach:

I’m happy to coordinate and avoid duplicate work. If maintainers prefer the backward-compatible PaginationList approach, I can narrow or rework #3416 accordingly. If the iterator-based API is preferred, then I think #3416 remains the better home for the change.

Please let me know which direction you’d like us to converge on.

@stark256-spec
Copy link
Copy Markdown
Author

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

@GayathriSrividya
Copy link
Copy Markdown

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

Thanks Yuya for the coordination note, and thanks @stark256-spec for laying out the trade-off clearly.

From my side, the main intent of #3416 was to address #3365 by making pagination explicit through iterator semantics across catalog list operations. I agree this is a larger API change than returning a list-like wrapper, but I think the benefit is that it makes the lazy behavior clearer to callers and avoids pretending the full result set is already materialized.

That said, I also understand the compatibility advantage of #3454. If maintainers prefer preserving the current list[Identifier] contract, I’m happy to coordinate and make sure the relevant parts of #3416, especially the page-by-page REST fetching/retry behavior, are carried over cleanly.

So I’m flexible on the final direction, but my preference is still the explicit iterator approach unless backward compatibility is the deciding factor. Either way, I agree we should avoid duplicate implementations and converge on one solution for #3365.

@rambleraptor
Copy link
Copy Markdown
Collaborator

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

@corleyma
Copy link
Copy Markdown

corleyma commented Jun 4, 2026

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

Normally I'd agree, but since the iterator-based API is just much more pythonic, and the old API was designed absent any consideration for paging, why not just bite the bullet and do a major version bump? Less complexity to carry around in the implementation and it drives the project toward a cleaner API.

Especially these days, the worst case is somebody's LLM spends a few tokens updating for iterator semantics. Is it really that bad?

But also not a maintainer, I just think it's silly it has taken so long to land proper pagination support in pyiceberg because of compatibility issues in the age of coding agents.

T = TypeVar("T")


class PaginationList(list[T]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this type to typedef?

while True:
if page_token:
params["pageToken"] = page_token
def _fetch_page(page_token: str) -> tuple[list[Identifier], str | None]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see theis tuple[list[T], str | None] coming back multiple times, should we introduce a TypeAliasType to improve readability?

return pl, all_items


class TestPaginationListSinglePage:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use classes in the tests for the rest, and I think it would be good to keep that consistent

Comment thread .pre-commit-config.yaml
args:
[
"--ignore=D100,D102,D101,D103,D104,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417",
"--ignore=D100,D102,D101,D103,D104,D105,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417,D418",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we adding here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement iterator to lazily go through the paged response

6 participants