feat: add PaginationList for lazy page fetching in catalog list operations#3454
feat: add PaginationList for lazy page fetching in catalog list operations#3454stark256-spec wants to merge 9 commits into
Conversation
…tions Implements a PaginationList[T] class (pyiceberg/utils/pagination.py) that extends list and lazily fetches subsequent pages from a paginated REST endpoint only as the caller iterates past items already in memory. Design: - First page is pre-loaded on construction; the list is immediately usable for callers that only inspect the first page or stop iteration early. - __iter__ advances through in-memory items and triggers a new HTTP fetch only when the iterator would otherwise block on an empty buffer. - Operations requiring the full result set (__len__, __contains__, __eq__, slicing, repr, negative indexing) call _fetch_all() before delegating to the underlying list implementation. - Positive-index __getitem__ calls _fetch_through_index() to fetch only as many pages as needed to satisfy the access. - Subclasses list[T], so isinstance(result, list) remains True and all existing call sites that iterate, compare, or extend the return value continue to work without changes. REST catalog integration: - list_tables, list_views, and list_namespaces in RestCatalog now return a PaginationList instead of eagerly collecting all pages in a while loop. - Each method defines a _fetch_page closure capturing the session, URL and params, matching the (items, next_token | None) callback contract. Tests: - tests/utils/test_pagination.py: 14 unit tests covering single-page and multi-page scenarios, lazy iteration, partial consumption, __len__, __contains__, __eq__, slicing, negative indexing, and repr. - tests/catalog/test_rest.py: adds test_list_tables_returns_pagination_list which verifies that iterating only the first two items does not trigger the second HTTP request (call_count == 1), and that consuming all items triggers exactly one additional request (call_count == 2). Closes apache#3365
- Change PaginationList base from 'list, Generic[T]' to 'list[T]' to satisfy mypy type-arg requirement - Replace 'int' with 'SupportsIndex' in __getitem__ overloads to match the supertype signature expected by mypy - Remove now-unnecessary '# type: ignore[return-value]' comment - Add docstrings to all public methods (ruff D requirement) - Add return type annotation to _make_fetch in test file - Use PaginationList[int] with type parameter in test helper signature - Import Callable from collections.abc in test file
|
Thanks for doing this! This was my idea and I'm super excited to have this in here. Can you run |
|
Is there any way you can do some quick performance tests? I'm particularly interested in len() - what does it look like when you only have one page of results versus multiple pages? (People call len() on arrays all the time. I want to make sure we don't have a performance regression in the normal case where you don't have other pages) |
Ruff fixes: - Remove unused 'import pytest' (F401) - Remove extraneous f-prefix from plain string 'tok1' (F541) Performance tests (requested by reviewer): - test_len_single_page_makes_no_extra_fetches: verifies that len() on a single-page PaginationList does not invoke the fetch callback at all - test_len_multi_page_fetches_all_pages_once: verifies that len() fetches remaining pages exactly once and caches the result for subsequent calls
|
Fixed the two ruff errors (
The |
D418 prohibits docstrings on @overload stubs; suppress D105 (missing docstring in magic method) with inline noqa comments instead.
… conflict pydocstyle D105 requires docstrings on magic methods but D418 prohibits docstrings on @overload stubs — these rules are irreconcilable for @overload-decorated __getitem__. Since PaginationList is the only file in the project using @overload, there is no established project pattern. Drop the two stubs and keep only the concrete implementation with its existing docstring. The union signature 'SupportsIndex | slice -> T | list[T]' is correct; callers lose the narrowed return type but the runtime behaviour is unchanged. Also removes the now-unused 'overload' import.
…erload stubs D105 (missing docstring in magic method) and D418 (@overload stubs must not have docstrings) are mutually exclusive for @overload-decorated __getitem__. Removing the stubs silenced pydocstyle but caused 3 mypy [override] errors because list[T]'s __getitem__ is itself overloaded and the single-signature Union form is incompatible with the supertype. Fix: add D105 and D418 to the pydocstyle --ignore list, consistent with D107 (missing __init__ docstring) which is already suppressed. Restore the @overload stubs so mypy correctly narrows SupportsIndex -> T and slice -> list[T]. Verified: mypy passes, ruff passes, pydocstyle passes.
…t assertions RestCatalog.__init__ calls v1/config which increments rest_mock.call_count. Capture the baseline after construction and assert relative to it.
|
@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts? |
Thanks for flagging this. PR #3416 was opened first and it overlaps with #3454 on the REST lazy pagination work for #3365. The main difference is approach:
I’m happy to coordinate and avoid duplicate work. If maintainers prefer the backward-compatible PaginationList approach, I can narrow or rework #3416 accordingly. If the iterator-based API is preferred, then I think #3416 remains the better home for the change. Please let me know which direction you’d like us to converge on. |
|
Thanks for the coordination note. Happy to defer to the maintainers on which approach lands. To summarise the trade-off for the review: #3416 (iterator semantics): Changes return types to #3454 (PaginationList): Keeps If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece. Whatever the decision, closing is fine — the important thing is that #3365 gets resolved. |
Thanks Yuya for the coordination note, and thanks @stark256-spec for laying out the trade-off clearly. From my side, the main intent of #3416 was to address #3365 by making pagination explicit through iterator semantics across catalog list operations. I agree this is a larger API change than returning a list-like wrapper, but I think the benefit is that it makes the lazy behavior clearer to callers and avoids pretending the full result set is already materialized. That said, I also understand the compatibility advantage of #3454. If maintainers prefer preserving the current So I’m flexible on the final direction, but my preference is still the explicit iterator approach unless backward compatibility is the deciding factor. Either way, I agree we should avoid duplicate implementations and converge on one solution for #3365. |
|
I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly. |
Normally I'd agree, but since the iterator-based API is just much more pythonic, and the old API was designed absent any consideration for paging, why not just bite the bullet and do a major version bump? Less complexity to carry around in the implementation and it drives the project toward a cleaner API. Especially these days, the worst case is somebody's LLM spends a few tokens updating for iterator semantics. Is it really that bad? But also not a maintainer, I just think it's silly it has taken so long to land proper pagination support in pyiceberg because of compatibility issues in the age of coding agents. |
| T = TypeVar("T") | ||
|
|
||
|
|
||
| class PaginationList(list[T]): |
There was a problem hiding this comment.
Can we move this type to typedef?
| while True: | ||
| if page_token: | ||
| params["pageToken"] = page_token | ||
| def _fetch_page(page_token: str) -> tuple[list[Identifier], str | None]: |
There was a problem hiding this comment.
I see theis tuple[list[T], str | None] coming back multiple times, should we introduce a TypeAliasType to improve readability?
| return pl, all_items | ||
|
|
||
|
|
||
| class TestPaginationListSinglePage: |
There was a problem hiding this comment.
We don't use classes in the tests for the rest, and I think it would be good to keep that consistent
| args: | ||
| [ | ||
| "--ignore=D100,D102,D101,D103,D104,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417", | ||
| "--ignore=D100,D102,D101,D103,D104,D105,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417,D418", |
Problem
list_tables,list_views, andlist_namespacesin the REST catalog eagerly collect every page before returning, even if the caller only needs the first few results. In namespaces with thousands of tables this creates unnecessary network round-trips and latency before the first result is visible.Closes #3365
Solution
Adds
PaginationList[T](pyiceberg/utils/pagination.py) — alistsubclass that pre-loads the first page and lazily fetches subsequent pages only as the caller iterates past items already in memory.Design
for item in resultresult[0]/result[2]result[-1]/result[1:3]/len(result)/x in result/result == otherisinstance(result, list)True— full backward compatibilityKey properties
PaginationListsubclasseslist, so all existing call sites that iterate, compare, or extend the return value continue to work without modification.Changes
pyiceberg/utils/pagination.py— newPaginationList[T]classpyiceberg/catalog/rest/__init__.py—list_tables,list_views,list_namespacesrefactored to returnPaginationListtests/utils/test_pagination.py— 14 unit tests for allPaginationListoperationstests/catalog/test_rest.py—test_list_tables_returns_pagination_listverifies lazy behaviour (call count stays at 1 while iterating within the first page, rises to 2 only after crossing the page boundary)