Skip to content

fix: tighter NOT-filter approximation for deep pagination#2812

Open
alangmartini wants to merge 5 commits intotypesense:v31from
alangmartini:fix/2806-not-filter-max-approx
Open

fix: tighter NOT-filter approximation for deep pagination#2812
alangmartini wants to merge 5 commits intotypesense:v31from
alangmartini:fix/2806-not-filter-max-approx

Conversation

@alangmartini
Copy link
Copy Markdown
Collaborator

@alangmartini alangmartini commented Mar 3, 2026

Summary

Fixes #2806.

Deep pagination with --enable-lazy-filter returns empty hits because approx_filter_ids_length undercounts for NOT-IN array filters. The OR-sum of excluded posting lists overcounts overlapping docs, so num_ids - OR_sum underestimates the NOT result size, causing the topster buffer to be too small.

String path: use num_ids - max(posting_list_sizes) instead of num_ids - OR_sum. Since union(S1..Sn) >= max(Si), this is provably safe while being tighter than the num_ids fallback from #2811. Preserves exact behavior for single-value NOT-equals (where max == sum). The OR-sum hint is kept for the eager-materialization threshold decision.

Int/Float paths: set approx_filter_ids_length = num_ids. The accumulation structure differs (each list stores num_ids - Si, not raw Si), so the max trick doesn't directly apply. The OR-sum hint is separated for threshold decisions.

  • Regression test: 100 docs with compound NOT-IN filter, verifies all 70 matching results are retrievable across all pages
  • Validated with 200K-doc reproducer script — all 140,000 matching results retrievable through pagination

Test plan

  • bazel build //:typesense-server passes
  • CollectionFilteringTest.LazyFilterNotInArrayDeepPagination passes (new regression test)
  • FilterTest.FilterTreeIterator passes (existing tags: != g*max == sum, assertion unchanged)
  • FilterTest.NotEqualsStringFilter passes (updated tags: != [gold, silver] assertion: 5 → 2)
  • 200K-doc reproducer script exits 0 (all found results retrievable)

…nse#2806)

Deep pagination with --enable-lazy-filter returns empty hits because
the topster buffer is sized using approx_filter_ids_length, which
undercounts for NOT-IN array filters. The OR-sum of excluded posting
lists overcounts overlapping docs, causing num_ids - OR_sum to
underestimate the NOT result size.

String path: use num_ids - max(posting_list_sizes) instead of
num_ids - OR_sum. This is provably safe (union >= max(Si)) and
preserves exact behavior for single-value NOT-equals. The OR-sum
hint is preserved for the eager-materialization threshold check.

Int/Float paths: set approx to num_ids (the accumulation structure
differs — each list already stores num_ids - Si, not raw Si). The
OR-sum hint is separated for threshold decisions.

Add regression test: 100 docs with compound NOT-IN filter, verifying
all 70 matching results are retrievable across all pages.
@alangmartini alangmartini force-pushed the fix/2806-not-filter-max-approx branch from 635f3a1 to 0bed162 Compare March 4, 2026 00:23
Multi-token string NOT filters can undercount lazy pagination because token-level estimates are not safe until exact value matching runs. Use a conservative bound for lazy NOT iterators and add regression coverage for deep pagination.
alangmartini added a commit to alangmartini/typesense that referenced this pull request Mar 14, 2026
alangmartini added a commit to alangmartini/typesense that referenced this pull request Mar 14, 2026
alangmartini added a commit to alangmartini/typesense that referenced this pull request Mar 14, 2026
Add LazyFilterNotInNumericOverlapDeepPagination test that exercises the
lazy NOT-filter approximation bug with overlapping int32[] posting lists.

10 docs with tags=[1,2] create overlapping posting lists (S1=S2=10).
OR-sum=20 causes old approx = num_ids-20 = 5 (underestimate vs actual 15).
With approx<=20 in TEST_BUILD, compute_iterators is skipped, topster is
capped at 5, and page 2 returns 0 hits instead of 5.

Verified: fails on v31 without fix, passes with PR typesense#2812 fix applied.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deep pagination returns empty hits when using NOT-IN array filters with --enable-lazy-filter

2 participants