fix: tighter NOT-filter approximation for deep pagination#2812
Open
alangmartini wants to merge 5 commits intotypesense:v31from
Open
fix: tighter NOT-filter approximation for deep pagination#2812alangmartini wants to merge 5 commits intotypesense:v31from
alangmartini wants to merge 5 commits intotypesense:v31from
Conversation
…nse#2806) Deep pagination with --enable-lazy-filter returns empty hits because the topster buffer is sized using approx_filter_ids_length, which undercounts for NOT-IN array filters. The OR-sum of excluded posting lists overcounts overlapping docs, causing num_ids - OR_sum to underestimate the NOT result size. String path: use num_ids - max(posting_list_sizes) instead of num_ids - OR_sum. This is provably safe (union >= max(Si)) and preserves exact behavior for single-value NOT-equals. The OR-sum hint is preserved for the eager-materialization threshold check. Int/Float paths: set approx to num_ids (the accumulation structure differs — each list already stores num_ids - Si, not raw Si). The OR-sum hint is separated for threshold decisions. Add regression test: 100 docs with compound NOT-IN filter, verifying all 70 matching results are retrievable across all pages.
635f3a1 to
0bed162
Compare
kneel2023
approved these changes
Mar 13, 2026
Multi-token string NOT filters can undercount lazy pagination because token-level estimates are not safe until exact value matching runs. Use a conservative bound for lazy NOT iterators and add regression coverage for deep pagination.
alangmartini
added a commit
to alangmartini/typesense
that referenced
this pull request
Mar 14, 2026
alangmartini
added a commit
to alangmartini/typesense
that referenced
this pull request
Mar 14, 2026
alangmartini
added a commit
to alangmartini/typesense
that referenced
this pull request
Mar 14, 2026
Add LazyFilterNotInNumericOverlapDeepPagination test that exercises the lazy NOT-filter approximation bug with overlapping int32[] posting lists. 10 docs with tags=[1,2] create overlapping posting lists (S1=S2=10). OR-sum=20 causes old approx = num_ids-20 = 5 (underestimate vs actual 15). With approx<=20 in TEST_BUILD, compute_iterators is skipped, topster is capped at 5, and page 2 returns 0 hits instead of 5. Verified: fails on v31 without fix, passes with PR typesense#2812 fix applied.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #2806.
Deep pagination with
--enable-lazy-filterreturns empty hits becauseapprox_filter_ids_lengthundercounts for NOT-IN array filters. The OR-sum of excluded posting lists overcounts overlapping docs, sonum_ids - OR_sumunderestimates the NOT result size, causing the topster buffer to be too small.String path: use
num_ids - max(posting_list_sizes)instead ofnum_ids - OR_sum. Sinceunion(S1..Sn) >= max(Si), this is provably safe while being tighter than thenum_idsfallback from #2811. Preserves exact behavior for single-value NOT-equals (wheremax == sum). The OR-sum hint is kept for the eager-materialization threshold decision.Int/Float paths: set
approx_filter_ids_length = num_ids. The accumulation structure differs (each list storesnum_ids - Si, not rawSi), so the max trick doesn't directly apply. The OR-sum hint is separated for threshold decisions.Test plan
bazel build //:typesense-serverpassesCollectionFilteringTest.LazyFilterNotInArrayDeepPaginationpasses (new regression test)FilterTest.FilterTreeIteratorpasses (existingtags: != g*—max == sum, assertion unchanged)FilterTest.NotEqualsStringFilterpasses (updatedtags: != [gold, silver]assertion: 5 → 2)