Skip to content

feat: extend single ndv optimization to non-arithmetic supporting types for equality predicates #21473

Merged
berkaysynnada merged 9 commits intoapache:mainfrom
buraksenn:extend-single-ndv-optimization
Apr 17, 2026
Merged

feat: extend single ndv optimization to non-arithmetic supporting types for equality predicates #21473
berkaysynnada merged 9 commits intoapache:mainfrom
buraksenn:extend-single-ndv-optimization

Conversation

@buraksenn
Copy link
Copy Markdown
Contributor

@buraksenn buraksenn commented Apr 8, 2026

Which issue does this PR close?

Rationale for this change

In the case of equality we can say ndv is 1 for non-numeric types as well. Please check issue for detailed explanation

What changes are included in this PR?

Adding a type-agnostic pattern-matching layer that inspects the predicate tree for col = literal shapes and sets NDV=Exact(1) regardless of column type

Are these changes tested?

Yes existing and additional unit tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 8, 2026
@asolimando
Copy link
Copy Markdown
Member

Thanks @buraksenn for tackling both #21109 and #21111 together with this type-agnostic approach, also the test consolidation over the separate per-type functions is a nice improvement.

A couple of minor notes/questions:

  • Since we can't rely on interval analysis for string-based columns, I am not sure we handle correctly "contradictory" filters like name = 'alice' AND name = 'bob', which should get Exact(0) as happening for numeric types. I don't see this as a blocker because it's not worse than before and this is an edge case, but it's good to know where we stand and file a follow-up issue, if needed.

  • On a similar note, I was wondering if something like col = NULL gets simplified as false (under the unknown as false semantics for predicates in WHERE clauses). In case it doesn't, it would be worth adding a test making sure we don't set NDV to 1.

  • Did you manage to understand why the interval analysis doesn't seem to work as expected for temporal types? It might be worth filing a separate issue for that if you have bandwidth, so we don't lose track of this issue.

@buraksenn
Copy link
Copy Markdown
Contributor Author

Thanks @asolimando for the review. I've checked this again. To make it generic I've missed that I can actually make temporal types work with interval arithmetic analysis since they are i64 backed as well.

Thus answering all:

  • I'm kinda afraid into getting this kind of complex behavior since I may not know what can be a side effect. Maybe I can add it to this PR and see the reviews
  • I've missed this totally will push to this branch.
  • For this, adding support for temporal types seem to be better as I've missed whiel trying to make it generic so I'll open a PR parallel to this one.

I'm working on both changes now and will handle this in the upcoming hours

@asolimando
Copy link
Copy Markdown
Member

Thanks @asolimando for the review. I've checked this again. To make it generic I've missed that I can actually make temporal types work with interval arithmetic analysis since they are i64 backed as well.

Sounds great!

Thus answering all:

  • I'm kinda afraid into getting this kind of complex behavior since I may not know what can be a side effect. Maybe I can add it to this PR and see the reviews

I think it's enough to add a test to see what happens, before we would have been returning the same NDV value for the input column anyway, I don't think we are doing worse than that, let's add a test and possibly file an issue to not lose the context we built.

  • I've missed this totally will push to this branch.
  • For this, adding support for temporal types seem to be better as I've missed whiel trying to make it generic so I'll open a PR parallel to this one.

No worries, making it type agnostic was a good call IMO, but what can be handled with interval arithmetic should do it as we can benefit from all the handling of edge cases like the "incoherent predicate" above and possibly others I couldn't think of.

I'm working on both changes now and will handle this in the upcoming hours

No worries, feel free to ping me if you need a review, happy to take a look, thanks for working on this and helping out on so many stats-related tasks :)

@buraksenn buraksenn changed the title feat: extend single ndv optimization to non-numeric types for equality predicates feat: extend single ndv optimization to non-arithmetic supporting types for equality predicates Apr 9, 2026
@buraksenn
Copy link
Copy Markdown
Contributor Author

Thanks @asolimando for the review. I've checked this again. To make it generic I've missed that I can actually make temporal types work with interval arithmetic analysis since they are i64 backed as well.

Sounds great!

Thus answering all:

  • I'm kinda afraid into getting this kind of complex behavior since I may not know what can be a side effect. Maybe I can add it to this PR and see the reviews

I think it's enough to add a test to see what happens, before we would have been returning the same NDV value for the input column anyway, I don't think we are doing worse than that, let's add a test and possibly file an issue to not lose the context we built.

  • I've missed this totally will push to this branch.
  • For this, adding support for temporal types seem to be better as I've missed whiel trying to make it generic so I'll open a PR parallel to this one.

No worries, making it type agnostic was a good call IMO, but what can be handled with interval arithmetic should do it as we can benefit from all the handling of edge cases like the "incoherent predicate" above and possibly others I couldn't think of.

I'm working on both changes now and will handle this in the upcoming hours

No worries, feel free to ping me if you need a review, happy to take a look, thanks for working on this and helping out on so many stats-related tasks :)

Thanks @asolimando I've opened #21520 temporal types I think it is very straightforward and contained PR and only left non-arithmetic supporting changes here.

Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing all the comments!

@asolimando
Copy link
Copy Markdown
Member

@buraksenn I think this is in good shape, you might want to ping a committer (ideally someone with context on this who helped merging related PRs).

@buraksenn
Copy link
Copy Markdown
Contributor Author

@buraksenn I think this is in good shape, you might want to ping a committer (ideally someone with context on this who helped merging related PRs).

Makes sense @asolimando thanks. @2010YOUY01, @xudong963 pinging you since you've taken a look to similar PRs. Can you review this PR whenever you've time to do please? (also @berkaysynnada)

Copy link
Copy Markdown
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @buraksenn. Mostly lgtm, let's iterate

Comment thread datafusion/physical-plan/src/filter.rs
Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment thread datafusion/physical-plan/src/filter.rs
@buraksenn
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review @berkaysynnada I've applied each comment and pushed the changes

Copy link
Copy Markdown
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, lgtm now

@berkaysynnada berkaysynnada added this pull request to the merge queue Apr 17, 2026
Merged via the queue into apache:main with commit 2f0ca3d Apr 17, 2026
35 checks passed
@berkaysynnada berkaysynnada deleted the extend-single-ndv-optimization branch April 17, 2026 10:06
github-merge-queue bot pushed a commit that referenced this pull request Apr 17, 2026
## Which issue does this PR close?
- Closes #21111

related with #21109 and its PR
#21473

## Rationale for this change
Temporal types can actually support internal arithmetics and with this
change internal arithmetics can now narrow down column boundaries and
selectivity instead of falling back to default selectivity of 1.0

## What changes are included in this PR?
Extend internal_arithmetic types with date and duration types

## Are these changes tested?
Yes adjusted tests

## Are there any user-facing changes?
no
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Extend single-value NDV optimization to string types

3 participants