fix: approx_distinct over-counts for utf8view by haohuaijin · Pull Request #22815 · apache/datafusion

haohuaijin · 2026-06-08T03:38:26Z

Which issue does this PR close?

Closes #22796 (approx_distinct over-counts Utf8View because the hash strategy is chosen per batch instead of per value)

Rationale for this change

approx_distinct over-counted distinct values for Utf8View columns when the same short string appeared across batches with different layouts.

Arrow stores strings ≤ 12 bytes inline in the 128-bit view integer. The fast path (no data buffers) hashed these as raw u128. But when a batch also had a long string, it fell into a different branch that hashed all strings as &str — including the short inline ones. The same string hashed differently in different batches, so HyperLogLog counted it twice.
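The failure mode described above can be reproduced in miniature. The sketch below is not the DataFusion code: `inline_view`, `hash_batch_per_batch`, and `distinct_hashes` are hypothetical helpers, the standard library's `DefaultHasher` stands in for the real hash function, and "the batch contains a long string" is assumed to model "the batch has data buffers".

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Strings of at most this many bytes are stored inline in the view.
const MAX_INLINE_LEN: usize = 12;

/// Hash any hashable value with the standard library's default hasher.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Simplified stand-in for an inline view: 4 length bytes, then the string
/// bytes, zero-padded to 16 bytes. Hypothetical helper, not Arrow's API.
fn inline_view(s: &str) -> u128 {
    assert!(s.len() <= MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Buggy strategy: the representation to hash is chosen once per batch.
fn hash_batch_per_batch(batch: &[&str]) -> Vec<u64> {
    // Assumption: "contains a long string" models "has data buffers".
    let has_long = batch.iter().any(|s| s.len() > MAX_INLINE_LEN);
    batch
        .iter()
        .map(|s| {
            if has_long {
                hash_one(*s) // every string hashed as &str, even short ones
            } else {
                hash_one(&inline_view(s)) // every string hashed as raw u128
            }
        })
        .collect()
}

/// Number of distinct hashes seen across all batches, i.e. what an exact
/// (collision-free) sketch would report as the distinct count.
fn distinct_hashes(batches: &[Vec<&str>]) -> usize {
    let mut seen = HashSet::new();
    for batch in batches {
        seen.extend(hash_batch_per_batch(batch));
    }
    seen.len()
}
```

Feeding this the batches `["abc"]` and `["abc", <a string longer than 12 bytes>]` yields three distinct hashes for two distinct strings, because "abc" is hashed as a u128 in the first batch and as a &str in the second.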

What changes are included in this PR?

StringViewHLLAccumulator::update_batch and Utf8ViewHasher: in mixed batches (data buffers present), short strings (≤ 12 bytes) are still hashed as the raw u128 view; only long strings hash as &str. This keeps hashing consistent regardless of batch layout.
Two regression tests:
- utf8view_acc_split_batches_match_single_mixed_batch — scalar accumulator
- utf8view_groups_short_string_hashed_consistently_across_batches — group accumulator
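The per-value rule can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: `inline_view`, `hash_value`, and `hash_set` are hypothetical names, and the standard library's `DefaultHasher` stands in for the real hash function. The point is only that the choice between the u128 view and the &str depends on the string's own length, never on what else is in the batch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Strings of at most this many bytes are stored inline in the view.
const MAX_INLINE_LEN: usize = 12;

/// Hash any hashable value with the standard library's default hasher.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Simplified stand-in for an inline view: 4 length bytes, then the string
/// bytes, zero-padded to 16 bytes. Hypothetical helper, not Arrow's API.
fn inline_view(s: &str) -> u128 {
    assert!(s.len() <= MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Per-value strategy: short strings always hash as the raw u128 view,
/// long strings always hash as &str, regardless of the batch layout.
fn hash_value(s: &str) -> u64 {
    if s.len() <= MAX_INLINE_LEN {
        hash_one(&inline_view(s))
    } else {
        hash_one(s)
    }
}

/// Collect the set of hashes produced for a sequence of batches.
fn hash_set(batches: &[Vec<&str>]) -> HashSet<u64> {
    batches
        .iter()
        .flat_map(|batch| batch.iter().map(|s| hash_value(s)))
        .collect()
}
```

With this rule, splitting the same values across batches produces the same hash set as a single mixed batch, which is the property the first regression test name refers to.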

Are these changes tested?

Yes, two new regression tests cover the exact failure mode.

Are there any user-facing changes?

Yes. approx_distinct on Utf8View / VARCHAR VIEW columns now returns correct (lower) counts. Results may differ from the previously incorrect values.

haohuaijin · 2026-06-08T03:39:54Z

cc @neilconway

2010YOUY01

Thank you for the fix and explanation! It LGTM.

Here is a refactor idea to make it potentially faster and simpler (follow up PR perhaps):

- Use create_hashes for batched hashing:
  datafusion/datafusion/common/src/hash_utils.rs, line 1100 in c83a981
  pub fn create_hashes<'a, I, T>(
- Potentially move this fast path for StringView into create_hashes
- Simplify the update_batch() logic to hll.add_hashed() only

With this approach it seems we would not need to implement a separate accumulator for each type, and it should also be faster since it vectorizes the hashing step.

haohuaijin · 2026-06-08T13:27:37Z

Thanks for your review, @2010YOUY01. I tried your suggestion and ran the benchmark, but the performance is almost the same. I will look into it further, and if I find anything I will open a follow-up PR.

martin-g · 2026-06-08T14:21:31Z

    }

-    /// Reference count: fold the given distinct hashes straight into a dense
-    /// HyperLogLog. The grouped sketch must agree with this exactly.


Why were the docs removed?
The functions are private, but the information is useful, no?

haohuaijin · Jun 8, 2026

Removed by mistake, but it is only in the test case. Do we need to bring it back?

fix: approx_distinct over-counts for utf8view

4ecb88c

github-actions bot added the functions label (Changes to functions implementation) Jun 8, 2026

haohuaijin closed this Jun 8, 2026

haohuaijin reopened this Jun 8, 2026

2010YOUY01 approved these changes Jun 8, 2026


martin-g approved these changes Jun 8, 2026


haohuaijin wants to merge 1 commit into apache:main from haohuaijin:fix-approx-distinct

3 participants
