Skip to content

fix: approx_distinct over-counts for utf8view#22815

Open
haohuaijin wants to merge 1 commit into
apache:mainfrom
haohuaijin:fix-approx-distinct
Open

fix: approx_distinct over-counts for utf8view#22815
haohuaijin wants to merge 1 commit into
apache:mainfrom
haohuaijin:fix-approx-distinct

Conversation

@haohuaijin
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

approx_distinct over-counted distinct values for Utf8View columns when the same short string appeared across batches with different layouts.

Arrow stores strings ≤ 12 bytes inline in the 128-bit view integer. The fast path (no data buffers) hashed these as raw u128. But when a batch also had a long string, it fell into a different branch that hashed all strings as &str — including the short inline ones. The same string hashed differently in different batches, so HyperLogLog counted it twice.

What changes are included in this PR?

  • StringViewHLLAccumulator::update_batch and Utf8ViewHasher: in mixed batches (data buffers present), short strings (≤ 12 bytes) are still hashed as the raw u128 view; only long strings hash as &str. This keeps hashing consistent regardless of batch layout.

  • Two regression tests:

    • utf8view_acc_split_batches_match_single_mixed_batch — scalar accumulator
    • utf8view_groups_short_string_hashed_consistently_across_batches — group accumulator

Are these changes tested?

Yes, two new regression tests cover the exact failure mode.

Are there any user-facing changes?

Yes. approx_distinct on Utf8View / VARCHAR VIEW columns now returns correct (lower) counts. Results may differ from the previously incorrect values.

@github-actions github-actions Bot added the functions Changes to functions implementation label Jun 8, 2026
@haohuaijin
Copy link
Copy Markdown
Contributor Author

cc @neilconway

@haohuaijin haohuaijin closed this Jun 8, 2026
@haohuaijin haohuaijin reopened this Jun 8, 2026
Copy link
Copy Markdown
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix and explanation! It LGTM.

Here is a refactor idea to make it potentially faster and simpler (follow up PR perhaps):

  1. Use
    pub fn create_hashes<'a, I, T>(
    for batched hashing
  2. Potentially move this fast path for StringView into create_hashes
  3. Simplify the update_batch() logic to hll.add_hashed() only

It seems we don't need to implement accumulator for different types with this approach, and it should also be faster since it vectorized the hashing step.

@haohuaijin
Copy link
Copy Markdown
Contributor Author

Thanks for you reviews @2010YOUY01 , i have try you suggestion and run the benchmark, but the performance almost same. i will check more, if any found, i will do a follower pr.

}

/// Reference count: fold the given distinct hashes straight into a dense
/// HyperLogLog. The grouped sketch must agree with this exactly.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the docs are removed ?
The functions are private but the information is useful, no ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed by mistake, but it is only in the test case, do we need get it back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

approx_distinct over-counts Utf8View because the hash strategy is chosen per batch instead of per value

3 participants