fix: approx_distinct over-counts for utf8view#22815
Conversation
|
cc @neilconway |
There was a problem hiding this comment.
Thank you for the fix and explanation! It LGTM.
Here is a refactor idea to make it potentially faster and simpler (follow up PR perhaps):
- Use for batched hashing
datafusion/datafusion/common/src/hash_utils.rs
Line 1100 in c83a981
- Potentially move this fast path for
StringViewintocreate_hashes - Simplify the
update_batch()logic tohll.add_hashed()only
It seems we don't need to implement accumulator for different types with this approach, and it should also be faster since it vectorized the hashing step.
|
Thanks for you reviews @2010YOUY01 , i have try you suggestion and run the benchmark, but the performance almost same. i will check more, if any found, i will do a follower pr. |
| } | ||
|
|
||
| /// Reference count: fold the given distinct hashes straight into a dense | ||
| /// HyperLogLog. The grouped sketch must agree with this exactly. |
There was a problem hiding this comment.
Why the docs are removed ?
The functions are private but the information is useful, no ?
There was a problem hiding this comment.
removed by mistake, but it is only in the test case, do we need get it back.
Which issue does this PR close?
approx_distinctover-countsUtf8Viewbecause the hash strategy is chosen per batch instead of per value #22796Rationale for this change
approx_distinctover-counted distinct values forUtf8Viewcolumns when the same short string appeared across batches with different layouts.Arrow stores strings ≤ 12 bytes inline in the 128-bit view integer. The fast path (no data buffers) hashed these as raw
u128. But when a batch also had a long string, it fell into a different branch that hashed all strings as&str— including the short inline ones. The same string hashed differently in different batches, so HyperLogLog counted it twice.What changes are included in this PR?
StringViewHLLAccumulator::update_batchandUtf8ViewHasher: in mixed batches (data buffers present), short strings (≤ 12 bytes) are still hashed as the rawu128view; only long strings hash as&str. This keeps hashing consistent regardless of batch layout.Two regression tests:
utf8view_acc_split_batches_match_single_mixed_batch— scalar accumulatorutf8view_groups_short_string_hashed_consistently_across_batches— group accumulatorAre these changes tested?
Yes, two new regression tests cover the exact failure mode.
Are there any user-facing changes?
Yes.
approx_distinctonUtf8View/VARCHAR VIEWcolumns now returns correct (lower) counts. Results may differ from the previously incorrect values.