fix(dataframe): handle FixedSizeBinary in describe#21455
fix(dataframe): handle FixedSizeBinary in describe#21455officialasishkumar wants to merge 3 commits intoapache:mainfrom
Conversation
kumarUjjawal
left a comment
There was a problem hiding this comment.
LGTM, can we add a mixed-schema test (numeric + FixedSizeBinary, or two columns where one is filtered) so the regression test actually covers the filter rather than the empty-aggregate fallback. The other items are nice-to-haves.
acad2de to
e7da510
Compare
officialasishkumar
left a comment
There was a problem hiding this comment.
Thanks for the review @kumarUjjawal! I've added a describe_mixed_numeric_and_fixed_size_binary test that creates a table with both an Int32 column and a FixedSizeBinary(3) column. This ensures the min/max filter path is exercised (the numeric column gets min/max computed while the FixedSizeBinary column is filtered out), rather than falling through to the empty-aggregate fallback that occurs when all columns are unsupported.
The branch has also been rebased on latest main.
|
@officialasishkumar Looks good, thanks. Can you update the pr body with mention of two new tests. |
|
@kumarUjjawal Updated the PR body to call out both new tests: |
Jefffrey
left a comment
There was a problem hiding this comment.
I'm a bit curious on what is the intended behaviour here. The docstring of describe states that it Only summarizes numeric datatypes at the moment and returns nulls for non numeric datatypes, but it seems we do support non-numeric types for min/max (except binary) 🤔
And it seems this error only occurs because we naively try to cast binary to utf8 for display purposes? If so would it be more correct to hex encode it for display?
I'm just wondering if we can try align this function instead of fixing a minor bug but still leaving it a bit ambiguous what describe is meant to actually support.
| | DataType::Binary | ||
| | DataType::LargeBinary | ||
| | DataType::BinaryView | ||
| | DataType::FixedSizeBinary(_) |
There was a problem hiding this comment.
Updated the describe min/max predicate to use DataType::is_binary() while keeping boolean excluded.
Skip min/max describe summaries for unsupported binary-like types so describe falls back to nulls instead of attempting an invalid Utf8 cast. Add a regression test for FixedSizeBinary and rerun the dataframe describe integration tests reported in apache#20273. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
…xedSizeBinary Add a test that combines numeric (Int32) and FixedSizeBinary columns to exercise the filter path in describe(), where min/max aggregations skip FixedSizeBinary but still compute results for numeric columns. This covers the filter rather than the empty-aggregate fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
203da3f to
05f4d07
Compare
Which issue does this PR close?
Rationale for this change
DataFrame::describe()builds min/max aggregates for non-numeric columns and then casts the results toUtf8for display. That works for strings, but it fails for unsupported binary-like outputs such asFixedSizeBinary, which currently causes describe to error instead of falling back to null summary values.What changes are included in this PR?
FixedSizeBinaryas an unsupported min/max describe type, alongside the other binary-like types that cannot be rendered through the currentUtf8cast path.describe()on aFixedSizeBinarycolumn.describe()with numeric andFixedSizeBinarycolumns so the filtered min/max path is exercised directly.Are these changes tested?
cargo test -p datafusion --test core_integration describe_fixed_size_binary -- --nocapturecargo test -p datafusion --test core_integration describe_mixed_numeric_and_fixed_size_binary -- --nocapturecargo test -p datafusion --test core_integration dataframe::describe:: -- --nocaptureAre there any user-facing changes?
describe()no longer errors onFixedSizeBinarycolumns; unsupported min/max summaries now fall back tonullas intended.