Update datafusion dependency to latest in preparation for DF54 #1532

Draft
timsaucer wants to merge 3 commits into apache:main from timsaucer:feat/prepare-df-54

@timsaucer
Member

Which issue does this PR close?

We do not have an issue for this.

Rationale for this change

We are updating the upstream DataFusion dependency so that we can reduce the time to release 54 once the new version is released.

What changes are included in this PR?

Dependency bump:

  • [workspace.package].version: 53.0.0 → 54.0.0.
  • All datafusion* workspace deps switched from version = "53" to git = "https://github.com/apache/datafusion", rev = "3d06bedcc9afbd65781ac1de28741c36140d2cbb".
  • Cargo.lock refreshed for the datafusion family only.

Rust compile fixes (28 errors):

  • Drop as_any impls — upstream traits (AggregateUDFImpl, ScalarUDFImpl, WindowUDFImpl, SchemaProvider, CatalogProvider, CatalogProviderList, TableProvider, TableSource, ExecutionPlan) now have Any as a supertrait. Call sites switch from arc.as_any().downcast_ref::<T>() to the upstream-provided arc.downcast_ref::<T>() helper.
  • FFI provider conversions (Arc<dyn X + Send> → Arc<dyn X>): upstream From<&FFI_*> no longer carries the redundant + Send bound now that the traits require Send/Sync as supertraits.
  • Cast / TryCast: data_type: DataType → field: FieldRef. The Python PyCast.data_type() accessor is preserved.
  • Stub match arms for new Expr::HigherOrderFunction / Lambda / LambdaVariable variants returning Unsupported. Upstream HOFs are not shipped yet (apache/datafusion#14205).
  • Stub match arms for new ScalarValue::ListView / LargeListView. No 53.1.0 scalar functions produce these directly.
  • DatasetExec::partition_statistics returns Arc<Statistics>; add new required apply_expressions trait method (leaf returns Continue).
  • #[allow(deprecated)] on TableFunctionImpl::call pending a call_with_args migration that needs SessionState plumbing.

Python test fixes (23 expectations) for upstream behavior changes:

  • median / approx_median / approx_percentile_cont return Float64 (was matching input type).
  • String functions (concat_ws, lower, upper, repeat, reverse, split_part, translate) return StringView for StringView input (was String).
  • overlay appends past end-of-string rather than replacing.
  • arrays_zip / list_zip struct field names changed from c0/c1 to "1"/"2".
  • Filter on mismatched cast types now errors (was 0 matches).
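
For downstream code that still reads zipped structs by the old field names, a small compatibility shim is one option. The sketch below is plain Python and does not call DataFusion; `rename_zip_fields` is a hypothetical helper, and it assumes the renaming generalizes as c0 → "1", c1 → "2", c2 → "3", which the PR only confirms for the first two fields.

```python
import re

# Matches the old-style struct field names: c0, c1, c2, ...
_OLD_FIELD = re.compile(r"c(\d+)")


def rename_zip_fields(row):
    """Return a copy of a struct row (as a dict) with old-style keys renamed.

    Assumption: old field c<N> corresponds to new field str(N + 1).
    Keys that do not look like old-style names are left untouched.
    """
    renamed = {}
    for key, value in row.items():
        match = _OLD_FIELD.fullmatch(key)
        new_key = str(int(match.group(1)) + 1) if match else key
        renamed[new_key] = value
    return renamed


old_row = {"c0": 1, "c1": "a"}
print(rename_zip_fields(old_row))  # {'1': 1, '2': 'a'}
```

Rows that already use the new names pass through unchanged, so the helper can be applied regardless of which DataFusion version produced the data.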

check-upstream audit trivial wins:

  • New DataFrame.alias(name) — wraps the logical plan in a SubqueryAlias for self-joins and qualifier-style references.
  • functions.__all__: add instr and position (both already defined as public defs but missing from __all__).
  • Top-level datafusion.__all__: re-export TableProviderFactory and TableProviderFactoryExportable (previously reachable only via the datafusion.catalog submodule).
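
Why the `__all__` additions matter can be shown without DataFusion at all. The sketch below builds two throwaway modules (the names `demo_functions_before` and `demo_functions_after` are hypothetical stand-ins, not real datafusion modules) that define the same public functions but differ in `__all__`, then checks what a star import picks up from each.

```python
import sys
import types


def make_module(name, exported):
    """Create an in-memory module defining instr, position and lower."""
    mod = types.ModuleType(name)
    exec(
        "def instr(): return 'instr'\n"
        "def position(): return 'position'\n"
        "def lower(): return 'lower'\n",
        mod.__dict__,
    )
    # Only names listed here are picked up by `from <module> import *`.
    mod.__all__ = list(exported)
    sys.modules[name] = mod
    return mod


def star_import(name):
    """Return the sorted public names a star import of `name` provides."""
    namespace = {}
    exec(f"from {name} import *", namespace)
    return sorted(k for k in namespace if not k.startswith("__"))


make_module("demo_functions_before", ["lower"])
make_module("demo_functions_after", ["instr", "lower", "position"])

print(star_import("demo_functions_before"))  # ['lower']
print(star_import("demo_functions_after"))   # ['instr', 'lower', 'position']
```

In the "before" module, `instr` and `position` remain reachable through explicit attribute access; they are only hidden from star imports and from tooling that relies on `__all__`.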

Are there any user-facing changes?

Yes — several behavior changes inherited from upstream DataFusion 54 (warrants api change label):

  • median / approx_median / approx_percentile_cont now return Float64 rather than matching the input type.
  • String functions return StringView when fed StringView input (concat_ws, lower, upper, repeat, reverse, split_part, translate).
  • overlay semantics: passing a start position past the end of a string now appends the replacement, e.g. overlay("!", "--", 2) → "!--" (was "--").
  • arrays_zip / list_zip field names changed: c0/c1 → "1"/"2".
  • Comparing a numeric column against an incompatible string literal in a filter now raises a Cannot cast string error, where previously it silently produced zero matches.
  • New: DataFrame.alias(name); instr and position now appear under from datafusion.functions import *; TableProviderFactory and TableProviderFactoryExportable are now reachable from the top-level datafusion namespace.
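
The new overlay result is consistent with the usual SQL reading of the function: keep everything before the start position, insert the replacement, then keep whatever follows the replaced span. The sketch below is a pure-Python model of that reading, not DataFusion's implementation; `overlay_model` is a hypothetical name, and the default length (the length of the replacement) is an assumption borrowed from the PostgreSQL-style definition.

```python
def overlay_model(text, replacement, start, length=None):
    """Model of SQL-style overlay with a 1-based start position.

    Assumption: when `length` is omitted, it defaults to the length of
    the replacement string.
    """
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    if length is None:
        length = len(replacement)
    head = text[: start - 1]              # characters before the start position
    tail = text[start - 1 + length:]      # characters after the replaced span
    return head + replacement + tail


# Start position one past the end of the string: the replacement is appended.
print(overlay_model("!", "--", 2))            # !--
# A start position inside the string replaces `length` characters.
print(overlay_model("Txxxxas", "hom", 2, 4))  # Thomas
```

Under this model the old result of "--" would correspond to dropping the head of the string, which is why tests that encoded that output needed updating.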

Follow-ups (not in this PR)

The check-upstream audit surfaced additional non-trivial gaps that each warrant their own design and PR:

  • DataFrame.registry / into_optimized_plan / into_unoptimized_plan / into_parts / task_ctx — each needs a new wrapper class (e.g. FunctionRegistry, SessionState, TaskContext).
  • SessionContext extensibility surface — I/O helpers (read_batch/read_batches), planner/rule extension (add_optimizer_rule, add_analyzer_rule, register_expr_planner, register_relation_planner, with_function_factory), state access (state, runtime_env, task_ctx, new_with_state), UDF introspection (udf/udaf/udwf lookup + listing), and misc helpers (create_physical_expr, table_function, table_factory, parse_capacity_limit). Tracked under EPIC #24.
  • Distinct-aware aggregates: count_distinct, sum_distinct, avg_distinct. Upstream design at apache/arrow-datafusion#2407.
  • TableFunctionImpl::call_with_args migration — needs SessionState plumbing through PyTableFunction. Will be a user-facing API change.
  • FFI Protocol pipeline completions for FFI_TableFunction (from_pycapsule, TableFunctionExportable, ABC), FFI_LogicalExtensionCodec, FFI_ExtensionOptions.
  • Scalar get_field_path (variant of get_field taking a path expression).

timsaucer and others added 2 commits May 11, 2026 09:03
Bump workspace deps to apache/datafusion@3d06bedc (git pin) in
preparation for the 54.0.0 release. Workspace package version moves
to 54.0.0 to track the upstream major convention.

Compile fixes:
- Drop as_any impls (trait now has Any as supertrait) and use the
  upstream-provided downcast_ref helper on dyn trait objects.
- Reconcile FFI provider From conversions to drop redundant `+ Send`
  on Arc<dyn ...> bounds.
- Cast/TryCast: data_type → field.data_type() (FieldRef rename).
- Stub match arms for new Expr::HigherOrderFunction / Lambda /
  LambdaVariable and ScalarValue::ListView / LargeListView variants;
  proper exposure deferred to PR 3 audit.
- DatasetExec: partition_statistics returns Arc<Statistics>; add
  required apply_expressions trait method.
- Suppress TableFunctionImpl::call deprecation pending call_with_args
  refactor that needs Session plumbing.

User-facing test updates for upstream behavior changes:
- median / approx_median / approx_percentile_cont now return Float64.
- String functions (concat_ws, lower, upper, repeat, reverse,
  split_part, translate) return StringView when given StringView.
- overlay appends past end-of-string rather than replacing the input.
- arrays_zip / list_zip struct field names "c0"/"c1" → "1"/"2".
- Filter on mismatched cast types now errors (was 0 matches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the upstream DataFusion 53 → main bump. The
check-upstream audit (PR 3 of dev/release/upstream-sync.md) surfaced a
small set of trivial wins; this commit ships them.

Trivial wins:
- DataFrame.alias(name) — wraps the logical plan in a SubqueryAlias.
- functions.__all__: add `instr` and `position` (both were defined as
  public defs but missing from `__all__`, so they didn't show up in
  `from datafusion.functions import *` or generated docs).
- top-level `datafusion.__all__`: re-export `TableProviderFactory` and
  `TableProviderFactoryExportable` (previously only reachable via the
  `datafusion.catalog` submodule).

Non-trivial gaps surfaced by the audit (DataFrame.registry,
into_*/task_ctx, SessionContext extensibility surface, distinct-aware
aggregate variants, TableFunctionImpl::call_with_args migration, FFI
Protocol pipeline gaps) are deferred — each warrants its own design
and PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer changed the title Feat/prepare df 54 Update datafusion dependency to latest in preparation for DF54 May 11, 2026