[DISCUSSION] Adopt datafusion-functions-json and datafusion-variant into core repo #21301

@alamb

Introduction

As semi-structured data processing becomes more important, I hear from more and more DataFusion users that they would like better support for JSON and Parquet Variant functions (see this blog post for more details).

Today, the function libraries for these two areas, datafusion-functions-json and datafusion-variant, live in datafusion-contrib rather than in the main Apache DataFusion repository.

Keeping them outside the main repository has benefits, such as faster iteration and not being tied to ASF release cycles.

However, it also means they are:

  1. outside ASF governance and release processes
  2. less discoverable to users
  3. harder to integrate and version downstream alongside DataFusion releases
  4. perceived as more experimental even when they solve common problems
  5. potentially less likely to attract outside maintenance contributions

Proposal

Bring these two crates into the main DataFusion repository, as we did for the Spark-related functionality. The crates would remain optional: they would not become part of the main datafusion crate, nor be enabled by a default feature flag.

I think we would need buy-in from the current maintainers/authors, including @pydantic, @adriangb, @friendlymatthew, and others.

We previously did this for Spark-compatible functions by bringing datafusion-spark into the core DataFusion repo because the functionality was widely useful and maintaining it in one place made contribution and coordination easier.

There was also recent discussion on the mailing list about using JSON functionality in the Python bindings, where this topic also came up: https://lists.apache.org/thread/f591qmhx97wsl7h5xfoh7sfhv2gh9t2k

Alternatives you've considered

  1. Keep these crates in datafusion-contrib indefinitely.
    This keeps the core repo smaller and preserves flexibility, but leaves the crates outside the main project's release and governance process.
  2. Keep them in datafusion-contrib, but improve discoverability and documentation.
    This helps users find them, but does not address governance, release coordination, or long-term maintenance.
  3. Bring in only one library at a time, starting with the most mature or most widely used.
    This is likely the lowest-risk path if there is agreement in principle but uncertainty about scope.
