Introduction
As semi-structured data processing becomes more important, I hear from more and more DataFusion users that they would like better support for JSON and Parquet Variant functions (see this blog post for more details).
Today, function libraries for these two areas live in datafusion-contrib rather than in the main Apache DataFusion repository:
Keeping them outside the main repository has benefits, such as faster iteration and not being tied to ASF release cycles.
However, it also means they are:
- outside ASF governance and release processes
- less discoverable to users
- harder to integrate and version downstream alongside DataFusion releases
- perceived as more experimental even when they solve common problems
- potentially less likely to attract outside maintenance contributions
Proposal
Bring these two crates into the main DataFusion repository, similar to what we did for Spark-related functionality. The crates would remain optional: they would not become part of the main datafusion crate or be enabled by a default feature flag.
I think we would need buy-in from the current maintainers/authors, including @pydantic, @adriangb, @friendlymatthew, and others.
We previously did this for Spark-compatible functions by bringing datafusion-spark into the core DataFusion repo because the functionality was widely useful and maintaining it in one place made contribution and coordination easier.
There was also recent discussion on the mailing list about using JSON functionality in the Python bindings, where this topic also came up: https://lists.apache.org/thread/f591qmhx97wsl7h5xfoh7sfhv2gh9t2k
Alternatives you've considered
- Keep these crates in datafusion-contrib indefinitely.
This keeps the core repo smaller and preserves flexibility, but leaves the crates outside the main project's release and governance process.
- Keep them in datafusion-contrib, but improve discoverability and documentation.
This helps users find them, but does not address governance, release coordination, or long-term maintenance.
- Bring in only one library at a time, starting with the most mature or most widely used.
This is likely the lowest-risk path if there is agreement in principle but uncertainty about scope.