Skip to content

refactor(bigframes): Extract json conversions to distinct ops#17473

Draft
TrevorBergeron wants to merge 1 commit into
mainfrom
tbergeron_json-cast_extract
Draft

refactor(bigframes): Extract json conversions to distinct ops#17473
TrevorBergeron wants to merge 1 commit into
mainfrom
tbergeron_json-cast_extract

Conversation

@TrevorBergeron

Copy link
Copy Markdown
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors JSON casting by replacing the generic AsTypeOp with explicit ToJSON and JSONDecode operations across the compilation backends (Ibis, Polars, and SQLGlot). Feedback suggests optimizing DataFrame.astype to avoid a performance regression caused by column-by-column assignment on non-JSON types, and improving type safety in the Polars compiler by using a more specific type annotation for the ToJSON operation.

Comment on lines +451 to +454
result = self.copy()
for col in result.columns:
result[col] = result[col].astype(dtype, errors=errors)
return result

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Looping over columns and performing individual series assignments (result[col] = result[col].astype(...)) introduces a significant performance regression for non-JSON data types. In bigframes, column-by-column assignment creates multiple intermediate DataFrame states and nested expression trees, which can lead to slow compilation times and complex SQL queries.

We can avoid this overhead by fast-pathing the common cases where no JSON columns are involved, or where all columns are being cast to/from JSON, applying the operation to all columns at once using _apply_unary_op.

        safe = errors == "null"
        dtype = bigframes.dtypes.bigframes_type(dtype)

        if dtype == bigframes.dtypes.JSON_DTYPE:
            return self._apply_unary_op(ops.ToJSON(safe=safe))

        if all(t == bigframes.dtypes.JSON_DTYPE for t in self.dtypes):
            return self._apply_unary_op(ops.JSONDecode(to_type=dtype, safe=safe))

        if not any(t == bigframes.dtypes.JSON_DTYPE for t in self.dtypes):
            return self._apply_unary_op(ops.AsTypeOp(to_type=dtype, safe=safe))

        result = self.copy()
        for col in result.columns:
            result[col] = result[col].astype(dtype, errors=errors)
        return result

return input.str.json_decode(_DTYPE_MAPPING[op.to_type])

@compile_op.register(json_ops.ToJSON)
def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and better type safety, annotate the op parameter with the specific json_ops.ToJSON type instead of the generic ops.ScalarOp, matching the pattern used in other registered operations (such as ops.ToArrayOp and json_ops.JSONDecode).

Suggested change
def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr:
def _(self, op: json_ops.ToJSON, input: pl.Expr) -> pl.Expr:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant