Conversation

Contributor

@shuoweil shuoweil commented Oct 31, 2025

This commit addresses an issue where creating empty DataFrames with nested JSON columns would fail due to PyArrow's inability to create empty arrays with db_dtypes.JSONArrowType (Apache Arrow issue #45262).

Changes:

  • First tries to create an empty Arrow table directly from the schema
  • If that fails with pa.ArrowNotImplementedError, falls back to using storage types (use_storage_types=True)
  • Converts the Arrow table to pandas, which properly preserves dtypes

This workaround is specifically needed for the anywidget backend, which uses to_pandas_batches().

Fixes #<456577463> 🦕

@shuoweil shuoweil self-assigned this Oct 31, 2025
@shuoweil shuoweil requested review from a team as code owners October 31, 2025 00:02

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Oct 31, 2025
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: l Pull request size is large. labels Oct 31, 2025

@tswast tswast left a comment


Thanks! Looks good to me once the typing and presubmit failures are addressed.

Comment on lines 725 to 748
for col in itertools.chain(self.value_columns, self.index_columns):
    dtype = self.expr.get_column_type(col)
    if bigframes.dtypes.contains_db_dtypes_json_dtype(dtype):
        # Due to a limitation in Apache Arrow (#45262), JSON columns are not
        # natively supported by the to_pandas_batches() method, which is
        # used by the anywidget backend.
        # Workaround for https://github.com/googleapis/python-bigquery-dataframes/issues/1273
        # PyArrow doesn't support creating an empty array with db_dtypes.JSONArrowType,
        # especially when nested.
        # Create with string type and then cast.

        # MyPy doesn't automatically narrow the type of 'dtype' here,
        # so we add an explicit check.
        if isinstance(dtype, pd.ArrowDtype):
            safe_pa_type = bigframes.dtypes._replace_json_arrow_with_string(
                dtype.pyarrow_dtype
            )
            safe_dtype = pd.ArrowDtype(safe_pa_type)
            series_map[col] = pd.Series([], dtype=safe_dtype).astype(dtype)
        else:
            # This branch should ideally not be reached if
            # contains_db_dtypes_json_dtype is accurate,
            # but it's here for MyPy's sake.
            series_map[col] = pd.Series([], dtype=dtype)
Collaborator

@chelsea-lin I assume we have similar code that does this, right? Maybe there's something that could be reused here?

Contributor

Yeah, we have something similar in the loader component but they're slightly different.

def _validate_dtype_can_load(name: str, column_type: bigframes.dtypes.Dtype):

Also, I agree that we can simplify the logic a little bit, for example:

dtype = pd.ArrowDtype(pa.list_(pa.struct([("key", db_dtypes.JSONArrowType())])))
try:
    s = pd.Series([], dtype=dtype)
except pa.ArrowNotImplementedError as e:
    s = pd.Series([], dtype=pd.ArrowDtype(_replace_json_arrow_with_string(dtype.pyarrow_dtype))).astype(dtype)

Contributor Author

@shuoweil shuoweil Oct 31, 2025


The logic has been simplified.

Contributor

Thanks! The new logic looks even better!

@tswast tswast changed the title fix: Pyarrow limitation with empty nested JSON arrays in to_pandas_batches() fix: support results with STRUCT and ARRAY columns containing JSON subfields in to_pandas_batches() Oct 31, 2025
Collaborator

tswast commented Oct 31, 2025

Nit: I renamed the PR to be a little more user-oriented. Users don't care as much about the internal limitations. What changed from their perspective is that they can read STRUCT<JSON> and ARRAY<JSON> now.

return contains_db_dtypes_json_arrow_type(dtype.pyarrow_dtype)


def _replace_json_arrow_with_string(pa_type: pa.DataType) -> pa.DataType:
Contributor

This function may be similar to the following two methods. Can you help remove the one in loader.py?

def _has_json_arrow_type(arrow_type: pa.DataType) -> bool:

def contains_db_dtypes_json_arrow_type(type_):

Contributor Author

@shuoweil shuoweil Oct 31, 2025


Since I removed this function, this code refactor is no longer relevant to this PR. I will start a new PR (#2221) for this code refactor.

The unused function was removed from bigframes/dtypes.py, and bigframes/core/blocks.py now handles construction of the empty DataFrame with the more robust try...except block that leverages to_pyarrow and empty_table.
@shuoweil shuoweil added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 1, 2025
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 1, 2025
@shuoweil shuoweil added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 3, 2025
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 3, 2025
try:
    empty_arrow_table = self.expr.schema.to_pyarrow().empty_table()
except pa.ArrowNotImplementedError:
    # Bug in some pyarrow versions: empty_table only supports base storage types, not extension types.
Contributor

Nit: can you please add the bug ID in the docs?

@shuoweil shuoweil requested a review from chelsea-lin November 3, 2025 23:58
@shuoweil shuoweil enabled auto-merge (squash) November 3, 2025 23:59
@shuoweil shuoweil disabled auto-merge November 4, 2025 00:10
@shuoweil shuoweil merged commit 3d8b17f into main Nov 4, 2025
25 checks passed
@shuoweil shuoweil deleted the shuowei-json-empty-dataframe branch November 4, 2025 01:27
sycai pushed a commit that referenced this pull request Nov 10, 2025
🤖 I have created a release *beep* *boop*
---


## [2.29.0](v2.28.0...v2.29.0) (2025-11-10)


### Features

* Add bigframes.bigquery.st_regionstats to join raster data from Earth Engine ([#2228](#2228)) ([10ec52f](10ec52f))
* Add DataFrame.resample and Series.resample ([#2213](#2213)) ([c9ca02c](c9ca02c))
* SQL Cell no longer escapes formatted string values ([#2245](#2245)) ([d2d38f9](d2d38f9))
* Support left_index and right_index for merge ([#2220](#2220)) ([da9ba26](da9ba26))


### Bug Fixes

* Correctly iterate over null struct values in ManagedArrowTable ([#2209](#2209)) ([12e04d5](12e04d5))
* Simplify UnsupportedTypeError message ([#2212](#2212)) ([6c9a18d](6c9a18d))
* Support results with STRUCT and ARRAY columns containing JSON subfields in `to_pandas_batches()` ([#2216](#2216)) ([3d8b17f](3d8b17f))


### Documentation

* Switch API reference docs to pydata theme ([#2237](#2237)) ([9b86dcf](9b86dcf))
* Update notebook for JSON subfields support in to_pandas_batches() ([#2138](#2138)) ([5663d2a](5663d2a))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
