[SPARK-54068][PYTHON] Fix `to_feather` to support PyArrow 22.0.0 #53377

ashrithb · 2025-12-07T17:50:54Z

What changes were proposed in this pull request?

This PR fixes the test_to_feather test failure with PyArrow 22.0.0 by filtering
non-serializable attrs (metrics, observed_metrics) before writing to feather format.

Changes:

1. Modified to_feather() in pyspark/pandas/frame.py to filter out non-serializable
   attrs before passing to PyArrow
2. Removed the @unittest.skipIf workaround from test_to_feather
3. Added to_dict() methods to MetricValue, PlanMetrics, and PlanObservedMetrics
   for future utility (not used in the fix, but useful additions)
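A to_dict() method of the kind described in item 3 could look roughly like the sketch below. The class names and fields here are hypothetical stand-ins, not the real MetricValue or PlanMetrics definitions; the point is only that a nested object becomes JSON-friendly once each level returns plain dicts and lists.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MetricValueSketch:
    """Hypothetical stand-in; the field names are assumptions, not the real MetricValue."""

    name: str
    value: int

    def to_dict(self) -> Dict[str, Any]:
        return {"name": self.name, "value": self.value}


@dataclass
class PlanMetricsSketch:
    """Hypothetical stand-in for a plan node that owns a list of metric values."""

    name: str
    metrics: List[MetricValueSketch] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Convert nested objects too, so the result contains only plain types.
        return {"name": self.name, "metrics": [m.to_dict() for m in self.metrics]}


plan = PlanMetricsSketch("Scan", [MetricValueSketch("numOutputRows", 3)])
encoded = json.dumps(plan.to_dict())  # succeeds because only dicts, lists, str, int remain
```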

Why are the changes needed?

PyArrow 22.0.0 changed its behavior to serialize pandas DataFrame.attrs to JSON
metadata when writing Feather files. PySpark Spark Connect stores PlanMetrics and
PlanObservedMetrics objects in pdf.attrs, which are not JSON serializable, causing:

TypeError: Object of type PlanMetrics is not JSON serializable
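The failure mode can be reproduced without Spark or PyArrow, since it comes down to json.dumps meeting an arbitrary Python object. A minimal sketch, assuming a stand-in class (FakePlanMetrics is hypothetical and only mimics an object stored in attrs):

```python
import json


class FakePlanMetrics:
    """Hypothetical stand-in for an object stored in pdf.attrs; not the real PlanMetrics."""

    def __init__(self, name: str) -> None:
        self.name = name


def try_serialize(attrs: dict) -> str:
    """Return the JSON text, or the error message if json.dumps rejects a value."""
    try:
        return json.dumps(attrs)
    except TypeError as exc:
        return f"TypeError: {exc}"


attrs = {"metrics": [FakePlanMetrics("scan")], "source": "demo"}
result = try_serialize(attrs)
print(result)  # reports that FakePlanMetrics is not JSON serializable
```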

Does this PR introduce any user-facing change?

No. The fix filters internal Spark metadata (metrics, observed_metrics) from attrs
only when writing to feather format. Code that directly accesses pdf.attrs["metrics"]
(like test_observe) continues to work with the original objects.
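Why the original objects stay reachable can be illustrated with plain dicts. This is a sketch under the assumption that the filtered mapping is a new dict built for the write, so the mapping the caller already holds is never mutated; filter_internal_attrs is a hypothetical helper name, not a PySpark function.

```python
INTERNAL_KEYS = ("metrics", "observed_metrics")


def filter_internal_attrs(attrs: dict) -> dict:
    """Return a new dict without the internal keys; the input dict is left untouched."""
    return {k: v for k, v in attrs.items() if k not in INTERNAL_KEYS}


original = {"metrics": [object()], "observed_metrics": [], "unit": "ms"}
filtered = filter_internal_attrs(original)

# The filtered copy carries only user-level attrs; the original still has everything.
print(sorted(filtered))
print(sorted(original))
```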

How was this patch tested?

- Verified that pdf.attrs["metrics"][0].name still works (backward compatibility)
- Verified that feather write succeeds with PyArrow 22.0.0 when attrs are filtered
- Removed the @unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068") workaround so test_to_feather now runs on all PyArrow versions
- All existing tests pass, including test_observe, which accesses attrs directly

Was this patch authored or co-authored using generative AI tooling?

No.

dongjoon-hyun

Thank you, @ashrithb .

dongjoon-hyun

cc @HyukjinKwon , @zhengruifeng

…22.0.0 Filter non-serializable attrs (metrics, observed_metrics) before feather write instead of converting them to dicts, which preserves backward compatibility for code that accesses pdf.attrs directly. Also adds to_dict() methods to MetricValue and PlanMetrics for future use.

dongjoon-hyun · 2025-12-07T20:33:55Z

python/pyspark/pandas/frame.py

+        # JSON serializable. We clear these attrs since they are internal
+        # execution metadata not needed in the output file.
+        pdf.attrs = {k: v for k, v in pdf.attrs.items()
+                     if k not in ("metrics", "observed_metrics")}


dongjoon-hyun

observed_metrics instead of observed_metric_*?

Maybe, can we do this only for if LooseVersion(pa.__version__) < LooseVersion("22.0.0"): safely in order to avoid any regressions?

ashrithb

> observed_metrics instead of observed_metric_*?

Yes. From what I can see, core.py sets a single observed_metrics key that holds a list of PlanObservedMetrics objects, so I think the usage here is right.

> Maybe, can we do this only for if LooseVersion(pa.__version__) < LooseVersion("22.0.0"): safely in order to avoid any regressions?

Hmm, yeah, it may be better to err on the side of caution here even though the change only touches internal metadata. I'll add this logic in then!
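A version-gated filter along the lines discussed might look like the sketch below. It is not the merged code: version_tuple and attrs_for_feather are hypothetical helpers, the version string is passed in rather than read from pa.__version__ so the example runs without PyArrow, and the comparison direction is an assumption (the sketch filters at 22.0.0 and above, where the failure is reported, and leaves attrs untouched for older releases).

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '22.0.0.dev1' -> (22, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '22.0' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def attrs_for_feather(attrs: dict, pyarrow_version: str) -> dict:
    """Drop internal keys only for PyArrow versions assumed to serialize attrs."""
    if version_tuple(pyarrow_version) >= (22, 0, 0):
        return {k: v for k, v in attrs.items() if k not in ("metrics", "observed_metrics")}
    return attrs


attrs = {"metrics": ["internal"], "unit": "ms"}
print(attrs_for_feather(attrs, "21.0.0"))  # unchanged for older versions
print(attrs_for_feather(attrs, "22.0.0"))  # internal keys removed
```

Gating on the version keeps the behavior on older PyArrow releases exactly as it was, which is the regression concern raised in the review.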

dongjoon-hyun

+1, LGTM. Thank you, @ashrithb .

Merged to master/4.1 for Apache Spark 4.1.0.

Closes #53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix.

Authored-by: ashrithb <ashrithlb@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4e1e995)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2025-12-08T05:20:31Z

I added you to the Apache Spark contributor group and assigned SPARK-54068 to you, @ashrithb .

Welcome to the Apache Spark community and congratulations on your first commit, @ashrithb.

Closes apache#53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix.

Authored-by: ashrithb <ashrithlb@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

github-actions bot added the SQL, PYTHON, PANDAS API ON SPARK, and CONNECT labels on Dec 7, 2025

dongjoon-hyun reviewed Dec 7, 2025

ashrithb force-pushed the SPARK-54068-pyarrow-feather-planmetrics-fix branch from 3e11929 to 418ae0d on December 7, 2025 20:14

dongjoon-hyun reviewed Dec 7, 2025

addressing comments, adding in version check

43d2fa6

dongjoon-hyun changed the title ~~[SPARK-54068][PYTHON] Fix PySpark feather serialization with PyArrow 22.0.0~~ [SPARK-54068][PYTHON] Fix to_feather to support PyArrow 22.0.0 Dec 8, 2025

dongjoon-hyun approved these changes Dec 8, 2025

dongjoon-hyun closed this in 4e1e995 Dec 8, 2025
