@ashrithb ashrithb commented Dec 7, 2025

What changes were proposed in this pull request?

This PR fixes the test_to_feather test failure with PyArrow 22.0.0 by filtering
non-serializable attrs (metrics, observed_metrics) before writing to feather format.

Changes:

  1. Modified to_feather() in pyspark/pandas/frame.py to filter out non-serializable
    attrs before passing to PyArrow
  2. Removed the @unittest.skipIf workaround from test_to_feather
  3. Added to_dict() methods to MetricValue, PlanMetrics, and PlanObservedMetrics
    for future utility (not used in the fix, but useful additions)
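
As a rough illustration of what a `to_dict()` helper of this kind could look like, here is a minimal sketch. The classes below are hypothetical stand-ins, not the PySpark classes, and their field names (`name`, `value`, `type`, `metrics`) are assumptions made only for this example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MetricValueSketch:
    """Hypothetical stand-in for a single metric; fields are assumed."""

    name: str
    value: float
    type: str

    def to_dict(self) -> Dict[str, Any]:
        # Only plain str/number values, so the result is JSON friendly.
        return {"name": self.name, "value": self.value, "type": self.type}


@dataclass
class PlanMetricsSketch:
    """Hypothetical stand-in for a plan node that owns several metrics."""

    name: str
    metrics: List[MetricValueSketch] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Convert nested metric objects recursively.
        return {"name": self.name, "metrics": [m.to_dict() for m in self.metrics]}


plan = PlanMetricsSketch("Project", [MetricValueSketch("numOutputRows", 3, "sum")])
encoded = json.dumps(plan.to_dict())
print(encoded)
```

A conversion like this makes the object JSON friendly, but replacing the stored objects with dicts would change what callers of `pdf.attrs["metrics"]` receive, which is why the fix filters instead.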

Why are the changes needed?

PyArrow 22.0.0 changed its behavior to serialize pandas DataFrame.attrs to JSON
metadata when writing Feather files. PySpark Spark Connect stores PlanMetrics and
PlanObservedMetrics objects in pdf.attrs, which are not JSON serializable, causing:

    TypeError: Object of type PlanMetrics is not JSON serializable
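
The failure mode can be reproduced with the standard library alone. The sketch below does not use PyArrow or PySpark; `PlanMetricsStandIn` and `try_dump` are hypothetical names, and the dict merely imitates an attrs mapping that holds an arbitrary Python object.

```python
import json


class PlanMetricsStandIn:
    """Hypothetical stand-in for a metrics object; not the PySpark class."""

    def __init__(self, name):
        self.name = name


# An attrs-like dict with one plain value and one arbitrary Python object.
attrs = {"source": "example", "metrics": [PlanMetricsStandIn("Project")]}


def try_dump(mapping):
    """Return the JSON string, or the TypeError message if encoding fails."""
    try:
        return json.dumps(mapping)
    except TypeError as exc:
        return f"TypeError: {exc}"


result = try_dump(attrs)
print(result)
```

The default JSON encoder only knows basic types (dict, list, str, numbers, booleans, None), so any custom object in the mapping makes the whole dump fail.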

Does this PR introduce any user-facing change?

No. The fix filters internal Spark metadata (metrics, observed_metrics) from attrs
only when writing to feather format. Code that directly accesses pdf.attrs["metrics"]
(like test_observe) continues to work with the original objects.
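
A minimal sketch of the filtering idea, using plain dicts rather than a pandas DataFrame. The helper name `serializable_attrs` is hypothetical; the two key names are the ones listed in this description.

```python
import json

# Keys named in the PR description as internal Spark metadata.
INTERNAL_KEYS = ("metrics", "observed_metrics")


def serializable_attrs(attrs):
    """Return a new dict without the internal keys; the input is left untouched."""
    return {k: v for k, v in attrs.items() if k not in INTERNAL_KEYS}


original = {"metrics": [object()], "observed_metrics": [object()], "note": "kept"}
filtered = serializable_attrs(original)

# The filtered copy can be encoded, while the original still holds its objects.
print(json.dumps(filtered))
print(sorted(original))
```

Because the comprehension builds a new dict, any code still holding the original mapping keeps seeing the original objects.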

How was this patch tested?

  • Verified that pdf.attrs["metrics"][0].name still works (backward compatibility)
  • Verified that feather write succeeds with PyArrow 22.0.0 when attrs are filtered
  • Removed the @unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068") workaround so test_to_feather now runs on all PyArrow versions
  • All existing tests pass, including test_observe, which accesses attrs directly

Was this patch authored or co-authored using generative AI tooling?

No.


@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @ashrithb .

…22.0.0

Filter non-serializable attrs (metrics, observed_metrics) before feather write
instead of converting them to dicts, which preserves backward compatibility
for code that accesses pdf.attrs directly.

Also adds to_dict() methods to MetricValue and PlanMetrics for future use.
@ashrithb ashrithb force-pushed the SPARK-54068-pyarrow-feather-planmetrics-fix branch from 3e11929 to 418ae0d Compare December 7, 2025 20:14
# JSON serializable. We clear these attrs since they are internal
# execution metadata not needed in the output file.
pdf.attrs = {
    k: v for k, v in pdf.attrs.items() if k not in ("metrics", "observed_metrics")
}

Member

observed_metrics instead of observed_metric_*?

Member

Maybe, can we do this only for if LooseVersion(pa.__version__) < LooseVersion("22.0.0"): safely in order to avoid any regressions?

Contributor Author

> observed_metrics instead of observed_metric_*?

Yes. From what I can see, the key set in core.py is a single observed_metrics key that holds a list of PlanObservedMetrics objects, so I think the usage here is right.

Contributor Author

> Maybe, can we do this only for if LooseVersion(pa.__version__) < LooseVersion("22.0.0"): safely in order to avoid any regressions?

Hmm, yeah, it may be better to err on the side of caution here, even though the change only touches internal metadata. I'll add this logic then!
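
One way such a version gate could be sketched is shown below. The quoted snippet writes the comparison with `<`; since the failure is described as appearing with PyArrow 22.0.0, this sketch assumes the filter is meant to apply at 22.0.0 and above. It also uses a small hypothetical tuple parser instead of `LooseVersion`, and takes the version as a string argument so it runs without PyArrow installed.

```python
def version_tuple(version):
    """Parse leading numeric components, e.g. '22.0.0' -> (22, 0, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first non-numeric component such as 'dev123'.
            break
        parts.append(int(digits))
    return tuple(parts)


def should_filter_attrs(pyarrow_version, threshold="22.0.0"):
    """Assumed gate: filter only on versions at or above the threshold."""
    return version_tuple(pyarrow_version) >= version_tuple(threshold)


for candidate in ("21.0.0", "22.0.0", "22.0.0.dev123", "23.1.0"):
    print(candidate, should_filter_attrs(candidate))
```

In real code the string would come from `pa.__version__`, and older PyArrow versions would skip the filtering branch entirely, which limits the change to the versions that need it.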

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-54068][PYTHON] Fix PySpark feather serialization with PyArrow 22.0.0" to "[SPARK-54068][PYTHON] Fix to_feather to support PyArrow 22.0.0" on Dec 8, 2025

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @ashrithb .

Merged to master/4.1 for Apache Spark 4.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Dec 8, 2025

Closes #53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix.

Authored-by: ashrithb <ashrithlb@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4e1e995)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun

I added you to the Apache Spark contributor group and assigned SPARK-54068 to you, @ashrithb .

Welcome to the Apache Spark community and congratulations for your first commit, @ashrithb .

xu20160924 pushed a commit to xu20160924/spark that referenced this pull request Dec 9, 2025

Closes apache#53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix.

Authored-by: ashrithb <ashrithlb@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>