-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54068][PYTHON] Fix to_feather to support PyArrow 22.0.0
#53377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-54068][PYTHON] Fix to_feather to support PyArrow 22.0.0
#53377
Conversation
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ashrithb .
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…22.0.0 Filter non-serializable attrs (metrics, observed_metrics) before feather write instead of converting them to dicts, which preserves backward compatibility for code that accesses pdf.attrs directly. Also adds to_dict() methods to MetricValue and PlanMetrics for future use.
3e11929 to
418ae0d
Compare
python/pyspark/pandas/frame.py
Outdated
| # JSON serializable. We clear these attrs since they are internal | ||
| # execution metadata not needed in the output file. | ||
| pdf.attrs = {k: v for k, v in pdf.attrs.items() | ||
| if k not in ("metrics", "observed_metrics")} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observed_metrics instead of observed_metric_*?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, can we do this only for if LooseVersion(pa.__version__) < LooseVersion("22.0.0"): safely in order to avoid any regressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observed_metricsinstead ofobserved_metric_*?
Yes, the key from what I can see set in core.py, a single key that has a list of PlanObservedMetrics objects, I think the usage is right here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, can we do this only for
if LooseVersion(pa.__version__) < LooseVersion("22.0.0"):safely in order to avoid any regressions?
Hmm, yeah it may be better to err on the side of caution here even though the change is for internal metadata, I'll add this logic in then!
to_feather to support PyArrow 22.0.0
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @ashrithb .
Merged to master/4.1 for Apache Spark 4.1.0.
### What changes were proposed in this pull request? This PR fixes the `test_to_feather` test failure with PyArrow 22.0.0 by filtering non-serializable attrs (`metrics`, `observed_metrics`) before writing to feather format. **Changes:** 1. Modified `to_feather()` in `pyspark/pandas/frame.py` to filter out non-serializable attrs before passing to PyArrow 2. Removed the `unittest.skipIf` workaround from `test_to_feather` 3. Added `to_dict()` methods to `MetricValue`, `PlanMetrics`, and `PlanObservedMetrics` for future utility (not used in the fix, but useful additions) ### Why are the changes needed? PyArrow 22.0.0 changed its behavior to serialize pandas `DataFrame.attrs` to JSON metadata when writing Feather files. PySpark Spark Connect stores `PlanMetrics` and `PlanObservedMetrics` objects in `pdf.attrs`, which are not JSON serializable, causing: TypeError: Object of type PlanMetrics is not JSON serializable ### Does this PR introduce any user-facing change? No. The fix filters internal Spark metadata (`metrics`, `observed_metrics`) from attrs only when writing to feather format. Code that directly accesses `pdf.attrs["metrics"]` (like `test_observe`) continues to work with the original objects. ### How was this patch tested? - Verified that `pdf.attrs["metrics"][0].name` still works (backward compatibility) - Verified that feather write succeeds with PyArrow 22.0.0 when attrs are filtered - Removed the `unittest.skipIf` workaround so `test_to_feather` now runs on all versions - All existing tests pass including `test_observe` which accesses attrs directly - Removed the `unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068")` workaround so the test now runs on all PyArrow versions ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix. Authored-by: ashrithb <ashrithlb@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4e1e995) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
|
I added you to the Apache Spark contributor group and assigned SPARK-54068 to you, @ashrithb . Welcome to the Apache Spark community and congratulations for your first commit, @ashrithb . |
### What changes were proposed in this pull request? This PR fixes the `test_to_feather` test failure with PyArrow 22.0.0 by filtering non-serializable attrs (`metrics`, `observed_metrics`) before writing to feather format. **Changes:** 1. Modified `to_feather()` in `pyspark/pandas/frame.py` to filter out non-serializable attrs before passing to PyArrow 2. Removed the `unittest.skipIf` workaround from `test_to_feather` 3. Added `to_dict()` methods to `MetricValue`, `PlanMetrics`, and `PlanObservedMetrics` for future utility (not used in the fix, but useful additions) ### Why are the changes needed? PyArrow 22.0.0 changed its behavior to serialize pandas `DataFrame.attrs` to JSON metadata when writing Feather files. PySpark Spark Connect stores `PlanMetrics` and `PlanObservedMetrics` objects in `pdf.attrs`, which are not JSON serializable, causing: TypeError: Object of type PlanMetrics is not JSON serializable ### Does this PR introduce any user-facing change? No. The fix filters internal Spark metadata (`metrics`, `observed_metrics`) from attrs only when writing to feather format. Code that directly accesses `pdf.attrs["metrics"]` (like `test_observe`) continues to work with the original objects. ### How was this patch tested? - Verified that `pdf.attrs["metrics"][0].name` still works (backward compatibility) - Verified that feather write succeeds with PyArrow 22.0.0 when attrs are filtered - Removed the `unittest.skipIf` workaround so `test_to_feather` now runs on all versions - All existing tests pass including `test_observe` which accesses attrs directly - Removed the `unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068")` workaround so the test now runs on all PyArrow versions ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix. Authored-by: ashrithb <ashrithlb@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR fixes the
test_to_feathertest failure with PyArrow 22.0.0 by filteringnon-serializable attrs (
metrics,observed_metrics) before writing to feather format.Changes:
to_feather()inpyspark/pandas/frame.pyto filter out non-serializableattrs before passing to PyArrow
@unittest.skipIfworkaround fromtest_to_featherto_dict()methods toMetricValue,PlanMetrics, andPlanObservedMetricsfor future utility (not used in the fix, but useful additions)
Why are the changes needed?
PyArrow 22.0.0 changed its behavior to serialize pandas
DataFrame.attrsto JSONmetadata when writing Feather files. PySpark Spark Connect stores
PlanMetricsandPlanObservedMetricsobjects inpdf.attrs, which are not JSON serializable, causing: TypeError: Object of type PlanMetrics is not JSON serializableDoes this PR introduce any user-facing change?
No. The fix filters internal Spark metadata (
metrics,observed_metrics) from attrsonly when writing to feather format. Code that directly accesses
pdf.attrs["metrics"](like
test_observe) continues to work with the original objects.How was this patch tested?
pdf.attrs["metrics"][0].namestill works (backward compatibility)@unittest.skipIfworkaround sotest_to_feathernow runs on all versionstest_observewhich accesses attrs directly@unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068")workaround so the test now runs on all PyArrow versionsWas this patch authored or co-authored using generative AI tooling?
No.