[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 #24867
wrong_schema = StructType(fields)
with QuietTest(self.sc):
with self.assertRaisesRegexp(Exception, ".*cast.*[s|S]tring.*timestamp.*"):
with self.assertRaisesRegexp(Exception, "integer.*required.*got.*str"):
Removing the workaround changed this error message, and it seemed clearer for the test to swap an int field instead of a timestamp.
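As a rough illustration of the kind of assertion discussed here, the sketch below checks an error message against the new pattern. The `to_int_field` helper and its message text are hypothetical stand-ins, not the PySpark or PyArrow code path; only the regular expression comes from the diff. It uses `assertRaisesRegex`, the current name for `assertRaisesRegexp`.

```python
import unittest


def to_int_field(value):
    """Hypothetical converter standing in for an int-field conversion."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            "an integer is required (got type %s)" % type(value).__name__)
    return value


class WrongSchemaTest(unittest.TestCase):
    def test_string_in_int_field(self):
        # The pattern is searched within str(exception), so it only
        # needs to match a substring of the message.
        with self.assertRaisesRegex(Exception, "integer.*required.*got.*str"):
            to_int_field("not a number")
```

Saved to a file, this can be run with `python -m unittest`. Matching on a few stable keywords rather than the whole message keeps the test tolerant of small wording changes.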
I thought we were testing in Jenkins with Pandas 0.23.2, but from this comment #24298 (comment), it looks like 0.24.2. I think 0.24.2 might be too new to use as a minimum supported version, but it makes using 0.23.2 kind of arbitrary. What do others think is a good version to use?
0.23.2 sounds fine to me, but can we quickly discuss this on the dev mailing list?
Test build #106492 has finished for PR 24867 at commit
Probably good to keep the minimum version lower, at 0.23.2 rather than 0.24.2.
Agreed. 0.23.2 sounds good.
Nice, @BryanCutler. Could you update the pandas_udf function note together?
Can we add an item to the upgrade migration doc from 2.4 to 3.0 explicitly? For 0.19.2, we had one line for this at Upgrading From Spark SQL 2.3 to 2.4.
Thanks @dongjoon-hyun, that one is referencing the table of Pandas/PyArrow conversions, so the data would have to be rerun. @HyukjinKwon, would you be able to do this as a followup?
Yup, good idea. I'll add that.
- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.
Added a note about the minimum pyarrow version. Further down here https://github.com/apache/spark/pull/24867/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9L58 we talk about safe casting, which I think is still relevant so I won't modify it, unless it seems confusing to talk about versions < 0.12.1?
Test build #106531 has finished for PR 24867 at commit
Yup, will update it as a followup.
+1, LGTM.
Looks good to me too, but can we hold it for a few days to let more people read the discussion (since it's the weekend now)?
Merged to master.
This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.
Existing Tests
Closes apache#24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.
How was this patch tested?
Existing Tests