[SPARK-37465][PYTHON] Bump minimum pandas version to 1.0.5 #34717
Conversation
Just to start the discussion: using the query below, following [1], we can get the download stats for pandas over the last 3 months.

```sql
SELECT
  file.version AS file_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.file_downloads`
WHERE file.project = 'pandas'
  -- Only query the last 3 months of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
    AND CURRENT_DATE()
GROUP BY `file_version`
ORDER BY `num_downloads` DESC
```

Here is the top-20 data, about 77% of the overall downloads; the complete result can be found here:

[1] https://packaging.python.org/guides/analyzing-pypi-package-downloads/
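The "top 20 versions cover about 77% of downloads" figure can be reproduced from the query result with a small helper. This is an illustrative sketch only; the version labels and counts below are made up, not the actual query output:

```python
def top_n_share(downloads, n=20):
    """Return the fraction of total downloads covered by the n
    most-downloaded versions. `downloads` maps version -> count."""
    counts = sorted(downloads.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:n]) / total if total else 0.0

# Hypothetical counts, for illustration only.
stats = {"1.3.4": 500, "1.1.5": 300, "0.25.3": 150, "0.23.2": 50}
share = top_n_share(stats, n=2)  # fraction covered by the top 2 versions
```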
Test build #145645 has finished for PR 34717 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
7986d55 → e521b76 (force-push)
Test build #145671 has finished for PR 34717 at commit
Seems OK to go ahead and require the stable 1.x release
+1. cc @ueshin @BryanCutler @viirya @xinrong-databricks @itholic FYI
I noticed that Test failure
Yeah, it seems to be a bug in pandas 1.0.0.

```python
>>> pser = pd.Series([1, 2, 3, None], dtype="Int8")
>>> pser
0       1
1       2
2       3
3    <NA>
dtype: Int8
>>> ~pser
0      -2
1      -3
2      -4
3    <NA>
dtype: object  # this should've been `Int8`
```

Resolved in pandas 1.0.1.
```python
>>> pser = pd.Series([1, 2, 3, None], dtype="Int8")
>>> pser
0       1
1       2
2       3
3    <NA>
dtype: Int8
>>> ~pser
0      -2
1      -3
2      -4
3    <NA>
dtype: Int8
```

For addressing this, I'm not sure which way is better, but I think we can just go with
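Given that the bug above only affects 1.0.0, one way to enforce a floor is an import-time minimum-version check, the same pattern PySpark uses for its pandas requirement. A minimal standalone sketch of that pattern (the helper names, the message, and the simplistic `X.Y.Z` parsing are illustrative, not PySpark's actual code):

```python
MINIMUM_PANDAS_VERSION = "1.0.5"  # illustrative floor, matching this PR

def parse_version(v):
    """Parse a plain 'X.Y.Z' version string into a comparable tuple of ints.
    (Real version strings can carry rc/dev suffixes; this sketch ignores them.)"""
    return tuple(int(part) for part in v.split("."))

def require_minimum_pandas(installed, minimum=MINIMUM_PANDAS_VERSION):
    """Raise ImportError if the installed pandas version is below the floor."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"Pandas >= {minimum} must be installed; found {installed}."
        )

require_minimum_pandas("1.3.4")  # new enough: no exception raised
```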
Yeah just require 1.0.1 for this reason
Thanks @Yikun, what do you think about bumping to
Sure, thanks for your suggestion, I'd like to update. I added a simple test to install pandas v1.0.1. (Update: pandas only publishes an Ubuntu wheel after v1.2, so we have to install many deps, otherwise it fails when using.) And it looks like some test cases failed, like: Test failure
Several test cases failed (4 cases failed due to the same issue) in 1.0.1, due to: Test failure details
At this time, I prefer to update to 1.0.5; I'm going to run
There is only a precision error of Test failure details
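Floating-point precision differences like the one mentioned above are typically handled in tests with a tolerance comparison rather than exact equality. A minimal sketch of that pattern (the numbers are illustrative, not the actual failing values):

```python
import math

expected = 0.6666666666666666
actual = 0.6666666666666667  # differs only in the last few bits

# Exact equality fails on precision noise; math.isclose with a
# relative tolerance treats the two values as equal.
exact_match = expected == actual
close_match = math.isclose(expected, actual, rel_tol=1e-9)
```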
e521b76 → 7b1de6d (force-push)
As a conclusion here:
So, I bumped the minimum pandas version to v1.0.5; v1.0.5 is also the latest release of the pandas 1.0 series. Ready for review. :)
Test build #145720 has finished for PR 34717 at commit
Seems okay. One comment about the doc.
```diff
@@ -387,7 +387,7 @@ working with timestamps in ``pandas_udf``\s to get the best performance, see

 Recommended Pandas and PyArrow Versions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-For usage with pyspark.sql, the minimum supported versions of Pandas is 0.23.2 and PyArrow is 1.0.0.
+For usage with pyspark.sql, the minimum supported versions of Pandas is 1.0.5 and PyArrow is 1.0.0.
```
Should we mention there are some issues with versions like 1.0.0, 1.0.1?
How about:
"For usage with pyspark.sql, the minimum supported versions of Pandas is 1.0.5 and PyArrow is 1.0.0. Lower versions (there are known issues with v1.0.0 and v1.0.1; see more in the link) or higher versions may be used; however, compatibility and data correctness cannot be guaranteed and should be verified by the user."
Maybe we need more suggestions from a native speaker. T_T And if necessary we could do it in the next commits of this PR or a follow-up.
LGTM, I think v1.0.5 is a reasonable minimum
LGTM if remaining comments are resolved.
7b1de6d → 054905f (force-push)
Test build #145744 has finished for PR 34717 at commit
Merged to master.
What changes were proposed in this pull request?
Bump minimum pandas version to 1.0.5 (or a better version)
Why are the changes needed?
Initial discussion from SPARK-37465 and #34314 (comment) .
Does this PR introduce any user-facing change?
Yes, this PR bumps the minimum pandas version.
How was this patch tested?
PySpark test passed with pandas v1.0.5.