[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 #24867 (Closed)
4 changes: 4 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -23,6 +23,10 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
+- Since Spark 3.0, PySpark requires Pandas 0.23.2 or higher to use Pandas-related functionality, such as `toPandas`, `createDataFrame` from a Pandas DataFrame, etc.

+- Since Spark 3.0, PySpark requires PyArrow 0.12.1 or higher to use PyArrow-related functionality, such as `pandas_udf`, and `toPandas`/`createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.

@BryanCutler (Member, Author) commented on Jun 14, 2019:

Added a note about the minimum PyArrow version. Further down, at https://github.com/apache/spark/pull/24867/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9L58, we talk about safe casting, which I think is still relevant, so I won't modify it, unless it seems confusing to talk about versions < 0.12.1?
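As a hedged illustration of the Arrow-backed paths these minimums gate, the sketch below enables the config named in the note above and round-trips a DataFrame; the session setup and data are illustrative, not part of the PR:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The config from the migration note; the Arrow path needs pyarrow >= 0.12.1
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(3)
pdf = df.toPandas()               # Spark -> Pandas via Arrow (needs pandas >= 0.23.2)
df2 = spark.createDataFrame(pdf)  # Pandas -> Spark, also Arrow-accelerated here
```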

- In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` were supported by accident. In hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible; neither Hive nor Presto supports this syntax. Therefore, these queries are treated as invalid since Spark 3.0.

- Since Spark 3.0, the Dataset and DataFrame API `unionAll` is no longer deprecated. It is an alias for `union`.
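A small sketch of the `unionAll` note above, assuming the `spark` session from the earlier sketch; the data and counts are illustrative:

```python
a = spark.range(2)
b = spark.range(2)
# Since Spark 3.0, unionAll is simply an alias for union (no deduplication)
assert a.union(b).count() == 4
assert a.unionAll(b).count() == 4
```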
2 changes: 0 additions & 2 deletions python/pyspark/serializers.py
@@ -297,8 +297,6 @@ def create_array(s, t):
# Ensure timestamp series are in expected form for Spark internal representation
if t is not None and pa.types.is_timestamp(t):
    s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
-    # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
-    return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)

try:
    array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
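To make the safe-casting distinction behind the removed workaround concrete, here is a minimal standalone PyArrow sketch; the values are made up, and only `Array.from_pandas` and its `safe` flag come from the diff above:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series([1.0, 2.5])

# safe=False, as in the removed workaround, silently truncates 2.5 -> 2
print(pa.Array.from_pandas(s, type=pa.int32(), safe=False))

# safe=True (what the self._safecheck flag in the hunk above can request)
# rejects the lossy cast instead
try:
    pa.Array.from_pandas(s, type=pa.int32(), safe=True)
except pa.ArrowInvalid as err:
    print("rejected:", err)
```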
4 changes: 2 additions & 2 deletions python/pyspark/sql/tests/test_arrow.py
@@ -268,10 +268,10 @@ def test_createDataFrame_with_schema(self):
def test_createDataFrame_with_incorrect_schema(self):
    pdf = self.create_pandas_data_frame()
    fields = list(self.schema)
-    fields[0], fields[7] = fields[7], fields[0]  # swap str with timestamp
+    fields[0], fields[1] = fields[1], fields[0]  # swap str with int
    wrong_schema = StructType(fields)
    with QuietTest(self.sc):
-        with self.assertRaisesRegexp(Exception, ".*cast.*[s|S]tring.*timestamp.*"):
+        with self.assertRaisesRegexp(Exception, "integer.*required.*got.*str"):
@BryanCutler (Member, Author) commented:
Removing the workaround changed this error message, and it seemed clearer for the test to swap the int field instead of the timestamp.

            self.spark.createDataFrame(pdf, schema=wrong_schema)

def test_createDataFrame_with_names(self):
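The updated assertion relies on unittest's `assertRaisesRegexp`, which matches the exception's message against a regular expression; a self-contained sketch of that pattern follows (the raised message is invented to match the test's regex):

```python
import unittest

class RegexAssertionExample(unittest.TestCase):
    def test_error_message_matches(self):
        # assertRaisesRegexp checks the exception type and that str(exc)
        # matches the pattern (it is the older alias of assertRaisesRegex)
        with self.assertRaisesRegexp(TypeError, "integer.*required.*got.*str"):
            raise TypeError("an integer is required (got type str)")

if __name__ == "__main__":
    unittest.main()
```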
2 changes: 1 addition & 1 deletion python/pyspark/sql/utils.py
@@ -131,7 +131,7 @@ def require_minimum_pandas_version():
""" Raise ImportError if minimum version of Pandas is not installed
"""
# TODO(HyukjinKwon): Relocate and deduplicate the version specification.
minimum_pandas_version = "0.19.2"
minimum_pandas_version = "0.23.2"

from distutils.version import LooseVersion
try:
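The hunk is cut off after `try:`; purely as a sketch of how a `LooseVersion` gate like this typically completes (the error wording here is invented, not Spark's):

```python
from distutils.version import LooseVersion

minimum_pandas_version = "0.23.2"

try:
    import pandas
except ImportError:
    raise ImportError("Pandas >= %s must be installed for this feature"
                      % minimum_pandas_version)

if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
    raise ImportError("Pandas >= %s required; found %s"
                      % (minimum_pandas_version, pandas.__version__))
```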
2 changes: 1 addition & 1 deletion python/setup.py
@@ -105,7 +105,7 @@ def _supports_symlinks():
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
# binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
_minimum_pandas_version = "0.19.2"
_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "0.12.1"

try:
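For context on how pins like these are usually consumed, a hedged sketch of a setup.py extras block follows; the `sql` extra name matches what PySpark publishes, but the exact layout is an assumption since the real `setup()` call is outside the hunk:

```python
from setuptools import setup

_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "0.12.1"

setup(
    name="example-package",
    version="0.0.0",
    # Assumed layout: `pip install example-package[sql]` would then enforce
    # the same floors that pyspark/sql/utils.py checks at runtime.
    extras_require={
        "sql": [
            "pandas>=%s" % _minimum_pandas_version,
            "pyarrow>=%s" % _minimum_pyarrow_version,
        ],
    },
)
```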