Skip to content

Commit

Permalink
[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2
Browse files Browse the repository at this point in the history
## What changes were proposed in this pull request?

This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.

## How was this patch tested?

Existing Tests

Closes #24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  • Loading branch information
BryanCutler authored and HyukjinKwon committed Jun 18, 2019
1 parent 4576dfd commit 90f8039
Show file tree
Hide file tree
Showing 5 changed files with 8 additions and 6 deletions.
4 changes: 4 additions & 0 deletions docs/sql-migration-guide-upgrade.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,10 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.

- In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither Hive nor Presto support this syntax. Therefore we will treat these queries as invalid since Spark 3.0.

- Since Spark 3.0, the Dataset and DataFrame API `unionAll` is not deprecated any more. It is an alias for `union`.
Expand Down
2 changes: 0 additions & 2 deletions python/pyspark/serializers.py
Original file line number Diff line number Diff line change
Expand Up @@ -297,8 +297,6 @@ def create_array(s, t):
# Ensure timestamp series are in expected form for Spark internal representation
if t is not None and pa.types.is_timestamp(t):
s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
# TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)

try:
array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
Expand Down
4 changes: 2 additions & 2 deletions python/pyspark/sql/tests/test_arrow.py
Original file line number Diff line number Diff line change
Expand Up @@ -268,10 +268,10 @@ def test_createDataFrame_with_schema(self):
def test_createDataFrame_with_incorrect_schema(self):
pdf = self.create_pandas_data_frame()
fields = list(self.schema)
fields[0], fields[7] = fields[7], fields[0] # swap str with timestamp
fields[0], fields[1] = fields[1], fields[0] # swap str with int
wrong_schema = StructType(fields)
with QuietTest(self.sc):
with self.assertRaisesRegexp(Exception, ".*cast.*[s|S]tring.*timestamp.*"):
with self.assertRaisesRegexp(Exception, "integer.*required.*got.*str"):
self.spark.createDataFrame(pdf, schema=wrong_schema)

def test_createDataFrame_with_names(self):
Expand Down
2 changes: 1 addition & 1 deletion python/pyspark/sql/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,7 @@ def require_minimum_pandas_version():
""" Raise ImportError if minimum version of Pandas is not installed
"""
# TODO(HyukjinKwon): Relocate and deduplicate the version specification.
minimum_pandas_version = "0.19.2"
minimum_pandas_version = "0.23.2"

from distutils.version import LooseVersion
try:
Expand Down
2 changes: 1 addition & 1 deletion python/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def _supports_symlinks():
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
# binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
_minimum_pandas_version = "0.19.2"
_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "0.12.1"

try:
Expand Down

0 comments on commit 90f8039

Please sign in to comment.