
[SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds #24298

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Apr 4, 2019

What changes were proposed in this pull request?

This increases the minimum supported version of pyarrow to 0.12.1 and removes the workarounds in pyspark that kept it compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster, or an ImportError will be raised to indicate an upgrade is needed.
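The version gate described here can be sketched roughly as follows (a simplified illustration — the real check lives in pyspark.sql.utils, and the helper names below are hypothetical):

```python
# Hypothetical sketch of the minimum-version gate this PR enforces.
MIN_PYARROW_VERSION = "0.12.1"


def _version_tuple(version):
    # "0.12.1" -> (0, 12, 1); good enough for simple x.y.z comparisons.
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed, required=MIN_PYARROW_VERSION):
    # True when the installed version satisfies the minimum.
    return _version_tuple(installed) >= _version_tuple(required)


def require_minimum_pyarrow_version():
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "PyArrow >= %s must be installed; it was not found." % MIN_PYARROW_VERSION)
    if not meets_minimum(pyarrow.__version__):
        raise ImportError(
            "PyArrow >= %s must be installed; your version is %s."
            % (MIN_PYARROW_VERSION, pyarrow.__version__))
```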

How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0

@BryanCutler
Member Author

Need to check about Pandas / Numpy requirements that might go along with this.

@BryanCutler
Member Author

I think we should go with 0.12.1 because of https://issues.apache.org/jira/browse/ARROW-4582, which might otherwise cause a problem.

@BryanCutler
Member Author

@shaneknapp this passes locally for me, so let's give it a shot with the new environment when you have time, thanks! cc @HyukjinKwon @ueshin

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104301 has finished for PR 24298 at commit 5f91d98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -299,14 +300,6 @@ def create_array(s, t):
# TODO: don't need as of Arrow 0.9.1
Member

Do we still need this?

Member Author

I'll check this also

Member Author

@BryanCutler BryanCutler Apr 5, 2019

No, it's not needed. Tests passed with Python 2.7 and I manually did a string column conversion to arrow.

@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
mask = s.isnull()
# Ensure timestamp series are in expected form for Spark internal representation
# TODO: maybe don't need None check anymore as of Arrow 0.9.1
if t is not None and pa.types.is_timestamp(t):
s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
# TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
Member

Do we still need the workaround .from_pandas .. .cast here?

Member Author

I didn't want to change this since it was related to a pandas version, but I can double-check.

@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
mask = s.isnull()
Member

Do we still need to use mask?

Member Author

I'm not sure, I'll check

Member Author

Yes, it's needed to correctly insert NULL values into timestamps, since fillna(0) is applied to the series.

@BryanCutler
Member Author

BryanCutler commented Apr 5, 2019

@shaneknapp are you going to leave Pandas at version 0.19.2 or is that being upgraded as well for the python 3.6 env?

Since pyarrow 0.12.1 requires numpy >= 1.14, I'm not sure whether older versions of Pandas will work.

Conda doesn't like it; we might have to override manually:

UnsatisfiableError: The following specifications were found to be in conflict:
  - pandas=0.19.2
  - pyarrow=0.12.1 -> arrow-cpp[version='>=0.12.1,<0.13.0a0,>=0.12.1,<1.0a0'] -> numpy[version='>=1.14,<2.0a0'] -> numpy-base==1.16.2=py36hde5b4d6_0

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104331 has finished for PR 24298 at commit 057f505.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/setup.py Outdated
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure the Java version is binary compatible.
Member

can you expand on what "and ensure the Java version is binary compatible" means?

Member Author

sure, will do

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104402 has finished for PR 24298 at commit cc9a305.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler force-pushed the arrow-bump-min-pyarrow-SPARK-27276 branch from cc9a305 to 852dc0f Compare April 9, 2019 23:20
@BryanCutler
Member Author

Also, note that Arrow v0.13.0 has recently been released. There were no breaking changes, so it is still compatible, and increasing the minimum to that version wouldn't allow any further cleanup of this code. Version 0.12.1 has been pretty stable so far, and I still think it's the best choice for the minimum supported version right now.

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104459 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104756 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

From the test output, it passed all pyarrow tests on Python 3.6 with pyarrow 0.12.1, so I think this is good to go! It looks like pyarrow 0.8.0 is installed in the Python 2.7 environment, which produces "tests skipped" messages. This isn't a problem since we do Arrow testing with Python 3, so the old version of pyarrow can be uninstalled later to clean up the test output. Let's test this one more time, and I'll merge afterwards if there are no more comments.

@BryanCutler
Member Author

test this please

@shaneknapp
Contributor

shaneknapp commented Apr 19, 2019

> From the test output, it passed all pyarrow tests on Python 3.6 with pyarrow 0.12.1, so I think this is good to go! It looks like pyarrow 0.8.0 is installed in the Python 2.7 environment, which shows tests skipped messages. This isn't a problem since we do Arrow testing with Python 3, so the old version of pyarrow can be uninstalled sometime later to clean up the test output. Let's test this one more time and I'll merge later if no more comments.

EDIT: i was looking at the wrong build :)

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104761 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Update: @shaneknapp is planning to upgrade the Python 2.7 env to use Pandas 0.24.2 and Pyarrow 0.12.1 also, which will be good to verify Arrow tests pass with both Python 2 & 3. This will hopefully be done on Monday and then this PR can be merged after.

@shaneknapp
Contributor

> Update: @shaneknapp is planning to upgrade the Python 2.7 env to use Pandas 0.24.2 and Pyarrow 0.12.1 also, which will be good to verify Arrow tests pass with both Python 2 & 3. This will hopefully be done on Monday and then this PR can be merged after.

actually, i just had a thought:

all spark branches (master, 2.3, 2.4) use the same python2.7 env... will this impact the older branches negatively?

i can (and will) test to confirm.

@BryanCutler
Member Author

BryanCutler commented Apr 19, 2019

> all spark branches (master, 2.3, 2.4) use the same python2.7 env

I forgot about the branches sharing the same env; in that case we definitely don't want to upgrade pyarrow for the 2.3/2.4 branches. I think it's OK if we leave things as they are: master will just skip the Arrow tests for Python 2, and we are still running the same tests with Python 3.
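The skip behavior described here can be sketched with a unittest guard along these lines (names are illustrative — pyspark's actual test utilities differ):

```python
import unittest


def pyarrow_skip_reason(min_version="0.12.1"):
    # Return None if a new-enough pyarrow is importable, else a skip message.
    try:
        import pyarrow
    except ImportError:
        return "pyarrow is not installed"
    installed = tuple(int(p) for p in pyarrow.__version__.split(".")[:3])
    required = tuple(int(p) for p in min_version.split(".")[:3])
    if installed < required:
        return "pyarrow %s is too old (need >= %s)" % (pyarrow.__version__, min_version)
    return None


_reason = pyarrow_skip_reason()


@unittest.skipIf(_reason is not None, _reason)
class ArrowTests(unittest.TestCase):
    # With an old or missing pyarrow, the whole class is reported as skipped
    # rather than failing, which is exactly the noise seen in the test output.
    def test_placeholder(self):
        self.assertTrue(True)
```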

@shaneknapp
Contributor

> I forgot about the branches sharing the same env, in that case we definitely don't want to upgrade pyarrow for the branches. I think it's ok if we leave as is, then master will just skip arrow tests for Python 2 and we are still running the same tests with Python 3.

sgtm++

@HyukjinKwon
Member

Merged to master.

@BryanCutler
Member Author

Thanks all! I'll keep an eye out on Jenkins and make sure this is running ok.

@BryanCutler BryanCutler deleted the arrow-bump-min-pyarrow-SPARK-27276 branch April 22, 2019 17:29
@shaneknapp
Contributor

this build was triggered by the merge:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5764

all the python tests passed, but holy CRAP the "skipped tests" output is super verbose and could use a for-serious refactor.

rshkv pushed a commit to rshkv/spark that referenced this pull request Feb 27, 2020
…2.1 and remove prior workarounds

## What changes were proposed in this pull request?

This increases the minimum support version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed.

## How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0

Closes apache#24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request Apr 8, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request May 1, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request May 17, 2020
rshkv added a commit to palantir/spark that referenced this pull request May 18, 2020
* [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds


* Fix pandas infer_dtype warning

* [SPARK-27276][PYTHON][DOCS][FOLLOW-UP] Update documentation about Arrow version in PySpark as well

## What changes were proposed in this pull request?

Looks like updating the documentation from 0.8.0 to 0.12.1 was missed.

## How was this patch tested?

N/A

Closes apache#24504 from HyukjinKwon/SPARK-27276-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>

Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request May 21, 2020