
[SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds #24298

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Apr 4, 2019

What changes were proposed in this pull request?

This increases the minimum supported version of pyarrow to 0.12.1 and removes the workarounds in pyspark that kept it compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster, or an ImportError will be raised to indicate an upgrade is needed.
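The version gate described here can be sketched roughly as follows (a simplified illustration — the real check lives in pyspark.sql.utils, and the helper names below are hypothetical):

```python
# Hypothetical sketch of the minimum-version gate this PR enforces.
MIN_PYARROW_VERSION = "0.12.1"


def _version_tuple(version):
    # "0.12.1" -> (0, 12, 1); good enough for simple x.y.z comparisons.
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed, required=MIN_PYARROW_VERSION):
    # True when the installed version satisfies the minimum.
    return _version_tuple(installed) >= _version_tuple(required)


def require_minimum_pyarrow_version():
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "PyArrow >= %s must be installed; it was not found." % MIN_PYARROW_VERSION)
    if not meets_minimum(pyarrow.__version__):
        raise ImportError(
            "PyArrow >= %s must be installed; your version is %s."
            % (MIN_PYARROW_VERSION, pyarrow.__version__))
```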

How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0

@BryanCutler
Member Author

Need to check about Pandas / Numpy requirements that might go along with this.

@BryanCutler
Member Author

I think we should go with 0.12.1 because of https://issues.apache.org/jira/browse/ARROW-4582, which might otherwise cause a problem.

@BryanCutler
Member Author

@shaneknapp this passes locally for me, so let's give it a shot with the new environment when you have time, thanks! cc @HyukjinKwon @ueshin

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104301 has finished for PR 24298 at commit 5f91d98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -299,14 +300,6 @@ def create_array(s, t):
# TODO: don't need as of Arrow 0.9.1
Member

Do we still need this?

Member Author

I'll check this also

Member Author

@BryanCutler BryanCutler Apr 5, 2019

No, it's not needed. Tests passed with Python 2.7 and I manually did a string column conversion to arrow.

@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
mask = s.isnull()
# Ensure timestamp series are in expected form for Spark internal representation
# TODO: maybe don't need None check anymore as of Arrow 0.9.1
if t is not None and pa.types.is_timestamp(t):
s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
# TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
Member

Do we still need the workaround .from_pandas .. .cast here?

Member Author

I didn't want to change this since it was related to a pandas version, but I can double-check.

@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
mask = s.isnull()
Member

Do we still need to use mask?

Member Author

I'm not sure, I'll check

Member Author

Yes, it's needed to correctly insert NULL values into timestamps, since fillna(0) is applied to the series.

@BryanCutler
Member Author

BryanCutler commented Apr 5, 2019

@shaneknapp are you going to leave Pandas at version 0.19.2 or is that being upgraded as well for the python 3.6 env?

Since pyarrow 0.12.1 requires numpy >= 1.14, I'm not sure whether older versions of Pandas will work.

Conda doesn't like it; we might have to override manually:

UnsatisfiableError: The following specifications were found to be in conflict:
  - pandas=0.19.2
  - pyarrow=0.12.1 -> arrow-cpp[version='>=0.12.1,<0.13.0a0,>=0.12.1,<1.0a0'] -> numpy[version='>=1.14,<2.0a0'] -> numpy-base==1.16.2=py36hde5b4d6_0

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104331 has finished for PR 24298 at commit 057f505.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/setup.py Outdated
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure the Java version is binary compatible.
Member

can you expand on what "and ensure the Java version is binary compatible" means?

Member Author

sure, will do

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104402 has finished for PR 24298 at commit cc9a305.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler force-pushed the arrow-bump-min-pyarrow-SPARK-27276 branch from cc9a305 to 852dc0f Compare April 9, 2019 23:20
@BryanCutler
Member Author

Also, note that Arrow v0.13.0 has recently been released. There were no breaking changes, so it is still compatible, and increasing the minimum to that version wouldn't allow any further cleanup of this code. Version 0.12.1 has been pretty stable so far, and I still think it's the best choice for the minimum supported version right now.

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104459 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104756 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

From the test output, it passed all pyarrow tests on Python 3.6 with pyarrow 0.12.1, so I think this is good to go! It looks like pyarrow 0.8.0 is installed in the Python 2.7 environment, which produces "tests skipped" messages. This isn't a problem since we do Arrow testing with Python 3, so the old version of pyarrow can be uninstalled later to clean up the test output. Let's test this one more time, and I'll merge afterwards if there are no more comments.

@BryanCutler
Member Author

test this please

@shaneknapp
Contributor

shaneknapp commented Apr 19, 2019

> From the test output, it passed all pyarrow tests on Python 3.6 with pyarrow 0.12.1, so I think this is good to go! It looks like pyarrow 0.8.0 is installed in the Python 2.7 environment, which shows tests skipped messages. This isn't a problem since we do Arrow testing with Python 3, so the old version of pyarrow can be uninstalled sometime later to clean up the test output. Let's test this one more time and I'll merge later if no more comments.

EDIT: i was looking at the wrong build :)

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104761 has finished for PR 24298 at commit 852dc0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Update: @shaneknapp is planning to upgrade the Python 2.7 env to use Pandas 0.24.2 and Pyarrow 0.12.1 also, which will be good to verify Arrow tests pass with both Python 2 & 3. This will hopefully be done on Monday and then this PR can be merged after.

@shaneknapp
Contributor

> Update: @shaneknapp is planning to upgrade the Python 2.7 env to use Pandas 0.24.2 and Pyarrow 0.12.1 also, which will be good to verify Arrow tests pass with both Python 2 & 3. This will hopefully be done on Monday and then this PR can be merged after.

actually, i just had a thought:

all spark branches (master, 2.3, 2.4) use the same python2.7 env... will this impact the older branches negatively?

i can (and will) test to confirm.

@BryanCutler
Member Author

BryanCutler commented Apr 19, 2019

> all spark branches (master, 2.3, 2.4) use the same python2.7 env

I forgot about the branches sharing the same env; in that case we definitely don't want to upgrade pyarrow for the 2.3/2.4 branches. I think it's OK if we leave things as they are: master will just skip the Arrow tests for Python 2, and we are still running the same tests with Python 3.
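The skip behavior described here can be sketched with a unittest guard along these lines (names are illustrative — pyspark's actual test utilities differ):

```python
import unittest


def pyarrow_skip_reason(min_version="0.12.1"):
    # Return None if a new-enough pyarrow is importable, else a skip message.
    try:
        import pyarrow
    except ImportError:
        return "pyarrow is not installed"
    installed = tuple(int(p) for p in pyarrow.__version__.split(".")[:3])
    required = tuple(int(p) for p in min_version.split(".")[:3])
    if installed < required:
        return "pyarrow %s is too old (need >= %s)" % (pyarrow.__version__, min_version)
    return None


_reason = pyarrow_skip_reason()


@unittest.skipIf(_reason is not None, _reason)
class ArrowTests(unittest.TestCase):
    # With an old or missing pyarrow, the whole class is reported as skipped
    # rather than failing, which is exactly the noise seen in the test output.
    def test_placeholder(self):
        self.assertTrue(True)
```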

@shaneknapp
Contributor

> I forgot about the branches sharing the same env, in that case we definitely don't want to upgrade pyarrow for the branches. I think it's ok if we leave as is, then master will just skip arrow tests for Python 2 and we are still running the same tests with Python 3.

sgtm++

@HyukjinKwon
Member

Merged to master.

@BryanCutler
Member Author

Thanks all! I'll keep an eye out on Jenkins and make sure this is running ok.

@BryanCutler BryanCutler deleted the arrow-bump-min-pyarrow-SPARK-27276 branch April 22, 2019 17:29
@shaneknapp
Contributor

this build was triggered by the merge:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5764

all the python tests passed, but holy CRAP the "skipped tests" output is super verbose and could use a for-serious refactor.

rshkv pushed a commit to rshkv/spark that referenced this pull request Feb 27, 2020
…2.1 and remove prior workarounds

## What changes were proposed in this pull request?

This increases the minimum support version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed.

## How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0

Closes apache#24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request Apr 8, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request May 1, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request May 17, 2020
rshkv added a commit to palantir/spark that referenced this pull request May 18, 2020
* [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds


* Fix pandas infer_dtype warning

* [SPARK-27276][PYTHON][DOCS][FOLLOW-UP] Update documentation about Arrow version in PySpark as well

## What changes were proposed in this pull request?

Looks like updating the documentation from 0.8.0 to 0.12.1 was missed.

## How was this patch tested?

N/A

Closes apache#24504 from HyukjinKwon/SPARK-27276-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>

Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request May 21, 2020