[SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds #24298
Conversation
Need to check about Pandas / NumPy requirements that might go along with this.
I think we should go with 0.12.1 because of https://issues.apache.org/jira/browse/ARROW-4582, which might cause a problem.
@shaneknapp this passes locally for me, so let's give it a shot with the new environment when you have time, thanks! cc @HyukjinKwon @ueshin
Test build #104301 has finished for PR 24298 at commit
python/pyspark/serializers.py
Outdated
@@ -299,14 +300,6 @@ def create_array(s, t):
# TODO: don't need as of Arrow 0.9.1
Do we still need this?
I'll check this also
No, it's not needed. Tests passed with Python 2.7 and I manually did a string column conversion to arrow.
@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
    mask = s.isnull()
    # Ensure timestamp series are in expected form for Spark internal representation
    # TODO: maybe don't need None check anymore as of Arrow 0.9.1
    if t is not None and pa.types.is_timestamp(t):
        s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
    # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
Do we still need the `.from_pandas ... .cast` workaround here?
I didn't want to change this since it was related to a pandas version, I can double-check though
@@ -289,7 +291,6 @@ def _create_batch(self, series):
def create_array(s, t):
    mask = s.isnull()
Do we still need to use `mask`?
I'm not sure, I'll check
Yes, it's needed to correctly insert NULL values for timestamps, since there is a `fillna(0)` done on the series.
@shaneknapp are you going to leave Pandas at version 0.19.2, or is that being upgraded as well for the Python 3.6 env? Since pyarrow 0.12.1 requires numpy >= 1.14, I'm not sure if older versions of Pandas will work. Conda doesn't like it; we may have to manually override.
Test build #104331 has finished for PR 24298 at commit
python/setup.py
Outdated
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure the Java version is binary compatible.
Can you expand on what `and ensure the Java version is binary compatible` means?
sure, will do
Test build #104402 has finished for PR 24298 at commit
Force-pushed from cc9a305 to 852dc0f.
Also, note that Arrow v0.13.0 has recently been released. There were no breaking changes, so it is still compatible, and increasing the minimum to that version wouldn't clean up this code any further. Version 0.12.1 has been pretty stable so far and I still think it's the best choice for a minimum supported version right now.
Test build #104459 has finished for PR 24298 at commit
test this please
Test build #104756 has finished for PR 24298 at commit
From the test output, it passed all pyarrow tests on Python 3.6 with pyarrow 0.12.1, so I think this is good to go! It looks like pyarrow 0.8.0 is installed in the Python 2.7 environment, which produces skipped-test messages. This isn't a problem since we do Arrow testing with Python 3; the old version of pyarrow can be uninstalled sometime later to clean up the test output. Let's test this one more time and I'll merge later if there are no more comments.
test this please
EDIT: i was looking at the wrong build :)
Test build #104761 has finished for PR 24298 at commit
Update: @shaneknapp is planning to upgrade the Python 2.7 env to use pandas 0.24.2 and pyarrow 0.12.1 as well, which will be good for verifying that Arrow tests pass with both Python 2 and 3. This will hopefully be done on Monday, and then this PR can be merged.
actually, i just had a thought: all spark branches (master, 2.3, 2.4) use the same python2.7 env... will this impact the older branches negatively? i can (and will) test to confirm.
I forgot about the branches sharing the same env; in that case we definitely don't want to upgrade pyarrow for the 2.3/2.4 branches. I think it's ok if we leave it as is: master will just skip Arrow tests for Python 2, and we'll still run the same tests with Python 3.
sgtm++
Merged to master.
Thanks all! I'll keep an eye on Jenkins and make sure this is running ok.
this build was triggered by the merge: all the python tests passed, but holy CRAP the "skipped tests" output is super verbose and could use a for-serious refactor.
…2.1 and remove prior workarounds

## What changes were proposed in this pull request?

This increases the minimum supported version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed.

## How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0

Closes apache#24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds
* Fix pandas infer_dtype warning
* [SPARK-27276][PYTHON][DOCS][FOLLOW-UP] Update documentation about Arrow version in PySpark as well: updating the documentation from 0.8.0 to 0.12.1 was missed.

Closes apache#24504 from HyukjinKwon/SPARK-27276-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

This increases the minimum supported version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed.

## How was this patch tested?

Existing tests using:
Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2
Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0