-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals #24306
[SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals #24306
Conversation
…assert_frame_equal
I went ahead and did this before #24298 just in case we end up testing with a newer version of Pandas that also causes the same error as above. cc @HyukjinKwon @ueshin |
""" | ||
import sys | ||
if sys.version < '3': | ||
pd_assert_frame_equal(left, right, check_column_type=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the only surprise, and it's because calling DataFrame.apply
with kwargs infers the column names as str for Python 2, and causes DataFrame.columns.inferred_type to be 'mixed' and the assert to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we let all tests using this wrapped version? I think it's better if we use a consistent version of assert_frame_equal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, the problem only comes up in these tests because of the way they call assign
. How about I remove this function wrapping and just make the option conditional for Python version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Sounds good. Thanks.
Test build #104333 has finished for PR 24306 at commit
|
retest this please |
Test build #104336 has finished for PR 24306 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good pending tests.
Seems the test failure is related with the Python version upgrade .. |
Yes. It does. |
Retest this please |
Test build #104346 has finished for PR 24306 at commit
|
Test build #104403 has finished for PR 24306 at commit
|
1361382
to
7cdcb9f
Compare
Test build #104404 has finished for PR 24306 at commit
|
retest this please |
Test build #104405 has finished for PR 24306 at commit
|
Test build #104406 has finished for PR 24306 at commit
|
Test build #104407 has finished for PR 24306 at commit
|
224063c
to
26eaa1a
Compare
Test build #104411 has finished for PR 24306 at commit
|
import pyarrow as pa | ||
|
||
|
||
def assert_frame_equal(left, right): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more for my info, but what happens here if you dont' have pandas? how can this method work?
If you do have pandas elsewhere, do the other imports shadow this definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would fail if you don't have pandas with an error that it doesn't know what pd_assert_frame_equal
is, since it's only imported above, conditional on pandas being imported already. This is only seen by this file and wouldn't change anything elsewhere.
I'm going to remove this though, and only define check_column_type
here. That should make it clearer I think.
Test build #104438 has finished for PR 24306 at commit
|
retest this please |
Test build #104445 has finished for PR 24306 at commit
|
retest this please
…On Tue, Apr 9, 2019, 10:58 AM UCB AMPLab ***@***.***> wrote:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/104445/
Test FAILed.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#24306 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AEUwdck9Td1Sl031HyZIaXBireWmqjJjks5vfNTAgaJpZM4cf1Iv>
.
|
retest this please |
Test build #104454 has finished for PR 24306 at commit
|
Merged to master. |
…Pandas assert_frame_equals ## What changes were proposed in this pull request? Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this. This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs. Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times. ## How was this patch tested? Existing tests Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…Pandas assert_frame_equals ## What changes were proposed in this pull request? Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this. This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs. Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times. ## How was this patch tested? Existing tests Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…Pandas assert_frame_equals ## What changes were proposed in this pull request? Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this. This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs. Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times. ## How was this patch tested? Existing tests Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…Pandas assert_frame_equals ## What changes were proposed in this pull request? Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this. This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs. Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times. ## How was this patch tested? Existing tests Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Running PySpark tests with Pandas 0.24.x causes a failure in
test_pandas_udf_grouped_map
test_supported_types:ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is because a column is an ArrayType and the method
sqlutils ReusedSQLTestCase.assertPandasEqual
does not properly check this.This PR removes
assertPandasEqual
and replaces it with the built-inpandas.util.testing.assert_frame_equal
which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.
How was this patch tested?
Existing tests