[SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals #24306

BryanCutler · 2019-04-05T22:32:37Z

What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in test_pandas_udf_grouped_map test_supported_types:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This happens because one of the columns is an ArrayType, and the method ReusedSQLTestCase.assertPandasEqual in sqlutils does not compare such columns correctly.

This PR removes assertPandasEqual and replaces it with the built-in pandas.util.testing.assert_frame_equal, which properly handles columns of ArrayType and also prints a better diff between the DataFrames when an error occurs.
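To illustrate the failure mode described above, here is a minimal sketch, not the code from this PR. It assumes a recent pandas where the function is importable as pandas.testing.assert_frame_equal (the PR itself uses pandas.util.testing), and the frames and helper names are made up for the example. Using the elementwise comparison of two arrays as a boolean raises the same kind of ValueError, while assert_frame_equal compares array-valued cells element by element.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def ambiguous_truth_value_message():
    """Return the error message raised when an elementwise array comparison is used as a bool."""
    try:
        bool(np.array([1, 2]) == np.array([1, 2]))
    except ValueError as e:
        return str(e)
    return None


def frames_match(left, right):
    """Return True if assert_frame_equal accepts the two frames, False otherwise."""
    try:
        assert_frame_equal(left, right)
    except AssertionError:
        return False
    return True


# Hypothetical frames whose "arr" column holds arrays, similar to an ArrayType column.
expected = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([3, 4])]})
same = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([3, 4])]})
different = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([3, 5])]})

print(ambiguous_truth_value_message())
print(frames_match(expected, same))
print(frames_match(expected, different))
```

When the frames differ, the AssertionError raised by assert_frame_equal names the column that does not match, which is the better diff mentioned above.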

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

How was this patch tested?

Existing tests


BryanCutler · 2019-04-05T22:35:26Z

I went ahead and did this before #24298 just in case we end up testing with a newer version of Pandas that also causes the same error as above. cc @HyukjinKwon @ueshin

BryanCutler · 2019-04-05T22:38:42Z

python/pyspark/sql/tests/test_pandas_udf_grouped_map.py

+    """
+    import sys
+    if sys.version < '3':
+        pd_assert_frame_equal(left, right, check_column_type=False)


This was the only surprise. It happens because calling DataFrame.apply with kwargs infers the column names as str on Python 2, which causes DataFrame.columns.inferred_type to be 'mixed' and the assert to fail.

Should we have all tests use this wrapped version? I think it's better if we use a consistent version of assert_frame_equal.

I don't know, the problem only comes up in these tests because of the way they call assign. How about I remove this function wrapping and just make the option conditional for Python version?

Ok. Sounds good. Thanks.

SparkQA · 2019-04-05T22:54:02Z

Test build #104333 has finished for PR 24306 at commit 26eaa1a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2019-04-05T23:18:20Z

retest this please

SparkQA · 2019-04-05T23:37:26Z

Test build #104336 has finished for PR 24306 at commit 26eaa1a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

Looks good, pending tests.

HyukjinKwon · 2019-04-06T13:11:17Z

It seems the test failure is related to the Python version upgrade.

dongjoon-hyun · 2019-04-06T18:37:21Z

Yes. It does.

dongjoon-hyun · 2019-04-06T18:37:27Z

Retest this please

SparkQA · 2019-04-06T18:57:33Z

Test build #104346 has finished for PR 24306 at commit 26eaa1a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T19:46:43Z

Test build #104403 has finished for PR 24306 at commit 1361382.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T20:44:14Z

Test build #104404 has finished for PR 24306 at commit 7cdcb9f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-04-08T20:49:09Z

retest this please

SparkQA · 2019-04-08T21:05:02Z

Test build #104405 has finished for PR 24306 at commit 7cdcb9f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T21:27:17Z

Test build #104406 has finished for PR 24306 at commit 1c5452a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T22:35:20Z

Test build #104407 has finished for PR 24306 at commit 224063c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T23:34:09Z

Test build #104411 has finished for PR 24306 at commit 26eaa1a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-04-09T14:23:22Z

python/pyspark/sql/tests/test_pandas_udf_grouped_map.py

+    import pyarrow as pa
+
+
+def assert_frame_equal(left, right):


This is more for my info, but what happens here if you don't have pandas? How can this method work?
If you do have pandas elsewhere, do the other imports shadow this definition?

This would fail if you don't have pandas with an error that it doesn't know what pd_assert_frame_equal is, since it's only imported above, conditional on pandas being imported already. This is only seen by this file and wouldn't change anything elsewhere.

I'm going to remove this though, and only define check_column_type here. That should make it clearer I think.
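For readers with the same question, here is a rough sketch of how a test module can guard an optional pandas import so that the tests are skipped instead of failing with a NameError. The names `have_pandas`, `FrameEqualityTest`, and `run_suite` are assumptions made for this example and are not taken from the diff.

```python
import unittest

try:
    import pandas as pd
    from pandas.testing import assert_frame_equal
    have_pandas = True
except ImportError:
    # Without pandas, the names above are never bound, so the tests must not run.
    have_pandas = False


@unittest.skipIf(not have_pandas, "pandas is not installed")
class FrameEqualityTest(unittest.TestCase):
    def test_equal_frames(self):
        left = pd.DataFrame({"id": [1, 2]})
        right = pd.DataFrame({"id": [1, 2]})
        assert_frame_equal(left, right)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FrameEqualityTest)
    return unittest.TextTestRunner().run(suite)
```

The import only binds names inside this module, so it does not change what other test files see.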


SparkQA · 2019-04-09T17:21:21Z

Test build #104438 has finished for PR 24306 at commit 8dd7202.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2019-04-09T17:28:54Z

retest this please

SparkQA · 2019-04-09T17:56:05Z

Test build #104445 has finished for PR 24306 at commit 8dd7202.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2019-04-09T18:02:07Z

retest this please


BryanCutler · 2019-04-09T21:39:51Z

retest this please

SparkQA · 2019-04-09T22:18:51Z

Test build #104454 has finished for PR 24306 at commit 8dd7202.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-04-09T22:49:41Z

Merged to master.

Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Replaced testing sqlutils.assertPandasEqual with pandas.util.testing.assert_frame_equal (26eaa1a)

HyukjinKwon approved these changes Apr 6, 2019


BryanCutler force-pushed the python-pandas-assert_frame_equal-SPARK-27387 branch from 1361382 to 7cdcb9f Compare April 8, 2019 20:23

BryanCutler force-pushed the python-pandas-assert_frame_equal-SPARK-27387 branch from 224063c to 26eaa1a Compare April 8, 2019 22:59

viirya approved these changes Apr 9, 2019


srowen reviewed Apr 9, 2019


Removed wrapping function and use option conditional on py version instead (8dd7202)

HyukjinKwon closed this in f62f44f Apr 9, 2019

BryanCutler mentioned this pull request Apr 11, 2019

[SPARK-25079][python] update python3 executable to 3.6.x #24266

Closed

rshkv mentioned this pull request Feb 28, 2020

Bump pyarrow to 0.12.1 palantir/spark#649

Merged

rshkv mentioned this pull request May 1, 2020

Upgrade Docker image and pyarrow in it palantir/spark#677

Merged
