Conversation

@EnricoMi
Contributor

@EnricoMi EnricoMi commented Feb 9, 2023

What changes were proposed in this pull request?

Similar to #38223, improve the error messages when a Python method provided to DataFrame.mapInPandas returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```
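The name check behind this error can be sketched in plain pandas (an illustrative helper, not Spark's actual implementation; the function name and message wording are only modelled on the PR):

```python
import pandas as pd

def check_column_names(pdf, expected):
    # Compare the returned frame's columns against the expected schema and
    # name every mismatching column in the error, as the PR's message does.
    missing = [c for c in expected if c not in pdf.columns]
    unexpected = [c for c in pdf.columns if c not in expected]
    if missing or unexpected:
        raise RuntimeError(
            "Column names of the returned pandas.DataFrame do not match "
            "specified schema. "
            f"Missing: {' '.join(missing)}  Unexpected: {' '.join(unexpected)}"
        )

pdf = pd.DataFrame({"id": [0, 1], "v": [0, 1]})
check_column_names(pdf, ["id", "v"])    # matching schema: passes silently
```

Naming both the missing and the unexpected columns lets users spot a simple rename mistake (`v` vs `val`) without digging through a worker stack trace.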

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```
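Both checks can be sketched together: verify that the returned object is iterable, then verify each yielded element lazily, since the user function's output is a stream of batches (a minimal illustration with a hypothetical name; the real check lives in PySpark's worker code):

```python
import pandas as pd
from collections.abc import Iterable

def verify_pandas_result(result):
    # Sketch: lists and generators count as "iterator of ..." here, plain
    # ints do not; element types are checked lazily while batches stream.
    if not isinstance(result, Iterable):
        raise TypeError(
            "Return type of the user-defined function should be "
            f"iterator of pandas.DataFrame, but is {type(result)}"
        )
    for elem in result:
        if not isinstance(elem, pd.DataFrame):
            raise TypeError(
                "Return type of the user-defined function should be "
                f"iterator of pandas.DataFrame, but is iterator of {type(elem)}"
            )
        yield elem
```

Validating lazily matters: the batches cannot be materialized up front without buffering the whole partition, so the element check has to run as each batch is consumed.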

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
```

```Python
with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

Why are the changes needed?

Existing errors are generic (KeyError) or meaningless ('int' object is not iterable). The errors should help users spot the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked while processing the DataFrame, so hitting such errors is very expensive. Therefore, they should be expressive.

Does this PR introduce any user-facing change?

This only changes error messages, not behaviour.

How was this patch tested?

Tests all cases of schema mismatch for DataFrame.mapInPandas.

@EnricoMi
Contributor Author

EnricoMi commented Feb 9, 2023

@HyukjinKwon this is a follow-up to #38223

@EnricoMi
Contributor Author

@HyukjinKwon @cloud-fan would you say Dataset.mapInPandas should be on par with the improved error messages of Dataset.groupby(...).applyInPandas in the same Spark release (that would be 3.4.0)?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 562ba0b to 1f65f7e Compare February 28, 2023 11:04
@EnricoMi
Contributor Author

CC @cloud-fan @itholic @zhengruifeng

@EnricoMi
Contributor Author

CC @gatorsmile @xinrong-meng

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from e4427f8 to b8994c2 Compare May 24, 2023 09:09
Member

@MaxGekk MaxGekk left a comment

@HyukjinKwon @ueshin @itholic Could you have a look at the PR?

@itholic
Contributor

itholic commented Jun 29, 2023

Could you rebase this PR onto master? It seems there are some conflicts between master and your branch.
https://github.com/G-Research/spark/runs/13927060744

```
From https://github.com/G-Research/spark
 * branch                  branch-pyspark-map-in-pandas-schema-mismatch -> FETCH_HEAD
Auto-merging python/pyspark/pandas/frame.py
Auto-merging python/pyspark/sql/pandas/serializers.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_map.py
CONFLICT (content): Merge conflict in python/pyspark/sql/tests/pandas/test_pandas_map.py
Auto-merging python/pyspark/sql/tests/test_arrow_map.py
Squash commit -- not updating HEAD
Automatic merge failed; fix conflicts and then commit the result.
Error: Process completed with exit code 1.
```

Contributor

Can we raise PySparkTypeError instead of TypeError?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 524afe8 to 8fb5496 Compare June 30, 2023 09:32
@HyukjinKwon
Member

@xinrong-meng I think you should take a look at this.

@xinrong-meng
Member

xinrong-meng commented Jul 5, 2023

Thanks @EnricoMi !
I would suggest creating a separate def wrap_.. for PythonEvalType.SQL_MAP_ARROW_ITER_UDF instead of introducing a new parameter is_arrow_iter to wrap_batch_iter_udf.
That maintains logical consistency with the other wrap_ functions and promotes a modular design.
My point is subject to debate.

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 8fb5496 to 99bd1f2 Compare July 10, 2023 09:35
@EnricoMi
Contributor Author

@xinrong-meng split wrap_batch_iter_udf into wrap_pandas_batch_iter_udf and wrap_arrow_batch_iter_udf: 725c3af
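The split can be sketched as follows. All bodies here are hypothetical: the real wrappers live in python/pyspark/worker.py and do considerably more (schema verification, serialization); this only illustrates the one-wrapper-per-eval-type design the review asked for:

```python
import pandas as pd
from collections.abc import Iterable

def _verify_batches(result, batch_type):
    # Shared helper (illustrative): check the result is iterable and that
    # every yielded batch has the expected type, lazily while streaming.
    if not isinstance(result, Iterable):
        raise TypeError(
            f"should be iterator of {batch_type.__name__}, but is {type(result)}"
        )
    for elem in result:
        if not isinstance(elem, batch_type):
            raise TypeError(
                f"should be iterator of {batch_type.__name__}, "
                f"but is iterator of {type(elem)}"
            )
        yield elem

def wrap_pandas_batch_iter_udf(f):
    # For mapInPandas-style UDFs: batches are pandas DataFrames.
    return lambda it: _verify_batches(f(it), pd.DataFrame)

# wrap_arrow_batch_iter_udf would do the same for
# PythonEvalType.SQL_MAP_ARROW_ITER_UDF with pyarrow.RecordBatch batches.
```

Dedicated wrappers keep each eval type's checks in one place, instead of threading an `is_arrow_iter` flag through shared code.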

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 09a6a71 to 393226a Compare July 10, 2023 18:52
@xinrong-meng
Member

The refactoring is neat and clean! Would you fix the CI test failure?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 393226a to 3145854 Compare July 11, 2023 06:16
@EnricoMi
Contributor Author

Not sure how to fix the Python code generation check: https://github.com/G-Research/spark/actions/runs/5516480294/jobs/10057925480#step:18:101

@xinrong-meng
Member

Would you try the command "dev/connect-gen-protos.sh"?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 3145854 to 7328294 Compare July 12, 2023 14:57
@EnricoMi
Contributor Author

Running dev/connect-gen-protos.sh showed the same error. Rebasing with latest master fixed the issue.

@xinrong-meng xinrong-meng self-requested a review July 14, 2023 03:16
@xinrong-meng
Member

The last commit seems to fail the tests. Would you fix it?

@EnricoMi
Contributor Author

All green, all done.

@xinrong-meng
Member

Merged to master, thanks!

@allisonwang-db
Contributor

allisonwang-db commented Aug 2, 2023

@xinrong-meng @EnricoMi should we also merge this in branch-3.5?

@HyukjinKwon
Member

I am fine with merging it to 3.5.

@EnricoMi
Contributor Author

EnricoMi commented Aug 3, 2023

Yes, please!

@EnricoMi
Contributor Author

EnricoMi commented Aug 3, 2023

Merge PR for branch 3.5 in #42316.

HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2023
…InPandas for schema mismatch

### What changes were proposed in this pull request?
This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

### Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.

### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>