[SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch #42316

EnricoMi · 2023-08-03T07:27:14Z

What changes were proposed in this pull request?

This merges #39952 into the 3.5 branch.

Similar to #38223, this improves the error messages raised when a Python function provided to DataFrame.mapInPandas returns a pandas DataFrame that does not match the expected schema.

With

from pyspark.sql.functions import col

df = spark.range(2).withColumn("v", col("id"))

Mismatching column names:

df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
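
The name check behind the new message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `check_column_names` is a hypothetical helper, it works on plain lists of names so that it runs without Spark or pandas, and it assumes that either a missing or an unexpected column is an error.

```python
from typing import Iterable, List


def check_column_names(actual: Iterable[str], expected: Iterable[str]) -> None:
    """Raise RuntimeError naming the missing and unexpected columns.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    actual_list: List[str] = list(actual)
    expected_list: List[str] = list(expected)
    # Columns the schema declares but the returned frame lacks.
    missing = sorted(set(expected_list) - set(actual_list))
    # Columns the returned frame has but the schema does not declare.
    unexpected = sorted(set(actual_list) - set(expected_list))
    if missing or unexpected:
        details = []
        if missing:
            details.append("Missing: " + ", ".join(missing))
        if unexpected:
            details.append("Unexpected: " + ", ".join(unexpected))
        raise RuntimeError(
            "Column names of the returned pandas.DataFrame do not match "
            "specified schema. " + "  ".join(details)
        )
```

Calling `check_column_names(["id", "v"], ["id", "val"])` raises a RuntimeError that names `val` as missing and `v` as unexpected, which is the information the old KeyError did not give.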

Python function not returning iterator:

df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>

Python function not returning iterator of pandas.DataFrame:

df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
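
The two return-type checks above (is the result iterable at all, and is every element of the expected type) could look roughly like this. It is a sketch under assumptions: `verify_iterator_of` is a hypothetical name, and `expected_type` stands in for pandas.DataFrame so the example runs without pandas.

```python
from typing import Any, Iterator


def verify_iterator_of(result: Any, expected_type: type) -> Iterator[Any]:
    """Yield the elements of result, raising TypeError on a mismatch.

    Hypothetical helper for illustration only; not part of PySpark.
    Being a generator, it performs its checks lazily, on first iteration.
    """
    try:
        iterator = iter(result)
    except TypeError:
        # The function returned something that cannot be iterated at all.
        raise TypeError(
            "Return type of the user-defined function should be iterator of "
            f"{expected_type.__name__}, but is {type(result)}"
        ) from None
    for element in iterator:
        # Check the actual element type rather than duck-typing on attributes.
        if not isinstance(element, expected_type):
            raise TypeError(
                "Return type of the user-defined function should be iterator of "
                f"{expected_type.__name__}, but is iterator of {type(element)}"
            )
        yield element
```

With `dict` as the stand-in type, `list(verify_iterator_of(1, dict))` reports `<class 'int'>`, while `list(verify_iterator_of([1], dict))` reports `iterator of <class 'int'>`, mirroring the distinction between the two cases above.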

Mismatching types (ValueError and TypeError):

df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
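
The "direct cause" line in the messages above is what Python prints for an exception raised with `raise ... from`. A minimal sketch of that pattern, assuming a hypothetical `to_double_column` helper that uses the built-in `float` as a stand-in for the Arrow conversion:

```python
from typing import List, Sequence


def to_double_column(name: str, values: Sequence[object]) -> List[float]:
    """Convert values to floats, naming the column if conversion fails.

    Hypothetical helper for illustration only; float() stands in for the
    pandas-to-Arrow conversion and does not behave identically to it.
    """
    try:
        return [float(v) for v in values]
    except (TypeError, ValueError) as e:
        # Chain the low-level error so both appear in the traceback,
        # and add the column name the original error lacks.
        raise ValueError(
            f"Exception thrown when converting column with name '{name}' to double."
        ) from e
```

The original exception stays available as `__cause__`, so the low-level detail is kept while the new message tells the user which column to look at.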

Why are the changes needed?

Existing errors are generic (KeyError) or meaningless ('int' object is not iterable). The errors should help users spot the mismatching columns by naming them.

The schema of the returned pandas DataFrames can only be checked while the DataFrame is being processed, so such errors are expensive to hit. They should therefore be expressive.

Does this PR introduce any user-facing change?

This only changes error messages, not behaviour.

How was this patch tested?

Tests all cases of schema mismatch for DataFrame.mapInPandas.

EnricoMi · 2023-08-03T07:28:39Z

@allisonwang-db @xinrong-meng @HyukjinKwon

EnricoMi · 2023-08-03T09:39:29Z

Thanks!

HyukjinKwon · 2023-08-04T01:37:34Z

Merged to branch-3.5.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

EnricoMi added 17 commits August 3, 2023 09:23

Correct type error message for mapInArrows, check iterator type

1b8fbea

Improve error messages for applyInPandas

01a9e59

Rename Pandas.DataFrame in strings and docstrings

0117528

Remove redundant .toDF from pandas tests

a64121f

DataFrame.mapInPandas allows for extra columns

ae5d9fc

Reformatting Python

21f8b53

Make mapInPandas work with iterables again

835a36e

Fixing Python lints

d1890f0

Assert actual element type, not __len__ attribute

e35a7ff

Remove QuietTest from MapInPandasParityTests, skip failing test

ca85664

Fix test_other_than_recordbatch_iter in ArrowMapParityTests

cd5ef19

Split wrap_batch_iter_udf into wrap_pandas_batch_iter_udf and wrap_arrow_batch_iter_udf

e9b018f

Really test with empty dataframe

58f7b58

Fix pandas map tests

a114f3b

Use PySparkTypeError instead of TypeError

2a18221

Fix lint

88bedfc

Add trucate_return_schema to wrap_arrow_udtf

2d67194

github-actions bot added SQL CORE PYTHON PANDAS API ON SPARK CONNECT labels Aug 3, 2023

EnricoMi mentioned this pull request Aug 3, 2023

[SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch #39952

Closed

HyukjinKwon approved these changes Aug 3, 2023

HyukjinKwon closed this Aug 4, 2023
