[SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch #39952
@HyukjinKwon this is a follow-up to #38223
@HyukjinKwon @cloud-fan would you say
562ba0b to 1f65f7e
e4427f8 to b8994c2
MaxGekk
left a comment
@HyukjinKwon @ueshin @itholic Could you have a look at the PR?
Could you rebase this PR onto master? There seem to be conflicts between master and your branch.
python/pyspark/worker.py
Outdated
Can we raise PySparkTypeError instead of TypeError?
524afe8 to 8fb5496
@xinrong-meng I think you should take a look at this.
Thanks @EnricoMi!
8fb5496 to 99bd1f2
@xinrong-meng split
09a6a71 to 393226a
The refactoring is neat and clean! Would you fix the CI test failure?
393226a to 3145854
Not sure how to fix the
Would you try the command "dev/connect-gen-protos.sh"?
…row_batch_iter_udf
3145854 to 7328294
Running
The last commit seems to fail the tests. Would you fix it?
All green, all done.
Merged to master, thanks!
@xinrong-meng @EnricoMi should we also merge this in branch-3.5?
I am fine with merging it to 3.5.
Yes, please!
Merge PR for branch 3.5 in #42316.
…InPandas for schema mismatch

### What changes were proposed in this pull request?
This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
    df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

### Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.

### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With `df = spark.range(2).withColumn("v", col("id"))`:
Mismatching column names:
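The check behind this case amounts to comparing the returned column names with the schema's field names and naming both differences. A minimal sketch in plain Python, not Spark's actual implementation: `check_column_names` is a hypothetical helper, and plain lists of names stand in for the pandas.DataFrame columns and the schema fields.

```python
from typing import Iterable


def check_column_names(returned: Iterable[str], expected: Iterable[str]) -> None:
    """Raise a RuntimeError that names missing and unexpected columns.

    Hypothetical helper for illustration only; the wording of the message
    follows the error text quoted in this PR.
    """
    returned_set = set(returned)
    expected_set = set(expected)
    missing = sorted(expected_set - returned_set)
    unexpected = sorted(returned_set - expected_set)
    if not missing and not unexpected:
        return  # names match, order is not checked in this sketch

    details = []
    if missing:
        details.append("Missing: " + ", ".join(missing))
    if unexpected:
        details.append("Unexpected: " + ", ".join(unexpected))
    raise RuntimeError(
        "Column names of the returned pandas.DataFrame do not match "
        "specified schema. " + " ".join(details)
    )
```

Calling `check_column_names(["id", "v"], ["id", "val"])` raises a `RuntimeError` whose message names `val` as missing and `v` as unexpected, which is the information a bare `KeyError: 'val'` leaves out.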
Python function not returning iterator:
Python function not returning iterator of pandas.DataFrame:
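Both return-type cases can be covered by one validation step: first check that the result is iterable at all, then check each element as it is consumed. The following is a sketch under stated assumptions, not the code in `python/pyspark/worker.py`: `verify_result`, `expected_type`, and `expected_name` are hypothetical names, and any class can stand in for `pandas.DataFrame`.

```python
from collections.abc import Iterable
from typing import Any, Iterator


def verify_result(result: Any, expected_type: type, expected_name: str) -> Iterator[Any]:
    """Validate the return value of a user-defined function.

    Hypothetical helper: raises immediately if `result` is not iterable,
    and lazily (while iterating) if an element has the wrong type.
    """
    if not isinstance(result, Iterable):
        raise TypeError(
            "Return type of the user-defined function should be "
            f"iterator of {expected_name}, but is {type(result)}"
        )

    def checked() -> Iterator[Any]:
        for element in result:
            if not isinstance(element, expected_type):
                raise TypeError(
                    "Return type of the user-defined function should be "
                    f"iterator of {expected_name}, but is iterator of {type(element)}"
                )
            yield element

    return checked()
```

With a stand-in class `Frame`, `verify_result(1, Frame, "pandas.DataFrame")` fails right away, while `verify_result([1], Frame, "pandas.DataFrame")` fails only once the first element is pulled, mirroring the fact that batches are checked during processing.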
Mismatching types (ValueError and TypeError):
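The "direct cause" wording in the improved messages corresponds to Python's exception chaining (`raise ... from ...`): the low-level conversion error is kept as the cause, and a new error of the same kind adds the column name. A small illustration in plain Python, assuming a hypothetical `convert_column` helper and an arbitrary `converter` callable in place of the Arrow conversion:

```python
from typing import Any, Callable, List, Sequence


def convert_column(
    name: str,
    values: Sequence[Any],
    target: str,
    converter: Callable[[Any], Any],
) -> List[Any]:
    """Convert every value, re-raising failures with the column name attached.

    Hypothetical helper for illustration; `converter` stands in for the
    real pandas-to-Arrow conversion.
    """
    message = f"Exception thrown when converting column with name '{name}' to {target}."
    try:
        return [converter(value) for value in values]
    except TypeError as error:
        # Keep the original error as __cause__ so both tracebacks are shown.
        raise TypeError(message) from error
    except ValueError as error:
        raise ValueError(message) from error
```

For example, `convert_column("v", ["0", "x"], "double", float)` raises a `ValueError` that names column `v`, and its `__cause__` is the original `ValueError` from `float("x")`.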
Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users spot the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked while the DataFrame is being processed, so such errors surface late and are very expensive. Therefore, they should be expressive.
Does this PR introduce any user-facing change?
This only changes error messages, not behaviour.
How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.