[SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch #42316
Closed
EnricoMi wants to merge 17 commits into apache:branch-3.5 from G-Research:branch-pyspark-map-in-pandas-schema-mismatch-3.5
…row_batch_iter_udf
HyukjinKwon approved these changes Aug 3, 2023
Contributor, Author:
Thanks!
Member:
Merged to branch-3.5.
HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2023
…InPandas for schema mismatch

### What changes were proposed in this pull request?
This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
    df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

### Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.

### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
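
For readers who want to see the idea behind the "Missing / Unexpected" message in isolation, here is a minimal pure-Python sketch. It is not the code from this patch: `check_column_names` and `describe_mismatch` are hypothetical helpers invented for illustration, and the sketch compares plain lists of column names rather than a real pandas.DataFrame against a Spark schema. It only assumes the message wording quoted in the commit message above.

```python
from typing import Iterable, List


def check_column_names(expected: Iterable[str], actual: Iterable[str]) -> None:
    """Hypothetical helper: raise a RuntimeError that names mismatching columns."""
    expected_set = set(expected)
    actual_set = set(actual)
    # Columns the schema asks for but the returned frame lacks.
    missing = sorted(expected_set - actual_set)
    # Columns the returned frame has but the schema does not mention.
    unexpected = sorted(actual_set - expected_set)
    if missing or unexpected:
        details: List[str] = []
        if missing:
            details.append("Missing: " + ", ".join(missing))
        if unexpected:
            details.append("Unexpected: " + ", ".join(unexpected))
        raise RuntimeError(
            "Column names of the returned pandas.DataFrame do not match "
            "specified schema. " + " ".join(details)
        )


def describe_mismatch(expected: Iterable[str], actual: Iterable[str]) -> str:
    """Hypothetical wrapper: return the error text, or 'ok' when names match."""
    try:
        check_column_names(expected, actual)
    except RuntimeError as error:
        return str(error)
    return "ok"
```

Calling `describe_mismatch(["id", "val"], ["id", "v"])` mirrors the first example above, where the schema `"id long, val long"` is applied to a frame with columns `id` and `v`. Naming both sides of the mismatch is what makes the error actionable compared with a bare `KeyError: 'val'`.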