Conversation

@EnricoMi
Contributor

@EnricoMi EnricoMi commented Feb 9, 2023

What changes were proposed in this pull request?

Similar to #38223, improve the error messages when a Python method provided to DataFrame.mapInPandas returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```
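The name check behind this error can be sketched in plain pandas (an illustrative helper, not Spark's actual implementation; the function name and message wording are only modelled on the PR):

```python
import pandas as pd

def check_column_names(pdf, expected):
    # Compare the returned frame's columns against the expected schema and
    # name every mismatching column in the error, as the PR's message does.
    missing = [c for c in expected if c not in pdf.columns]
    unexpected = [c for c in pdf.columns if c not in expected]
    if missing or unexpected:
        raise RuntimeError(
            "Column names of the returned pandas.DataFrame do not match "
            "specified schema. "
            f"Missing: {' '.join(missing)}  Unexpected: {' '.join(unexpected)}"
        )

pdf = pd.DataFrame({"id": [0, 1], "v": [0, 1]})
check_column_names(pdf, ["id", "v"])    # matching schema: passes silently
```

Naming both the missing and the unexpected columns lets users spot a simple rename mistake (`v` vs `val`) without digging through a worker stack trace.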

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```
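Both checks can be sketched together: verify that the returned object is iterable, then verify each yielded element lazily, since the user function's output is a stream of batches (a minimal illustration with a hypothetical name; the real check lives in PySpark's worker code):

```python
import pandas as pd
from collections.abc import Iterable

def verify_pandas_result(result):
    # Sketch: lists and generators count as "iterator of ..." here, plain
    # ints do not; element types are checked lazily while batches stream.
    if not isinstance(result, Iterable):
        raise TypeError(
            "Return type of the user-defined function should be "
            f"iterator of pandas.DataFrame, but is {type(result)}"
        )
    for elem in result:
        if not isinstance(elem, pd.DataFrame):
            raise TypeError(
                "Return type of the user-defined function should be "
                f"iterator of pandas.DataFrame, but is iterator of {type(elem)}"
            )
        yield elem
```

Validating lazily matters: the batches cannot be materialized up front without buffering the whole partition, so the element check has to run as each batch is consumed.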

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
```

```Python
with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

Why are the changes needed?

Existing errors are generic (KeyError) or meaningless ('int' object is not iterable). The errors should help users spot the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked while processing the DataFrame, so hitting such errors is very expensive. Therefore, they should be expressive.

Does this PR introduce any user-facing change?

This only changes error messages, not behaviour.

How was this patch tested?

Tests all cases of schema mismatch for DataFrame.mapInPandas.

@EnricoMi
Contributor Author

EnricoMi commented Feb 9, 2023

@HyukjinKwon this is a follow-up to #38223

@EnricoMi
Contributor Author

@HyukjinKwon @cloud-fan would you say Dataset.mapInPandas should be on par with the improved error messages of Dataset.groupby(...).applyInPandas in the same Spark release (that would be 3.4.0)?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 562ba0b to 1f65f7e Compare February 28, 2023 11:04
@EnricoMi
Contributor Author

CC @cloud-fan @itholic @zhengruifeng

@EnricoMi
Contributor Author

CC @gatorsmile @xinrong-meng

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from e4427f8 to b8994c2 Compare May 24, 2023 09:09
Member

@MaxGekk MaxGekk left a comment

@HyukjinKwon @ueshin @itholic Could you have a look at the PR?

@itholic
Contributor

itholic commented Jun 29, 2023

Could you rebase this PR onto master? It seems there are some conflicts between master and your branch.
https://github.com/G-Research/spark/runs/13927060744

```
From https://github.com/G-Research/spark
 * branch                  branch-pyspark-map-in-pandas-schema-mismatch -> FETCH_HEAD
Auto-merging python/pyspark/pandas/frame.py
Auto-merging python/pyspark/sql/pandas/serializers.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
Auto-merging python/pyspark/sql/tests/pandas/test_pandas_map.py
CONFLICT (content): Merge conflict in python/pyspark/sql/tests/pandas/test_pandas_map.py
Auto-merging python/pyspark/sql/tests/test_arrow_map.py
Squash commit -- not updating HEAD
Automatic merge failed; fix conflicts and then commit the result.
Error: Process completed with exit code 1.
```

Contributor

Can we raise PySparkTypeError instead of TypeError?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 524afe8 to 8fb5496 Compare June 30, 2023 09:32
@HyukjinKwon
Member

@xinrong-meng I think you should take a look at this.

@xinrong-meng
Member

xinrong-meng commented Jul 5, 2023

Thanks @EnricoMi !
I would suggest creating a separate def wrap_.. for PythonEvalType.SQL_MAP_ARROW_ITER_UDF instead of introducing a new parameter is_arrow_iter to wrap_batch_iter_udf.
That maintains logical consistency with the other wrap_ functions and promotes a modular design.
My point is subject to debate.

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 8fb5496 to 99bd1f2 Compare July 10, 2023 09:35
@EnricoMi
Contributor Author

@xinrong-meng split wrap_batch_iter_udf into wrap_pandas_batch_iter_udf and wrap_arrow_batch_iter_udf: 725c3af
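The split can be sketched as follows. All bodies here are hypothetical: the real wrappers live in python/pyspark/worker.py and do considerably more (schema verification, serialization); this only illustrates the one-wrapper-per-eval-type design the review asked for:

```python
import pandas as pd
from collections.abc import Iterable

def _verify_batches(result, batch_type):
    # Shared helper (illustrative): check the result is iterable and that
    # every yielded batch has the expected type, lazily while streaming.
    if not isinstance(result, Iterable):
        raise TypeError(
            f"should be iterator of {batch_type.__name__}, but is {type(result)}"
        )
    for elem in result:
        if not isinstance(elem, batch_type):
            raise TypeError(
                f"should be iterator of {batch_type.__name__}, "
                f"but is iterator of {type(elem)}"
            )
        yield elem

def wrap_pandas_batch_iter_udf(f):
    # For mapInPandas-style UDFs: batches are pandas DataFrames.
    return lambda it: _verify_batches(f(it), pd.DataFrame)

# wrap_arrow_batch_iter_udf would do the same for
# PythonEvalType.SQL_MAP_ARROW_ITER_UDF with pyarrow.RecordBatch batches.
```

Dedicated wrappers keep each eval type's checks in one place, instead of threading an `is_arrow_iter` flag through shared code.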

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 09a6a71 to 393226a Compare July 10, 2023 18:52
@xinrong-meng
Member

The refactoring is neat and clean! Would you fix the CI test failure?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 393226a to 3145854 Compare July 11, 2023 06:16
@EnricoMi
Contributor Author

Not sure how to fix the Python code generation check: https://github.com/G-Research/spark/actions/runs/5516480294/jobs/10057925480#step:18:101

@xinrong-meng
Member

Would you try the command "dev/connect-gen-protos.sh"?

@EnricoMi EnricoMi force-pushed the branch-pyspark-map-in-pandas-schema-mismatch branch from 3145854 to 7328294 Compare July 12, 2023 14:57
@EnricoMi
Contributor Author

Running dev/connect-gen-protos.sh showed the same error. Rebasing with latest master fixed the issue.

@xinrong-meng xinrong-meng self-requested a review July 14, 2023 03:16
@xinrong-meng
Member

The last commit seems to fail the tests. Would you fix it?

@EnricoMi
Contributor Author

All green, all done.

@xinrong-meng
Member

Merged to master, thanks!

@allisonwang-db
Contributor

allisonwang-db commented Aug 2, 2023

@xinrong-meng @EnricoMi should we also merge this in branch-3.5?

@HyukjinKwon
Member

I am fine with merging it to 3.5.

@EnricoMi
Contributor Author

EnricoMi commented Aug 3, 2023

Yes, please!

@EnricoMi
Contributor Author

EnricoMi commented Aug 3, 2023

Merge PR for branch 3.5 in #42316.

HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2023
…InPandas for schema mismatch

### What changes were proposed in this pull request?
This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

### Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.

### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>