Skip to content

Conversation

@EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Aug 3, 2023

What changes were proposed in this pull request?

This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to DataFrame.mapInPandas returns a Pandas DataFrame that does not match the expected schema.

With

df = spark.range(2).withColumn("v", col("id"))

Mismatching column names:

df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v

Python function not returning iterator:

df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>

Python function not returning iterator of pandas.DataFrame:

df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>

Mismatching types (ValueError and TypeError):

df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.

Why are the changes needed?

Existing errors are generic (KeyError) or meaningless ('int' object is not iterable). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

Does this PR introduce any user-facing change?

This only changes error messages, not behaviour.

How was this patch tested?

Tests all cases of schema mismatch for DataFrame.mapInPandas.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Aug 3, 2023

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Aug 3, 2023

Thanks!

@HyukjinKwon
Copy link
Member

Merged to branch-3.5.

@HyukjinKwon HyukjinKwon closed this Aug 4, 2023
HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2023
…InPandas for schema mismatch

### What changes were proposed in this pull request?
This merges #39952 into 3.5 branch.

Similar to #38223, improve the error messages when a Python method provided to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the expected schema.

With
```Python
df = spark.range(2).withColumn("v", col("id"))
```

**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```

**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```

**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```

**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
  df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
#      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```

### Why are the changes needed?
Existing errors are generic (`KeyError`) or meaningless (`'int' object is not iterable`). The errors should help users in spotting the mismatching columns by naming them.

The schema of the returned Pandas DataFrames can only be checked during processing the DataFrame, so such errors are very expensive. Therefore, they should be expressive.

### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.

### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.

Closes #42316 from EnricoMi/branch-pyspark-map-in-pandas-schema-mismatch-3.5.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants