@xinrong-meng
Member

@xinrong-meng xinrong-meng commented Nov 14, 2024

What changes were proposed in this pull request?

This PR targets Spark Connect only. Spark Classic has been handled in #48677.

The verifySchema parameter of createDataFrame on Spark Classic decides whether to verify the data types of every row against the schema.

It is currently not supported on Spark Connect.

This PR proposes supporting verifySchema on Spark Connect.

The verifySchema parameter defaults to pyspark._NoValue. If it is not provided, the effective value depends on the input passed to createDataFrame:

  • pyarrow.Table: verifySchema = False
  • pandas.DataFrame with Arrow optimization: verifySchema = the value of spark.sql.execution.pandas.convertToArrowArraySafely
  • regular Python instances: verifySchema = True
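
The default resolution described above can be sketched in plain Python. This is only an illustration, not the code in this PR: `_NO_VALUE` stands in for pyspark._NoValue, and the function name `resolve_verify_schema`, the `data_kind` strings, and the `arrow_safely` argument (standing in for the value of spark.sql.execution.pandas.convertToArrowArraySafely) are all hypothetical.

```python
from typing import Any

# Hypothetical stand-in for the pyspark._NoValue sentinel, meaning "not provided".
_NO_VALUE = object()


def resolve_verify_schema(
    data_kind: str,
    verify_schema: Any = _NO_VALUE,
    arrow_safely: bool = False,
) -> bool:
    """Return the effective verifySchema value for a given kind of input.

    data_kind is an illustrative label, not a real PySpark argument.
    arrow_safely mimics spark.sql.execution.pandas.convertToArrowArraySafely.
    """
    # An explicit value from the caller always wins.
    if verify_schema is not _NO_VALUE:
        return bool(verify_schema)
    # pyarrow.Table input: no verification by default.
    if data_kind == "pyarrow.Table":
        return False
    # pandas.DataFrame with Arrow optimization: follow the config value.
    if data_kind == "pandas.DataFrame":
        return arrow_safely
    # Regular Python instances: verify by default.
    return True
```

Using a sentinel rather than None or a boolean default lets the function tell "the caller passed False" apart from "the caller passed nothing", which is what makes the per-input defaults possible.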

Schema enforcement for numpy ndarray input behaves unexpectedly and will be resolved in a follow-up: https://issues.apache.org/jira/browse/SPARK-50323.

Why are the changes needed?

Parity with Spark Classic.

Does this PR introduce any user-facing change?

Yes, the verifySchema parameter of createDataFrame is now supported in Spark Connect.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng changed the title [SPARK-50298][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][PYTHON}[CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng changed the title [SPARK-50298][PYTHON}[CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect [SPARK-50298][PYTHON][CONNECT] Implement verifySchema parameter of createDataFrame in Spark Connect Nov 15, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review November 15, 2024 04:19
@xinrong-meng
Member Author

Merged to master, thank you!

@HyukjinKwon
Member

Actually, we had an offline discussion. I think we should evaluate the performance impact, and consider deprecating this parameter if it isn't really useful instead of propagating it.

Let me revert #48677 and #48841 for now.
