[SPARK-50291][PYTHON] Standardize verifySchema parameter of createDataFrame in Spark Classic #48677
Conversation
python/pyspark/sql/session.py
I think we don't need to mention this.
Maybe say that this parameter is now respected in Spark Connect and with Arrow optimization
Makes sense! Removed.
Force-pushed from 6f411f9 to b707d2a.
Type hints failed weirdly as https://github.com/xinrong-meng/spark/actions/runs/11812734311/job/32908478871, ignoring

Merged to master.
…eateDataFrame in Spark Connect

### What changes were proposed in this pull request?
The PR targets Spark Connect only. Spark Classic has been handled in #48677.

The `verifySchema` parameter of createDataFrame on Spark Classic decides whether to verify the data types of every row against the schema. Currently it is not supported on Spark Connect. The PR proposes to support `verifySchema` on Spark Connect.

By default, the `verifySchema` parameter is `pyspark._NoValue`; if not provided, createDataFrame with
- `pyarrow.Table` uses **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization uses **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
- regular Python instances uses **verifySchema = True**

The schema enforcement of numpy ndarray input is unexpected and will be resolved as a follow-up, https://issues.apache.org/jira/browse/SPARK-50323.

### Why are the changes needed?
Parity with Spark Classic.

### Does this PR introduce _any_ user-facing change?
Yes, the `verifySchema` parameter of createDataFrame is supported in Spark Connect.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48841 from xinrong-meng/verifySchemaConnect.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
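As background for what "verify data types of every row against the schema" means, the check can be pictured with a toy verifier. This is an illustrative sketch only: the helper name `verify_rows` and the simplified `(name, type)` schema format are assumptions for this example, not PySpark's actual internals (PySpark's real verification lives in its type-verifier machinery and handles far more cases, such as nullability and nested types).

```python
def verify_rows(rows, schema):
    """Toy row-vs-schema check.

    schema: list of (field_name, python_type) pairs.
    Raises TypeError on the first value whose type does not match.
    """
    for i, row in enumerate(rows):
        for value, (name, expected) in zip(row, schema):
            # None is treated as allowed here for simplicity; real
            # verification also consults the field's nullability.
            if value is not None and not isinstance(value, expected):
                raise TypeError(
                    f"row {i}, field {name!r}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

schema = [("id", int), ("name", str)]
verify_rows([(1, "a"), (2, "b")], schema)   # passes silently

try:
    verify_rows([(1, "a"), ("oops", "b")], schema)
except TypeError as e:
    print(e)  # row 1, field 'id': expected int, got str
```

With `verifySchema=False`, a check of this kind is skipped, which is faster but lets mismatched values flow through to later (and potentially more confusing) failures.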
What changes were proposed in this pull request?
The PR targets Spark Classic only. Spark Connect will be handled in a follow-up PR.
The `verifySchema` parameter of createDataFrame decides whether to verify the data types of every row against the schema. Currently it only takes effect for createDataFrame with regular Python instances.

The PR proposes to make it work with createDataFrame with
- `pyarrow.Table`
- `pandas.DataFrame` with Arrow optimization
- `pandas.DataFrame` without Arrow optimization

By default, the `verifySchema` parameter is `pyspark._NoValue`; if not provided, createDataFrame with
- `pyarrow.Table` uses **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization uses **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
- `pandas.DataFrame` without Arrow optimization uses **verifySchema = True**

Why are the changes needed?
The change makes schema validation consistent across all formats, improving data integrity and helping prevent errors.
It also enhances flexibility by allowing users to choose schema verification regardless of the input type.
Part of SPARK-50146.
Does this PR introduce any user-facing change?
Setup: (code example not preserved in this capture)
Usage - createDataFrame with `pyarrow.Table`: (example not preserved)
Usage - createDataFrame with `pandas.DataFrame` without Arrow optimization: (example not preserved)
Usage - createDataFrame with `pandas.DataFrame` with Arrow optimization: (example not preserved)
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.