[SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark for version 3.6+ #26496
Conversation
WIP still need to do:
Test build #113676 has finished for PR 26496 at commit
Test build #113819 has finished for PR 26496 at commit
ping @HyukjinKwon @viirya for thoughts on the proposed implementation. For testing of Python 2.7, I think we will have to set the env variable to use LegacyRow by default; otherwise there is a lot of code to change.
HyukjinKwon
left a comment
Sorry for my late response. +1 from me.
```python
if kwargs:
    # create row objects
    names = sorted(kwargs.keys())
```
Actually, after a second thought, why don't we just have an env variable to switch the sorting on and off, disable it in Spark 3.0, and remove the env variable in Spark 3.1? I suspect it will need fewer changes than having a separate class for legacy Row.
Yeah, we could do that, but it doesn't solve the problem of the __from_dict__ flag, which is not needed if there is no sorting. That flag isn't serialized, which causes different behavior once the Row is serialized.
Hmm, actually it looks like it could be possible to add the __from_dict__ flag only when sorting is enabled too. I can give that a try and see if it works, wdyt?
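For illustration, a minimal sketch of what the env-var-gated approach discussed above could look like; the variable and attribute names here are assumptions for the example, not the merged PySpark implementation:

```python
import os
import sys

# Hypothetical simplification of Row construction gated by the environment
# variable under discussion; not the actual PySpark code.
_SORT_FIELDS = os.environ.get(
    "PYSPARK_ROW_FIELD_SORTING_ENABLED", "false").lower() == "true"

class Row(tuple):
    def __new__(cls, *args, **kwargs):
        if kwargs:
            if _SORT_FIELDS or sys.version_info[:2] < (3, 6):
                # legacy Spark 2.x behavior: sort field names alphabetically
                names = sorted(kwargs.keys())
            else:
                # new behavior: keep the order the arguments were entered
                names = list(kwargs.keys())
            row = tuple.__new__(cls, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
        return tuple.__new__(cls, args)
```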
Ok, I changed to @HyukjinKwon's suggestion of using the env var to control sorting only. I think this is the most gentle approach to introducing the new behavior (of not sorting) while still allowing users to create legacy Rows that are sorted, if needed. It will also be a fairly clean removal when Python < 3.6 is deprecated. I did have to set the env var by default for tests, but I introduced a new test that makes an unsorted Row. This can also be changed once we deprecate Python 2 and remove it from testing. Let me know if there is any issue with the current implementation, and I will add to the migration guide as soon as I can.
Test build #114803 has finished for PR 26496 at commit
Test build #114804 has finished for PR 26496 at commit
python/pyspark/sql/types.py
Outdated
| to "true". This option is deprecated and will be removed in future versions | ||
| of Spark. For Python versions < 3.6, named arguments can no longer be used | ||
| without enabling field sorting with the environment variable above because | ||
| order or the arguments is not guaranteed to be the same as entered, see |
typo? or -> of?
python/pyspark/sql/types.py
Outdated
```python
if not Row._row_field_sorting_enabled and sys.version_info[:2] < (3, 6):
    warnings.warn("To use named arguments for Python version < 3.6, Row "
                  "field sorting must be enabled by setting the environment "
                  "variable 'PYSPARK_ROW_FIELD_SORTING_ENABLED' to 'true'.")
```
it would be better to say we enable it automatically now.
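For example, the message could be reworded along these lines (hypothetical wording for illustration, not the text that was merged):

```python
import warnings

warnings.warn("Row field sorting is enabled automatically for Python "
              "versions < 3.6 because keyword-argument order cannot be "
              "guaranteed; fields will be sorted alphabetically as in "
              "Spark 2.x.")
```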
```python
_make_type_verifier(data_type, nullable=False)(obj)

@unittest.skipIf(sys.version_info[:2] < (3, 6), "Create Row without sorting fields")
def test_Row_without_field_sorting(self):
```
no big deal, but can we rename test_Row_without_field_sorting -> test_row_without_field_sorting? Strictly it follows pep8, I guess (and I personally don't like such names in the current codebase...)
python/pyspark/sql/types.py
Outdated
```
NOTE: As of Spark 3.0.0, the Row field names are no longer sorted
alphabetically. To enable field sorting to create Rows compatible with
Spark 2.x, set the environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED"
```
I'm curious when this compatibility will matter. If using Python >= 3.6 with Spark 3.0.0, do users need this compatibility? Or is this just for Python < 3.6 users?
Yeah, good question. I think it's possible that even users on Python 3.6 have code that relied on the field names being sorted, and this change will break that code. So they might still need to set the env var until the existing code can be updated. Not sure how likely this scenario is...
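A hypothetical example of the kind of code that would break:

```python
from pyspark.sql import Row

row = Row(zip="94105", city="SF")

# Spark 2.x sorted fields alphabetically, so __fields__ was
# ["city", "zip"] and the underlying tuple was ("SF", "94105"); code that
# indexes positionally keeps working only while sorting is on.
city = row[0]  # "SF" with legacy sorting, "94105" once sorting is removed
```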
Test build #114814 has finished for PR 26496 at commit
retest this please
Test build #115787 has finished for PR 26496 at commit
Force-pushed d2f0bed to 3a69539
Apologies for the delay, but I've updated with a note in the migration guide, rebased, and removed the WIP. Please take another look @HyukjinKwon @viirya, thanks!
Test build #116194 has finished for PR 26496 at commit
HyukjinKwon
left a comment
Looks okay, though I haven't taken a super close look. Should be good to go @BryanCutler if you feel sure. Otherwise I will take a closer look in some days.
Thanks @HyukjinKwon, I'll give it a few days for any more comments. I'm not crazy about Python < 3.6 defaulting to the old behavior, but soon we will drop those versions and it won't be an issue anymore.
Test build #116430 has finished for PR 26496 at commit
merged to master, thanks @HyukjinKwon and @viirya
- Since Spark 3.0, `Column.getItem` is fixed such that it does not call `Column.apply`. Consequently, if `Column` is used as an argument to `getItem`, the indexing operator should be used. For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`.

- As of Spark 3.0 `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true". For Python versions less than 3.6, the field names will be sorted alphabetically as the only option.
nit: Could we mention that this must be set for all processes? For example: set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true" for **executors and driver**. This env must be consistent on all executors and the driver; any inconsistency may cause failures or incorrect answers.
+1. Let me fix it.
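For reference, one way to propagate the variable to executors is Spark's per-executor environment config (`spark.executorEnv.*`); the driver-side variable still has to be set in the environment that launches the driver. A sketch, with exact placement depending on deployment mode:

```python
import os

# Driver side: set before pyspark is imported, since the flag may be read
# when pyspark.sql.types is first loaded.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import SparkSession

# Executor side: spark.executorEnv.<NAME> sets an environment variable in
# each executor process, keeping driver and executors consistent.
spark = (SparkSession.builder
         .config("spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED", "true")
         .getOrCreate())
```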
… variable to set in both executor and driver

### What changes were proposed in this pull request?
This PR addresses the comment at #26496 (comment) and improves the migration guide to explicitly note that the legacy environment variable must be set in both executor and driver.

### Why are the changes needed?
To clarify that this env should be set in both driver and executors.

### Does this PR introduce any user-facing change?
Nope.

### How was this patch tested?
I checked it via md editor.

Closes #27573 from HyukjinKwon/SPARK-29748.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
What changes were proposed in this pull request?
Removes the sorting of PySpark SQL Row fields, which were previously sorted alphabetically by name, for Python versions 3.6 and above. Field order will now match the order in which the fields were entered. Rows will be used like tuples and applied to a schema by position. For Python versions < 3.6, the order of kwargs is not guaranteed, so fields will be sorted automatically as in previous versions of Spark.
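A quick illustration of the new behavior (assuming Spark 3.0 with Python 3.6+ and sorting not re-enabled):

```python
from pyspark.sql import Row

# New behavior: fields keep their entered order.
r = Row(b=1, a=2)
print(r.__fields__)  # ["b", "a"] rather than the previously sorted ["a", "b"]
```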
Why are the changes needed?
The old behavior was inconsistent: local Rows could be applied to a schema by matching field names, but once serialized a Row could only be used by position, and its fields were possibly in a different order.
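A sketch of that inconsistency, per the description above (illustrative, using the public PySpark API):

```python
from pyspark.sql import Row

row = Row(name="Alice", age=11)  # sorted in 2.x: fields ("age", "name")

# Locally, the field names are still attached, so name-based access works:
assert row["name"] == "Alice"

# After serialization only the tuple survives, so consumers see
# (11, "Alice") and must rely on position, which no longer matches a
# schema declared as (name, age). Removing the sorting makes the tuple
# order and the declared order agree.
serialized = tuple(row)  # (11, 'Alice') under legacy sorting
```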
Does this PR introduce any user-facing change?
Yes, Row fields are no longer sorted alphabetically but will be in the order entered. For Python < 3.6, `kwargs` cannot guarantee the order as entered, so Rows will be automatically sorted. An environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" can be set to override construction of `Row` and maintain compatibility with Spark 2.x.
How was this patch tested?
Existing tests are run with PYSPARK_ROW_FIELD_SORTING_ENABLED=true, and a new test with unsorted fields was added for Python 3.6+.
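For example, the legacy-mode behavior can be checked locally along these lines (illustrative; assumes the flag is read when pyspark is first imported):

```python
import os

# Opt into the legacy sorted behavior before importing pyspark.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import Row

r = Row(name="Alice", age=11)
print(r.__fields__)  # expected ["age", "name"] in legacy (sorted) mode
print(r)             # Row(age=11, name='Alice')
```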