[SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark for version 3.6+ #26496
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -15,6 +15,7 @@ |
| | |   # limitations under the License. |
| | |   # |
| | | |
| | | + import os |
| | |   import sys |
| | |   import decimal |
| | |   import time |
| | | @@ -25,6 +26,7 @@ |
| | |   import base64 |
| | |   from array import array |
| | |   import ctypes |
| | | + import warnings |
| | | |
| | |   if sys.version >= "3": |
| | |       long = int |
| | | @@ -1432,10 +1434,23 @@ class Row(tuple): |
| | | |
| | |       ``key in row`` will search through row keys. |
| | | |
| | | -     Row can be used to create a row object by using named arguments, |
| | | -     the fields will be sorted by names. It is not allowed to omit |
| | | -     a named argument to represent the value is None or missing. This should be |
| | | -     explicitly set to None in this case. |
| | | +     Row can be used to create a row object by using named arguments. |
| | | +     It is not allowed to omit a named argument to represent the value is |
| | | +     None or missing. This should be explicitly set to None in this case. |
| | | + |
| | | +     NOTE: As of Spark 3.0.0, Rows created from named arguments no longer have |
| | | +     field names sorted alphabetically and will be ordered in the position as |
| | | +     entered. To enable sorting for Rows compatible with Spark 2.x, set the |
| | | +     environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" to "true". This |
| | | +     option is deprecated and will be removed in future versions of Spark. For |
| | | +     Python versions < 3.6, the order of named arguments is not guaranteed to |
| | | +     be the same as entered, see https://www.python.org/dev/peps/pep-0468. In |
| | | +     this case, a warning will be issued and the Row will fallback to sort the |
| | | +     field names automatically. |
| | | + |
| | | +     NOTE: Examples with Row in pydocs are run with the environment variable |
| | | +     "PYSPARK_ROW_FIELD_SORTING_ENABLED" set to "true" which results in output |
| | | +     where fields are sorted. |
| | | |
| | |       >>> row = Row(name="Alice", age=11) |
| | |       >>> row |
| | | @@ -1474,21 +1489,40 @@ class Row(tuple): |
| | |       True |
| | |       """ |
| | | |
| | | -     def __new__(self, *args, **kwargs): |
| | | +     # Remove after Python < 3.6 dropped, see SPARK-29748 |
| | | +     _row_field_sorting_enabled = \ |
| | | +         os.environ.get('PYSPARK_ROW_FIELD_SORTING_ENABLED', 'false').lower() == 'true' |
| | | + |
| | | +     if _row_field_sorting_enabled: |
| | | +         warnings.warn("The environment variable 'PYSPARK_ROW_FIELD_SORTING_ENABLED' " |
| | | +                       "is deprecated and will be removed in future versions of Spark") |
| | | + |
| | | +     def __new__(cls, *args, **kwargs): |
| | |           if args and kwargs: |
| | |               raise ValueError("Can not use both args " |
| | |                                "and kwargs to create Row") |
| | |           if kwargs: |
| | | +             if not Row._row_field_sorting_enabled and sys.version_info[:2] < (3, 6): |
| | | +                 warnings.warn("To use named arguments for Python version < 3.6, Row fields will be " |
| | | +                               "automatically sorted. This warning can be skipped by setting the " |
| | | +                               "environment variable 'PYSPARK_ROW_FIELD_SORTING_ENABLED' to 'true'.") |
| | | +                 Row._row_field_sorting_enabled = True |
| | | + |
| | |               # create row objects |
| | | -             names = sorted(kwargs.keys()) |
Member
Actually, on second thought, why don't we just have an env variable to switch the sorting on and off, disable it in Spark 3.0, and remove the env variable in Spark 3.1? I suspect it will need fewer changes (rather than having a separate class for legacy row).
Member
Author
Yeah, we could do that but that doesn't solve the problem of the
Member
Author
Hmm, actually it looks like it could be possible to only add the
| | | -             row = tuple.__new__(self, [kwargs[n] for n in names]) |
| | | -             row.__fields__ = names |
| | | -             row.__from_dict__ = True |
| | | -             return row |
| | | +             if Row._row_field_sorting_enabled: |
| | | +                 # Remove after Python < 3.6 dropped, see SPARK-29748 |
| | | +                 names = sorted(kwargs.keys()) |
| | | +                 row = tuple.__new__(cls, [kwargs[n] for n in names]) |
| | | +                 row.__fields__ = names |
| | | +                 row.__from_dict__ = True |
| | | +             else: |
| | | +                 row = tuple.__new__(cls, list(kwargs.values())) |
| | | +                 row.__fields__ = list(kwargs.keys()) |
| | | |
| | | +             return row |
| | |           else: |
| | |               # create row class or objects |
| | | -             return tuple.__new__(self, args) |
| | | +             return tuple.__new__(cls, args) |
| | | |
| | |       def asDict(self, recursive=False): |
| | |           """ |
nit: Could we mention that this must be set for all processes? For example,
set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true" for **executors and driver**. This env must be consistent on all executors and driver. Any inconsistency may cause failures or incorrect answers.
+1. Let me fix it.
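
To illustrate the concern raised in this thread, here is a minimal sketch in plain Python (no PySpark needed) of the two field orderings the patched `Row.__new__` can produce, and of what could go wrong if two processes disagree on the setting. The helpers `sorting_enabled` and `row_fields` are hypothetical names invented for this example. The way values are paired with names at the end is an assumption made purely for illustration; it is not a description of how PySpark actually serializes rows.

```python
import os


def sorting_enabled(environ=None):
    """Same check as in the diff: only the string 'true' (any case) enables sorting."""
    environ = os.environ if environ is None else environ
    return environ.get("PYSPARK_ROW_FIELD_SORTING_ENABLED", "false").lower() == "true"


def row_fields(sort_fields, **kwargs):
    """Hypothetical helper: return (names, values) ordered like the patched Row.__new__."""
    if sort_fields:
        # Spark 2.x-compatible behaviour: fields sorted alphabetically by name.
        names = sorted(kwargs.keys())
        return names, tuple(kwargs[n] for n in names)
    # Spark 3.0 default: keep the order in which the keyword arguments were
    # entered (guaranteed by PEP 468 on Python 3.6+).
    return list(kwargs.keys()), tuple(kwargs.values())


# One process with sorting enabled ...
driver_names, driver_values = row_fields(True, name="Alice", age=11)
# driver_names == ['age', 'name'], driver_values == (11, 'Alice')

# ... and another process with sorting disabled.
executor_names, _ = row_fields(False, name="Alice", age=11)
# executor_names == ['name', 'age']

# Assumption for illustration: if values are matched to names by position and
# the two sides disagree on the order, the fields end up mislabelled.
mislabelled = dict(zip(executor_names, driver_values))
# mislabelled == {'name': 11, 'age': 'Alice'}
```

The mislabelled result raises no error on its own, which is why the setting needs to be identical on the driver and on every executor. One way to pass an environment variable to executors is Spark's `spark.executorEnv.[EnvironmentVariableName]` configuration property; whether that is the best mechanism for a given deployment is left open here.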