[SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark for version 3.6+ #26496
Conversation
WIP still need to do:
Test build #113676 has finished for PR 26496 at commit
Test build #113819 has finished for PR 26496 at commit
ping @HyukjinKwon @viirya for thoughts on the proposed implementation. For testing of Python 2.7, I think we will have to set the env variable to use LegacyRow by default; otherwise there is a lot of code to change.
HyukjinKwon
left a comment
Sorry for my late response. +1 from me.
```python
if kwargs:
    # create row objects
    names = sorted(kwargs.keys())
```
Actually, after a second thought, why don't we just have an env variable to switch the sorting on and off, disable it in Spark 3.0, and remove the env variable in Spark 3.1? I suspect it will need fewer changes than having a separate class for legacy Row.
Yeah, we could do that, but it doesn't solve the problem of the __from_dict__ flag, which is not needed if there is no sorting. That flag isn't serialized, which causes different behavior once the Row is serialized.
Hmm, actually it looks like it could be possible to add the __from_dict__ flag only when sorting is enabled too. I can give that a try and see if it works, wdyt?
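For illustration, a minimal sketch of what the env-var-gated approach discussed above could look like; the variable and attribute names here are assumptions for the example, not the merged PySpark implementation:

```python
import os
import sys

# Hypothetical simplification of Row construction gated by the environment
# variable under discussion; not the actual PySpark code.
_SORT_FIELDS = os.environ.get(
    "PYSPARK_ROW_FIELD_SORTING_ENABLED", "false").lower() == "true"

class Row(tuple):
    def __new__(cls, *args, **kwargs):
        if kwargs:
            if _SORT_FIELDS or sys.version_info[:2] < (3, 6):
                # legacy Spark 2.x behavior: sort field names alphabetically
                names = sorted(kwargs.keys())
            else:
                # new behavior: keep the order the arguments were entered
                names = list(kwargs.keys())
            row = tuple.__new__(cls, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
        return tuple.__new__(cls, args)
```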
Ok, I changed to @HyukjinKwon's suggestion of using the env var to control sorting only. I think this is the most gentle approach to introducing the new behavior (of not sorting) while still allowing users to create legacy Rows that are sorted, if needed. It will also be a fairly clean removal when Python < 3.6 is deprecated. I did have to set the env var by default for tests, but I introduced a new test that makes an unsorted Row. This can also be changed once we deprecate Python 2 and remove it from testing. Let me know if there is any issue with the current implementation, and I will add to the migration guide as soon as I can.
Test build #114803 has finished for PR 26496 at commit
Test build #114804 has finished for PR 26496 at commit
python/pyspark/sql/types.py
Outdated
| to "true". This option is deprecated and will be removed in future versions | ||
| of Spark. For Python versions < 3.6, named arguments can no longer be used | ||
| without enabling field sorting with the environment variable above because | ||
| order or the arguments is not guaranteed to be the same as entered, see |
typo? or -> of?
python/pyspark/sql/types.py
Outdated
```python
if not Row._row_field_sorting_enabled and sys.version_info[:2] < (3, 6):
    warnings.warn("To use named arguments for Python version < 3.6, Row "
                  "field sorting must be enabled by setting the environment "
                  "variable 'PYSPARK_ROW_FIELD_SORTING_ENABLED' to 'true'.")
```
it would be better to say we enable it automatically now.
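For example, the message could be reworded along these lines (hypothetical wording for illustration, not the text that was merged):

```python
import warnings

warnings.warn("Row field sorting is enabled automatically for Python "
              "versions < 3.6 because keyword-argument order cannot be "
              "guaranteed; fields will be sorted alphabetically as in "
              "Spark 2.x.")
```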
```python
_make_type_verifier(data_type, nullable=False)(obj)

@unittest.skipIf(sys.version_info[:2] < (3, 6), "Create Row without sorting fields")
def test_Row_without_field_sorting(self):
```
no big deal, but can we rename test_Row_without_field_sorting -> test_row_without_field_sorting? Strictly it follows pep8, I guess (and I personally don't like such names in the current codebase...)
python/pyspark/sql/types.py
Outdated
```
NOTE: As of Spark 3.0.0, the Row field names are no longer sorted
alphabetically. To enable field sorting to create Rows compatible with
Spark 2.x, set the environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED"
```
I'm curious when this compatibility will matter. If using Python >= 3.6 with Spark 3.0.0, do users need this compatibility? Or is this just for Python < 3.6 users?
Yeah, good question. I think it's possible that even users on Python 3.6 have code that relied on the field names being sorted, and this change will break that code. So they might still need to set the env var until the existing code can be updated. Not sure how likely this scenario is...
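A hypothetical example of the kind of code that would break:

```python
from pyspark.sql import Row

row = Row(zip="94105", city="SF")

# Spark 2.x sorted fields alphabetically, so __fields__ was
# ["city", "zip"] and the underlying tuple was ("SF", "94105"); code that
# indexes positionally keeps working only while sorting is on.
city = row[0]  # "SF" with legacy sorting, "94105" once sorting is removed
```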
Test build #114814 has finished for PR 26496 at commit
retest this please
Test build #115787 has finished for PR 26496 at commit
Force-pushed d2f0bed to 3a69539
Apologies for the delay, but I've updated with a note in the migration guide, rebased, and removed the WIP. Please take another look @HyukjinKwon @viirya, thanks!
Test build #116194 has finished for PR 26496 at commit
HyukjinKwon
left a comment
Looks okay, though I haven't taken a super close look. Should be good to go @BryanCutler if you feel sure. Otherwise I will take a closer look in some days.
Thanks @HyukjinKwon, I'll give it a few days for any more comments. I'm not crazy about Python < 3.6 defaulting to the old behavior, but soon we will drop those versions and it won't be an issue anymore.
Test build #116430 has finished for PR 26496 at commit
merged to master, thanks @HyukjinKwon and @viirya
- Since Spark 3.0, `Column.getItem` is fixed such that it does not call `Column.apply`. Consequently, if `Column` is used as an argument to `getItem`, the indexing operator should be used. For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`.

- As of Spark 3.0 `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true". For Python versions less than 3.6, the field names will be sorted alphabetically as the only option.
nit: Could we mention that this must be set for all processes? For example: set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true" for **executors and driver**. This env must be consistent on all executors and the driver; any inconsistency may cause failures or incorrect answers.
+1. Let me fix it.
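For reference, one way to propagate the variable to executors is Spark's per-executor environment config (`spark.executorEnv.*`); the driver-side variable still has to be set in the environment that launches the driver. A sketch, with exact placement depending on deployment mode:

```python
import os

# Driver side: set before pyspark is imported, since the flag may be read
# when pyspark.sql.types is first loaded.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import SparkSession

# Executor side: spark.executorEnv.<NAME> sets an environment variable in
# each executor process, keeping driver and executors consistent.
spark = (SparkSession.builder
         .config("spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED", "true")
         .getOrCreate())
```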
… variable to set in both executor and driver

### What changes were proposed in this pull request?
This PR addresses the comment at #26496 (comment) and improves the migration guide to explicitly note that the legacy environment variable must be set in both executor and driver.

### Why are the changes needed?
To clarify that this env should be set in both driver and executors.

### Does this PR introduce any user-facing change?
Nope.

### How was this patch tested?
I checked it via md editor.

Closes #27573 from HyukjinKwon/SPARK-29748.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
What changes were proposed in this pull request?
Removes the sorting of PySpark SQL Row fields, which were previously sorted alphabetically by name, for Python versions 3.6 and above. Field order will now match the order in which the fields were entered. Rows will be used like tuples and applied to a schema by position. For Python versions < 3.6, the order of kwargs is not guaranteed, so fields will be sorted automatically as in previous versions of Spark.
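A quick illustration of the new behavior (assuming Spark 3.0 with Python 3.6+ and sorting not re-enabled):

```python
from pyspark.sql import Row

# New behavior: fields keep their entered order.
r = Row(b=1, a=2)
print(r.__fields__)  # ["b", "a"] rather than the previously sorted ["a", "b"]
```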
Why are the changes needed?
The old behavior was inconsistent: local Rows could be applied to a schema by matching field names, but once serialized a Row could only be used by position, and its fields were possibly in a different order.
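A sketch of that inconsistency, per the description above (illustrative, using the public PySpark API):

```python
from pyspark.sql import Row

row = Row(name="Alice", age=11)  # sorted in 2.x: fields ("age", "name")

# Locally, the field names are still attached, so name-based access works:
assert row["name"] == "Alice"

# After serialization only the tuple survives, so consumers see
# (11, "Alice") and must rely on position, which no longer matches a
# schema declared as (name, age). Removing the sorting makes the tuple
# order and the declared order agree.
serialized = tuple(row)  # (11, 'Alice') under legacy sorting
```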
Does this PR introduce any user-facing change?
Yes, Row fields are no longer sorted alphabetically but will be in the order entered. For Python < 3.6, `kwargs` cannot guarantee the order as entered, so Rows will be automatically sorted. An environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" can be set to override construction of `Row` and maintain compatibility with Spark 2.x.
How was this patch tested?
Existing tests are run with PYSPARK_ROW_FIELD_SORTING_ENABLED=true, and a new test with unsorted fields was added for Python 3.6+.
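For example, the legacy-mode behavior can be checked locally along these lines (illustrative; assumes the flag is read when pyspark is first imported):

```python
import os

# Opt into the legacy sorted behavior before importing pyspark.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import Row

r = Row(name="Alice", age=11)
print(r.__fields__)  # expected ["age", "name"] in legacy (sorted) mode
print(r)             # Row(age=11, name='Alice')
```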