
Conversation


@BryanCutler BryanCutler commented Nov 13, 2019

What changes were proposed in this pull request?

Removes the sorting of PySpark SQL Row fields, which were previously sorted alphabetically by name, for Python versions 3.6 and above. Field order will now match the order in which the fields were entered. Rows will be used like tuples and applied to a schema by position. For Python versions < 3.6, the order of kwargs is not guaranteed, so fields will be sorted automatically as in previous versions of Spark.
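The two construction behaviors can be sketched in plain Python (a hypothetical illustration, not the actual pyspark.sql.Row implementation); PEP 468 guarantees that `**kwargs` preserves insertion order on Python 3.6+:

```python
# Hypothetical sketch of the two Row-construction behaviors; make_row
# is an illustration only, not the real pyspark.sql.Row implementation.

def make_row(sort_fields, **kwargs):
    # Legacy (Spark 2.x) behavior sorts field names alphabetically;
    # the new behavior keeps the order the caller entered them in.
    names = sorted(kwargs) if sort_fields else list(kwargs)
    return names, tuple(kwargs[n] for n in names)

legacy = make_row(True, name="Alice", id=1)
current = make_row(False, name="Alice", id=1)

print(legacy)   # (['id', 'name'], (1, 'Alice')) -- fields reordered
print(current)  # (['name', 'id'], ('Alice', 1)) -- order as entered
```

Applied to a schema by position, the same kwargs therefore produce different tuples under the two behaviors.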

Why are the changes needed?

This caused inconsistent behavior: local Rows could be applied to a schema by matching names, but once serialized a Row could only be used by position, and the fields were possibly in a different order.

Does this PR introduce any user-facing change?

Yes, Row fields are no longer sorted alphabetically but will be in the order entered. For Python < 3.6, kwargs cannot guarantee the order as entered, so Rows will be automatically sorted.

An environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" can be set that will override construction of Row to maintain compatibility with Spark 2.x.
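A minimal sketch of how such an override flag might be read (the helper name is hypothetical; the actual check lives inside PySpark):

```python
import os

def row_field_sorting_enabled():
    # Hypothetical helper: the legacy sorted behavior is opted into by
    # setting the environment variable to "true" (case-insensitive here
    # for illustration).
    value = os.environ.get("PYSPARK_ROW_FIELD_SORTING_ENABLED", "false")
    return value.lower() == "true"

os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"
print(row_field_sorting_enabled())  # True
```

Because Row objects are constructed on both the driver and the executors, such a flag only behaves consistently if the variable is set in every process.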

How was this patch tested?

Existing tests are run with PYSPARK_ROW_FIELD_SORTING_ENABLED=true, and a new test with unsorted fields was added for Python 3.6+.


BryanCutler commented Nov 13, 2019

WIP still need to do:

  • Add note to migration guide
  • Check docs
  • Fix remaining test failures
  • Add new tests


SparkQA commented Nov 13, 2019

Test build #113676 has finished for PR 26496 at commit ecd8f83.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 14, 2019

Test build #113819 has finished for PR 26496 at commit b85c3a6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _LegacyRow(Row):

@BryanCutler (Member Author) commented:

ping @HyukjinKwon @viirya for thoughts on the proposed implementation. For testing with Python 2.7, I think we will have to set the env variable to use LegacyRow by default; otherwise there is a lot of code to change.

@HyukjinKwon (Member) left a comment:

Sorry for my late response. +1 from me.


if kwargs:
# create row objects
names = sorted(kwargs.keys())
@HyukjinKwon (Member) commented Dec 2, 2019:
Actually, after a second thought, why don't we just have an env to switch the sorting on and off, disable it in Spark 3.0, and remove the env in Spark 3.1? I suspect it will need fewer changes (rather than having a separate class for legacy Row).

@BryanCutler (Member Author) replied:

Yeah, we could do that, but that doesn't solve the problem of the __from_dict__ flag, which is not needed if there is no sorting. That flag isn't serialized, which causes different behavior after serialization.

@BryanCutler (Member Author) replied:

Hmm, actually it looks like it could be possible to only add the __from_dict__ flag if sorting is enabled too. I can give that a try and see if it works, wdyt?


BryanCutler commented Dec 3, 2019

Ok, I changed to @HyukjinKwon's suggestion of using the env var to control the sorting, and also the creation of the __from_dict__ flag. I also changed the error for Python < 3.6 to a warning with a fallback to sorting fields, and removed the option of using an OrderedDict.

I think this is the gentlest approach to introducing the new behavior (of not sorting) while still allowing users to create legacy sorted Rows if needed. It will also be a fairly clean removal once Python < 3.6 is deprecated. I did have to set the env var by default for tests, but I introduced a new test that makes an unsorted Row. This can be changed once we deprecate Python 2 and remove it from testing.

Let me know if there is any issue with the current implementation, and I will add to the migration guide as soon as I can.


SparkQA commented Dec 3, 2019

Test build #114803 has finished for PR 26496 at commit 832545b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 3, 2019

Test build #114804 has finished for PR 26496 at commit 8a8f2aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

to "true". This option is deprecated and will be removed in future versions
of Spark. For Python versions < 3.6, named arguments can no longer be used
without enabling field sorting with the environment variable above because
order or the arguments is not guaranteed to be the same as entered, see
Reviewer comment (Member):

typo? or -> of?

if not Row._row_field_sorting_enabled and sys.version_info[:2] < (3, 6):
warnings.warn("To use named arguments for Python version < 3.6, Row "
"field sorting must be enabled by setting the environment "
"variable 'PYSPARK_ROW_FIELD_SORTING_ENABLED' to 'true'.")
Reviewer comment (Member):

it would be better to say we enable it automatically now.

_make_type_verifier(data_type, nullable=False)(obj)

@unittest.skipIf(sys.version_info[:2] < (3, 6), "Create Row without sorting fields")
def test_Row_without_field_sorting(self):
Reviewer comment (Member):
no big deal but can we test_Row_without_field_sorting -> test_row_without_field_sorting? Strictly it follows pep8 I guess (and I personally don't like such names in the current codebase ... )

NOTE: As of Spark 3.0.0, the Row field names are no longer sorted
alphabetically. To enable field sorting to create Rows compatible with
Spark 2.x, set the environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED"
@viirya (Member) commented Dec 4, 2019:

I'm curious when this compatibility will be matter? If using Python >= 3.6 at Spark 3.0.0, do users need this compatibility? Or this is just for Python < 3.6 users?

@BryanCutler (Member Author) replied:

Yeah, good question. I think it's possible for even users that have Python 3.6 to have code that relied on the field names being sorted and this will break their existing code. So they still might need to set the env var until the existing code could be updated. Not sure how likely this scenario is...
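As a concrete illustration of this risk (a hypothetical sketch with plain tuples standing in for serialized Rows): code that indexed positionally under the sorted order silently reads a different field once ordering follows entry order.

```python
# Row(name="Alice", id=1) serializes to a plain tuple of field values.
sorted_fields = ("id", "name")    # Spark 2.x: names sorted alphabetically
entered_fields = ("name", "id")   # Spark 3.0, Python 3.6+: as entered

spark2_row = (1, "Alice")         # values follow sorted field order
spark3_row = ("Alice", 1)         # values follow entry order

# Code written against Spark 2.x that assumed row[0] was the id now
# gets the name instead -- no error, just a different value.
print(spark2_row[0])  # 1
print(spark3_row[0])  # Alice
```

Setting the env var restores the Spark 2.x ordering until such code can be updated.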


SparkQA commented Dec 4, 2019

Test build #114814 has finished for PR 26496 at commit d2f0bed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

retest this please


SparkQA commented Dec 26, 2019

Test build #115787 has finished for PR 26496 at commit d2f0bed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler force-pushed the pyspark-remove-Row-sorting-SPARK-29748 branch from d2f0bed to 3a69539 Compare January 6, 2020 23:47
@BryanCutler BryanCutler changed the title [WIP][SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark [SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark for version 3.6+ Jan 6, 2020
@BryanCutler (Member Author) commented:

Apologies for the delay, but I've updated with a note in the migration guide and rebased, and removed the WIP. Please take another look @HyukjinKwon @viirya , thanks!


SparkQA commented Jan 7, 2020

Test build #116194 has finished for PR 26496 at commit 3a69539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment:

Looks okay though I haven't taken a super close look. Should be good to go @BryanCutler if you feel sure. Otherwise I will take a closer look in a few days.

@BryanCutler (Member Author) commented:

Thanks @HyukjinKwon, I'll give it a few days for any more comments. I'm not crazy about Python < 3.6 defaulting to the old behavior, but soon we will drop those versions and it won't be an issue anymore.


SparkQA commented Jan 10, 2020

Test build #116430 has finished for PR 26496 at commit 14e691d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler deleted the pyspark-remove-Row-sorting-SPARK-29748 branch January 10, 2020 22:40
@BryanCutler (Member Author) commented:

merged to master, thanks @HyukjinKwon and @viirya

- Since Spark 3.0, `Column.getItem` is fixed such that it does not call `Column.apply`. Consequently, if `Column` is used as an argument to `getItem`, the indexing operator should be used.
For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`.

- As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructed with named arguments for Python versions 3.6 and above, and the order of fields will match the order in which they were entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true". For Python versions less than 3.6, the field names will be sorted alphabetically as the only option.
Reviewer comment (Member):

nit: Could we mention that this must be set for all processes? For example: set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to "true" for **executors and driver**. This env must be consistent on all executors and the driver; any inconsistency may cause failures or incorrect answers.

Member reply:

+1. Let me fix it.

asfgit pushed a commit that referenced this pull request Feb 14, 2020
… variable to set in both executor and driver

### What changes were proposed in this pull request?

This PR addresses the comment at #26496 (comment) and improves the migration guide to explicitly note that the legacy environment variable must be set in both executor and driver.

### Why are the changes needed?

To clarify this env should be set both in driver and executors.

### Does this PR introduce any user-facing change?

Nope.

### How was this patch tested?

I checked it via md editor.

Closes #27573 from HyukjinKwon/SPARK-29748.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
asfgit pushed a commit that referenced this pull request Feb 14, 2020 (cherry picked from commit b343757; same commit message as above).