
Conversation

@viirya
Member

@viirya viirya commented May 15, 2019

What changes were proposed in this pull request?

In PySpark, a Row's __from_dict__ attribute is lost after pickling, but we rely on __from_dict__ when converting Rows to their internal representation via toInternal. This causes weird behavior:

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1') # correct
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2') # incorrect                                                         

This patch fixes the issue.

How was this patch tested?

Added test.
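The mechanism can be illustrated with a minimal pure-Python sketch. The `KwRow`, `_rebuild`, and `to_internal` names below are hypothetical stand-ins, not pyspark's actual classes: a "match by name" flag set at construction is dropped by a buggy `__reduce__`, so a round-tripped row silently falls back to positional conversion, mirroring the reported bug.

```python
import pickle

class KwRow(tuple):
    """Hypothetical mimic of pyspark's Row built from keyword arguments.

    Like the real Row, kwargs are sorted by field name, and a
    __from_dict__ flag marks that values should be matched to a target
    schema by name rather than by position.
    """
    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        row.__from_dict__ = True  # "match by name" marker
        return row

    def __reduce__(self):
        # Buggy pickling: only field names and values survive; the
        # __from_dict__ flag is dropped (mirroring the reported issue).
        return (_rebuild, (self.__fields__, tuple(self)))

def _rebuild(fields, values):
    row = tuple.__new__(KwRow, values)
    row.__fields__ = fields
    return row

def to_internal(row, schema_fields):
    """Reorder a row's values to match schema_fields."""
    if getattr(row, "__from_dict__", False):
        # by-name conversion
        d = dict(zip(row.__fields__, row))
        return tuple(d[f] for f in schema_fields)
    # positional conversion
    return tuple(row)

r = KwRow(A="1", B="2")
schema = ["B", "A"]
print(to_internal(r, schema))       # by name: ('2', '1')
r2 = pickle.loads(pickle.dumps(r))  # flag lost in transit
print(to_internal(r2, schema))      # positional: ('1', '2')
```

The local-list path hits `to_internal` before any pickling, so it matches by name; the RDD path serializes rows between processes first, which is why only it shows the swapped values.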

@SparkQA

SparkQA commented May 15, 2019

Test build #105414 has finished for PR 24614 at commit 473fb50.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 15, 2019

This is more interesting, as we allow something like:

data = [Row(key=i, value=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

So, the question is, in createDataFrame, should we respect original Row's schema in the RDD?

Currently,

  • In case creating dataframe from local list of Row, we respect the Row's schema.
  • In case from RDD of Row, we don't respect it, as shown in the example in the PR description.

The two cases are obviously inconsistent.

This difference also shows up in the following case: field names can't differ when creating from a local list of Rows.

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, a string").first()
Traceback (most recent call last):
  File "/Users/viirya/repos/spark-1/python/pyspark/sql/types.py", line 1527, in __getitem__
    idx = self.__fields__.index(item)
ValueError: 'a' is not in list
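A minimal sketch of why this raises: per the traceback, __getitem__ resolves a name via list.index over __fields__, which is case-sensitive, so 'a' never matches the stored field 'A' (the list below stands in for the Row's __fields__):

```python
# fields as stored on a Row built from Row(A="1", B="2")
fields = ["A", "B"]
try:
    fields.index("a")   # by-name lookup, case-sensitive
except ValueError as err:
    print(err)          # 'a' is not in list
```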

cc @HyukjinKwon @cloud-fan

@HyukjinKwon
Member

Yeah, I think I actually discussed this with @BryanCutler somewhere before; I forget what we ended up with. Bryan, do you remember which one we considered the correct case?

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1') # correct
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2') # incorrect        

I remember we considered __from_dict__ to be mistakenly lost, and that the Row should therefore be treated as a dict, but maybe I'm recalling it wrongly.

@BryanCutler
Member

Ah yes, there are all kinds of inconsistencies in the PySpark Row class. I think this is a duplicate of SPARK-22232 and we discussed in the PR here #20280. The fix there was to also pickle the __from_dict__ flag, but the problem becomes that the input schema in createDataFrame must match field names and no longer goes off of position. At the time, I thought this was the most consistent fix, but I'll have to look at it again.

@BryanCutler
Member

The problem with using a positional schema when Rows are constructed from keyword arguments is that the Row constructor sorts the fields, so any sense of position is lost. For example:

This works

data = [Row(k=i, v=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

This fails

data = [Row(z=i, y=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

where the only difference is the Row field names, so the matching really isn't positional.
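The sorting behavior described above can be sketched in plain Python (`row_fields` below is an illustration of what the Row constructor does to keyword arguments, not pyspark code): sorting happens to preserve the written order for (k, v) but flips it for (z, y), which is why only the second toDF call fails against the schema " a: int, b: string ".

```python
def row_fields(**kwargs):
    # mimic of Row's kwarg handling: field names come back sorted
    return sorted(kwargs)

# Row(k=..., v=...): sorted order matches the written order, so the
# int lands in a: int and the string in b: string -> conversion works
print(row_fields(k=1, v="1"))   # ['k', 'v']

# Row(z=..., y=...): sorting flips the order, so the string y is
# matched to a: int and the int z to b: string -> conversion fails
print(row_fields(z=1, y="1"))   # ['y', 'z']
```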

@viirya
Member Author

viirya commented May 19, 2019

This is a duplicate issue and there was prior discussion, so I'm closing this.

@viirya viirya closed this May 19, 2019
@viirya viirya deleted the SPARK-27712 branch December 27, 2023 18:36