
Conversation

@viirya
Member

@viirya viirya commented May 15, 2019

What changes were proposed in this pull request?

In PySpark, a Row's __from_dict__ attribute is lost after pickling, but we rely on __from_dict__ when converting Rows to their internal representation via toInternal. This causes weird behavior:

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1') # correct
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2') # incorrect                                                         

This patch fixes the issue.

How was this patch tested?

Added test.
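The mechanism can be illustrated with a minimal pure-Python sketch. The `KwRow`, `_rebuild`, and `to_internal` names below are hypothetical stand-ins, not pyspark's actual classes: a "match by name" flag set at construction is dropped by a buggy `__reduce__`, so a round-tripped row silently falls back to positional conversion, mirroring the reported bug.

```python
import pickle

class KwRow(tuple):
    """Hypothetical mimic of pyspark's Row built from keyword arguments.

    Like the real Row, kwargs are sorted by field name, and a
    __from_dict__ flag marks that values should be matched to a target
    schema by name rather than by position.
    """
    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        row.__from_dict__ = True  # "match by name" marker
        return row

    def __reduce__(self):
        # Buggy pickling: only field names and values survive; the
        # __from_dict__ flag is dropped (mirroring the reported issue).
        return (_rebuild, (self.__fields__, tuple(self)))

def _rebuild(fields, values):
    row = tuple.__new__(KwRow, values)
    row.__fields__ = fields
    return row

def to_internal(row, schema_fields):
    """Reorder a row's values to match schema_fields."""
    if getattr(row, "__from_dict__", False):
        # by-name conversion
        d = dict(zip(row.__fields__, row))
        return tuple(d[f] for f in schema_fields)
    # positional conversion
    return tuple(row)

r = KwRow(A="1", B="2")
schema = ["B", "A"]
print(to_internal(r, schema))       # by name: ('2', '1')
r2 = pickle.loads(pickle.dumps(r))  # flag lost in transit
print(to_internal(r2, schema))      # positional: ('1', '2')
```

The local-list path hits `to_internal` before any pickling, so it matches by name; the RDD path serializes rows between processes first, which is why only it shows the swapped values.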

@SparkQA

SparkQA commented May 15, 2019

Test build #105414 has finished for PR 24614 at commit 473fb50.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 15, 2019

This is more interesting, as we allow something like:

data = [Row(key=i, value=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

So, the question is, in createDataFrame, should we respect original Row's schema in the RDD?

Currently,

  • In case creating dataframe from local list of Row, we respect the Row's schema.
  • In case from RDD of Row, we don't respect it, as shown in the example in the PR description.

The two cases are obviously inconsistent.

This difference also shows up in the following case: field names can't differ when creating from a local list of Rows.

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, a string").first()
Traceback (most recent call last):
  File "/Users/viirya/repos/spark-1/python/pyspark/sql/types.py", line 1527, in __getitem__
    idx = self.__fields__.index(item)
ValueError: 'a' is not in list
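A minimal sketch of why this raises: per the traceback, __getitem__ resolves a name via list.index over __fields__, which is case-sensitive, so 'a' never matches the stored field 'A' (the list below stands in for the Row's __fields__):

```python
# fields as stored on a Row built from Row(A="1", B="2")
fields = ["A", "B"]
try:
    fields.index("a")   # by-name lookup, case-sensitive
except ValueError as err:
    print(err)          # 'a' is not in list
```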

cc @HyukjinKwon @cloud-fan

@HyukjinKwon
Member

Yeah, I think I actually discussed this with @BryanCutler somewhere before; I forget what we ended up with. Bryan, do you remember which one we considered the correct case?

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1') # correct
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2') # incorrect        

I remember we considered __from_dict__ to be mistakenly lost, and that the Row should therefore be treated as a dict, but maybe I'm recalling it wrongly.

@BryanCutler
Member

Ah yes, there are all kinds of inconsistencies in the PySpark Row class. I think this is a duplicate of SPARK-22232 and we discussed in the PR here #20280. The fix there was to also pickle the __from_dict__ flag, but the problem becomes that the input schema in createDataFrame must match field names and no longer goes off of position. At the time, I thought this was the most consistent fix, but I'll have to look at it again.

@BryanCutler
Member

The problem with using a positional schema when Rows are constructed from keyword arguments is that the Row constructor sorts the fields, so any sense of position is lost. For example:

This works

data = [Row(k=i, v=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

This fails

data = [Row(z=i, y=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")

where the only difference is the Row field names, so the matching really isn't positional.
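The sorting behavior described above can be sketched in plain Python (`row_fields` below is an illustration of what the Row constructor does to keyword arguments, not pyspark code): sorting happens to preserve the written order for (k, v) but flips it for (z, y), which is why only the second toDF call fails against the schema " a: int, b: string ".

```python
def row_fields(**kwargs):
    # mimic of Row's kwarg handling: field names come back sorted
    return sorted(kwargs)

# Row(k=..., v=...): sorted order matches the written order, so the
# int lands in a: int and the string in b: string -> conversion works
print(row_fields(k=1, v="1"))   # ['k', 'v']

# Row(z=..., y=...): sorting flips the order, so the string y is
# matched to a: int and the int z to b: string -> conversion fails
print(row_fields(z=1, y="1"))   # ['y', 'z']
```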

@viirya
Member Author

viirya commented May 19, 2019

This is a duplicate issue and there was prior discussion, so I'm closing this.

@viirya viirya closed this May 19, 2019
@viirya viirya deleted the SPARK-27712 branch December 27, 2023 18:36