Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Mode should copy keys before inserting into Map

Why are the changes needed?

the result maybe incorrect:

val df = sc.parallelize(Seq.empty[Int], 4)
    .mapPartitionsWithIndex { (idx, iter) =>
         if (idx == 3) {
            Iterator("3", "3", "3", "3", "4")
         } else {
            Iterator("0", "1", "2", "3", "4")
         }
    }.toDF("a")

  df.select(mode(col("a"))).show
+-------+                                                                       
|mode(a)|
+-------+
|      4|
+-------+

after this fix:

  df.select(mode(col("a"))).show
+-------+                                                                       
|mode(a)|
+-------+
|      3|
+-------+

Does this PR introduce any user-facing change?

No

How was this patch tested?

added UT

@github-actions github-actions bot added the SQL label Oct 25, 2022
@zhengruifeng
Copy link
Contributor Author

will also send a separate fix for PandasMode since it's dedicated for Pandas

@zhengruifeng
Copy link
Contributor Author

cc @cloud-fan @beliefer

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ae79704 Oct 25, 2022
@zhengruifeng
Copy link
Contributor Author

@cloud-fan thanks for the reviews

1 similar comment
@zhengruifeng
Copy link
Contributor Author

@cloud-fan thanks for the reviews

@zhengruifeng zhengruifeng deleted the sql_mode_fix branch October 25, 2022 06:45
HyukjinKwon pushed a commit that referenced this pull request Oct 25, 2022
…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to #38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Oct 25, 2022
…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to apache/spark#38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?
`Mode` should copy keys before inserting into Map

### Why are the changes needed?
the result maybe incorrect:
```
val df = sc.parallelize(Seq.empty[Int], 4)
    .mapPartitionsWithIndex { (idx, iter) =>
         if (idx == 3) {
            Iterator("3", "3", "3", "3", "4")
         } else {
            Iterator("0", "1", "2", "3", "4")
         }
    }.toDF("a")

  df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      4|
+-------+

```

after this fix:
```
  df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      3|
+-------+
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes apache#38383 from zhengruifeng/sql_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to apache#38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes apache#38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to apache/spark#38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to apache/spark#38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants