[SPARK-40906][SQL] `Mode` should copy keys before inserting into Map #38383

zhengruifeng · 2022-10-25T03:57:06Z

What changes were proposed in this pull request?

`Mode` should copy keys before inserting them into its Map.

Why are the changes needed?

Without the copy, the result may be incorrect:

val df = sc.parallelize(Seq.empty[Int], 4)
  .mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 3) {
      Iterator("3", "3", "3", "3", "4")
    } else {
      Iterator("0", "1", "2", "3", "4")
    }
  }.toDF("a")

df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      4|
+-------+

After this fix:

df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      3|
+-------+
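
The symptom above is consistent with a map key that aliases a value which is overwritten as later rows are read. The following is a minimal, self-contained sketch of that general hazard in plain Scala; it is not Spark's implementation. `ReusableKey`, `countValues`, and `mode` are hypothetical names introduced only for illustration, and the assumption that the key shares storage with later rows is ours, not something stated in this PR.

```scala
import scala.collection.mutable

object KeyCopySketch {
  // Hypothetical stand-in for a value whose contents are replaced for every
  // row that is read. Equality and hashing follow the current contents.
  final class ReusableKey(var bytes: Array[Byte]) {
    override def hashCode: Int = java.util.Arrays.hashCode(bytes)
    override def equals(other: Any): Boolean = other match {
      case k: ReusableKey => java.util.Arrays.equals(bytes, k.bytes)
      case _ => false
    }
    // Detach the key from the shared value by cloning its bytes.
    def copy(): ReusableKey = new ReusableKey(bytes.clone())
    override def toString: String = new String(bytes, "UTF-8")
  }

  // Count occurrences of each row value, optionally copying the key first.
  def countValues(rows: Seq[String], copyKeys: Boolean): Map[String, Long] = {
    val counts = mutable.HashMap.empty[ReusableKey, Long]
    val shared = new ReusableKey(Array.emptyByteArray)
    rows.foreach { row =>
      // Simulate the reader refilling the one shared value for each row.
      shared.bytes = row.getBytes("UTF-8")
      val key = if (copyKeys) shared.copy() else shared
      counts.update(key, counts.getOrElse(key, 0L) + 1L)
    }
    // Render keys as strings; aliased keys all read as the last row seen.
    counts.iterator.map { case (k, n) => k.toString -> n }.toMap
  }

  // The mode is the key with the highest count.
  def mode(counts: Map[String, Long]): String = counts.maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    val rows = Seq("3", "3", "3", "3", "4")
    println(mode(countValues(rows, copyKeys = false))) // prints "4"
    println(mode(countValues(rows, copyKeys = true)))  // prints "3"
  }
}
```

Without the copy, every entry in the map refers to the same object, so once the last row is read all stored keys read as "4". Copying the key at insertion time keeps each entry independent of later rows, at the cost of one allocation per inserted key.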

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

zhengruifeng · 2022-10-25T03:58:48Z

I will also send a separate fix for `PandasMode`, since it is dedicated to the Pandas API.

zhengruifeng · 2022-10-25T06:02:05Z

cc @cloud-fan @beliefer

cloud-fan

good catch!

cloud-fan · 2022-10-25T06:22:48Z

thanks, merging to master!

zhengruifeng · 2022-10-25T06:45:31Z

@cloud-fan thanks for the reviews

zhengruifeng · 2022-10-25T06:45:36Z

@cloud-fan thanks for the reviews

…into Map

### What changes were proposed in this pull request?
Make `PandasMode` copy keys before inserting into Map

### Why are the changes needed?
correctness issue similar to #38383, make it a separate PR since it is dedicated for Pandas API

```
In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]:
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]:
  value
0     3
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>


### What changes were proposed in this pull request?
`Mode` should copy keys before inserting into Map

### Why are the changes needed?
the result maybe incorrect:

```
val df = sc.parallelize(Seq.empty[Int], 4)
  .mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 3) {
      Iterator("3", "3", "3", "3", "4")
    } else {
      Iterator("0", "1", "2", "3", "4")
    }
  }.toDF("a")

df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      4|
+-------+
```

after this fix:

```
df.select(mode(col("a"))).show
+-------+
|mode(a)|
+-------+
|      3|
+-------+
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes apache#38383 from zhengruifeng/sql_mode_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


fix

a3dcc87

github-actions bot added the SQL label Oct 25, 2022

cloud-fan approved these changes Oct 25, 2022


cloud-fan closed this in ae79704 Oct 25, 2022

zhengruifeng deleted the sql_mode_fix branch October 25, 2022 06:45

zhengruifeng mentioned this pull request Oct 25, 2022

[SPARK-40907][PS][SQL] PandasMode should copy keys before inserting into Map #38385

Closed
