[SPARK-30065][SQL][2.4] DataFrameNaFunctions.drop should handle duplicate columns #27411

imback82 · 2020-01-31T03:42:59Z

(Backport of #26700)

What changes were proposed in this pull request?

DataFrameNaFunctions.drop doesn't handle duplicate columns even when column names are not specified.

val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show

produces

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)

The failure occurs because drop resolves columns by name, and name-based resolution is ambiguous when several columns share the same name.
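To make the ambiguity concrete, here is a minimal sketch that reproduces it outside of na.drop, using the same data as the example above. The names `AmbiguousReferenceDemo`, `build`, and `selectByNameFails`, as well as the local `SparkSession` settings, are assumptions made for illustration and are not part of this PR.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import scala.util.{Failure, Try}

// Illustrative only: object and method names are hypothetical.
object AmbiguousReferenceDemo {
  // Builds the two inputs and the joined DataFrame from the PR description.
  def build(spark: SparkSession): (DataFrame, DataFrame, DataFrame) = {
    import spark.implicits._
    val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
    val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
    (left, right, left.join(right, Seq("col1")))
  }

  // True if resolving "col2" by name is rejected with an AnalysisException.
  def selectByNameFails(df: DataFrame): Boolean =
    Try(df.select("col2")) match {
      case Failure(_: AnalysisException) => true
      case _ => false
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ambiguous-reference-demo")
      .getOrCreate()
    try {
      val (left, right, df) = build(spark)
      println(s"select by name fails: ${selectByNameFails(df)}")
      // Referring to col2 through the parent DataFrames pins each reference
      // to one specific attribute, so no lookup by name on the joined schema is needed.
      df.select(left("col2"), right("col2")).show()
    } finally {
      spark.stop()
    }
  }
}
```

The point of the contrast is that the joined DataFrame is perfectly valid; only the lookup by name is ambiguous.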

This PR updates DataFrameNaFunctions.drop so that, when no column names are specified, it avoids the ambiguity by applying drop to all eligible columns. (If the user specifies column names, drop still fails due to ambiguity.)
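One way to picture the idea is to build the filter from the DataFrame's already-resolved output attributes instead of looking columns up by name. The sketch below is a rough user-level approximation and not the code in this patch: `dropAnyNull` and `example` are hypothetical helpers, the filter only checks for nulls (the built-in drop also treats NaN in numeric columns as missing), and it assumes a Spark 2.4/3.x build where a `Column` can be constructed from a Catalyst expression.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Illustrative only: these helpers are hypothetical and not part of Spark.
object DropAnyByAttribute {
  // Keeps only rows in which every output column is non-null.
  // Columns are taken from the analyzed plan's output attributes, which carry
  // unique expression IDs, so duplicate names do not cause ambiguity.
  def dropAnyNull(df: DataFrame): DataFrame = {
    val cols: Seq[Column] =
      df.queryExecution.analyzed.output.map(attr => new Column(attr))
    if (cols.isEmpty) df
    else df.filter(cols.map(_.isNotNull).reduce(_ && _))
  }

  // The joined DataFrame from the PR description.
  def example(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
    val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
    left.join(right, Seq("col1"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-any-by-attribute")
      .getOrCreate()
    try {
      dropAnyNull(example(spark)).show()
    } finally {
      spark.stop()
    }
  }
}
```

Because each attribute is referenced directly rather than by name, the two col2 columns are checked independently.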

Why are the changes needed?

If column names are not specified, drop should not fail due to ambiguity since it should still be able to apply drop to the eligible columns.

Does this PR introduce any user-facing change?

Yes. In the above example, drop no longer throws, and all the rows with nulls are dropped (here, both rows):

scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+
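The two cases described in this PR (no column names versus explicit column names) can be checked side by side. This is a sketch under assumptions: `NaDropDuplicateColumnsCheck` and its methods are hypothetical names, and the first call is only expected to succeed on a build that includes this change.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import scala.util.{Failure, Try}

// Illustrative only: object and method names are hypothetical.
object NaDropDuplicateColumnsCheck {
  // The joined DataFrame from the PR description.
  def joined(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
    val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
    left.join(right, Seq("col1"))
  }

  // Row count after na.drop("any") with no column names.
  // Expected to be a Failure on builds without this change.
  def dropWithoutNames(df: DataFrame): Try[Long] =
    Try(df.na.drop("any").count())

  // True if naming the duplicated column is rejected as ambiguous.
  def dropWithNamesIsAmbiguous(df: DataFrame): Boolean =
    Try(df.na.drop("any", Seq("col2"))) match {
      case Failure(_: AnalysisException) => true
      case _ => false
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("na-drop-duplicate-columns")
      .getOrCreate()
    try {
      val df = joined(spark)
      println(s"drop without names: ${dropWithoutNames(df)}")
      println(s"drop with names is ambiguous: ${dropWithNamesIsAmbiguous(df)}")
    } finally {
      spark.stop()
    }
  }
}
```

Wrapping the calls in `Try` keeps the check usable on both patched and unpatched builds, since only the no-names case changes behavior.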

How was this patch tested?

Added new unit tests.

imback82 · 2020-01-31T03:43:29Z

cc @dongjoon-hyun

dongjoon-hyun · 2020-01-31T04:24:03Z

Thank you so much again, @imback82.

dongjoon-hyun

+1, LGTM. (Pending Jenkins).

SparkQA · 2020-01-31T07:01:10Z

Test build #117613 has finished for PR 27411 at commit 7ea21ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…cate columns

(Backport of #26700)

Closes #27411 from imback82/backport-SPARK-30065.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2020-01-31T07:02:19Z

Merged to branch-2.4.

initial commit

7ea21ef

dongjoon-hyun added the SQL label Jan 31, 2020

dongjoon-hyun approved these changes Jan 31, 2020

dongjoon-hyun closed this Jan 31, 2020

imback82 deleted the backport-SPARK-30065 branch January 31, 2020 07:03
