Skip to content

Conversation

@imback82
Copy link
Contributor

(Backport of #26700)

What changes were proposed in this pull request?

DataFrameNaFunctions.drop doesn't handle duplicate columns even when column names are not specified.

val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show

produces

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)

The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity.

This PR updates DataFrameNaFunctions.drop such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying drop to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).

Why are the changes needed?

If column names are not specified, drop should not fail due to ambiguity since it should still be able to apply drop to the eligible columns.

Does this PR introduce any user-facing change?

Yes, now all the rows with nulls are dropped in the above example:

scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+

How was this patch tested?

Added new unit tests.

@imback82
Copy link
Contributor Author

cc @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

Thank you so much again, @imback82 .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending Jenkins).

@SparkQA
Copy link

SparkQA commented Jan 31, 2020

Test build #117613 has finished for PR 27411 at commit 7ea21ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jan 31, 2020
…cate columns

(Backport of #26700)

### What changes were proposed in this pull request?

`DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified.

```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show
```
produces
```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
```
The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity.

This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).

### Why are the changes needed?

If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, now all the rows with nulls are dropped in the above example:
```
scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+
```

### How was this patch tested?

Added new unit tests.

Closes #27411 from imback82/backport-SPARK-30065.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Merged to branch-2.4.

@imback82 imback82 deleted the backport-SPARK-30065 branch January 31, 2020 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants