# [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns #26593
## Conversation
cc: @cloud-fan

Test build #114071 has finished for PR 26593 at commit
Test build #114118 has finished for PR 26593 at commit

Test build #114120 has finished for PR 26593 at commit

Test build #114121 has finished for PR 26593 at commit
```scala
private def toAttributes(cols: Seq[String]): Seq[Attribute] = {
  cols.map(df.col(_).named.toAttribute)
}
```
we should be more strict:

```scala
cols.map(resolve).map {
  case a: Attribute => a
  case _ => fail("is not a top-level column")
}
```
Changed as suggested. One naive question: do we need to be more restrictive here even if we use the result of `df.col(_)`, which uses the `outputAttributes` of the plan?
`df.col` may return a nested field, which is not an attribute.
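To make this concrete, a small illustration (a hypothetical spark-shell snippet of my own, not code from this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.functions.struct
import spark.implicits._

// Hypothetical DataFrame with one struct column a: struct<b: bigint>.
val df = spark.range(1).select(struct($"id".as("b")).as("a"))

df.col("a").expr.isInstanceOf[Attribute]   // true: a top-level column
df.col("a.b").expr.isInstanceOf[Attribute] // false: an extraction into the struct
```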
```scala
    }
  }
  cols.map(resolve)
}
```
@cloud-fan I noticed that `drop()` is resolving column names with `cols.map(name => df.resolve(name))` instead of `df.col(name)`. The difference (other than the return type) is that `df.col()` will try to resolve using regex and add metadata. Do you think we need to make this consistent?
We should. To avoid a breaking change, I think we should change `drop` to follow `fill` and make it more powerful.
OK. I will do this in a separate PR, which will address the issue with handling duplicate columns in `drop` when no columns are specified.
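For context, a hedged sketch of the difference being discussed (behavior of `Dataset.col` as I understand it; the config name and outputs here are my assumptions, not taken from this PR):

```scala
val df = Seq((1, "a")).toDF("c1", "c2")

// Dataset.col has special handling for "*":
df.select(df.col("*")).columns // Array(c1, c2)

// With quoted-regex resolution enabled, backticked names are treated as regexes:
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", true)
df.select(df.col("`c.*`")).columns // Array(c1, c2): both names match the regex

// The internal Dataset.resolve(name) does a plain name lookup only (and is
// private[sql], so it is not callable from user code anyway).
```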
@dongjoon-hyun I think this PR may have been mislabelled. Thanks!

Test build #114178 has finished for PR 26593 at commit
```scala
def resolve(colName: String): Attribute = {
  df.col(colName).named.toAttribute match {
    case a: Attribute => a
    case _ => throw new IllegalArgumentException(
      s"'$colName' is not a top level column.")
  }
}
```
Can we add a test for this branch? The code seems not to work as expected: `toAttribute` always returns an `Attribute`, so we never hit this branch.
(Let me merge the different threads here.)

The previous implementation resolved the column names as follows:
- Check the column names against the schema (`df.schema.fields`).
- Only the column names that matched the top-level fields were used for `df.col`.

So, if we want to keep the previous behavior of fill (for handling `*`, etc.), we first need to check the column names against the schema, then do `df.col(colName).named.toAttribute`.

Then, a nested field will not pass the schema check (since it's checked against the top-level fields), and `toAttribute` will always return an `Attribute`. So I am not sure how this branch can be tested. What do you think?
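Roughly, the previous filtering can be pictured like this (my paraphrase of the behavior described above, not the actual diff):

```scala
// Candidate columns come only from the top-level schema; a nested name like
// "a.b" never equals a top-level field name, so it is silently dropped.
val columnEquals = df.sparkSession.sessionState.analyzer.resolver
val candidates = df.schema.fields.filter { f =>
  cols.exists(col => columnEquals(f.name, col))
}
```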
Yea, we should call `df.col` to handle `*` etc. But we shouldn't call `.named.toAttribute`, which turns everything into an attribute and makes us unable to detect nested fields.
Can we add a test to fill nested fields and see the result?
The nested fields are not supported since the name is first checked against `df.schema.fields`, which contains the top-level fields, so nested fields do not become candidates for fill. `*` is handled the same way and is ignored. I can capture this behavior in the unit test.

Also, what's the best way to convert `Column` to `Attribute`? Now that we have a single point of entry, `def fillValue[T](value: T, cols: Seq[Attribute])`, we need to convert the `Column` (result of `df.col`) to an `Attribute`.
Yes, nested fields should be ignored; we should have a test to verify it. How about:

```scala
cols.map(name => df.col(name).expr).collect {
  case a: Attribute => a
}
```
great, thanks!
Test build #114277 has finished for PR 26593 at commit

retest this please
```scala
private def toAttributes(cols: Seq[String]): Seq[Attribute] = {
  cols.flatMap { colName =>
    df.col(colName).expr.collect {
      case a: Attribute => a
    }
  }
}
```
Is this correct? If `df.col` returns a nested field `GetStructField("a", Attribute("b"))`, then we will return `Attribute("b")`, which is unexpected.
So this works, but we can make it clearer:

```scala
cols.map(name => df.col(name).expr).collect {
  case a: Attribute => a
}
```

Since we know that struct-type columns will be ignored later, we don't need to collect them here.
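The subtlety, as I understand it: `Expression.collect` traverses the whole expression tree, while `Seq.collect` only tests each resolved expression at the top level. A hedged sketch of the difference:

```scala
// For a nested name, df.col("a.b").expr is roughly Alias(GetStructField(a#1, ...), "b").
// Traversing that tree picks up the struct attribute a#1, which is only
// discarded later by the fill type check (its type is StructType):
cols.flatMap(name => df.col(name).expr.collect { case a: Attribute => a })

// Testing only the top-level expression keeps real columns and drops nested
// fields immediately:
cols.map(name => df.col(name).expr).collect { case a: Attribute => a }
```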
You are absolutely right. Thanks for the suggestion!
```scala
  Row(null) :: Row("b1") :: Nil)

// Nested columns are ignored for fill().
checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), data)
```
We can expose the bug if the data is:

```scala
val data = Seq(
  Row(Row(null, "a2")),
  Row(Row("b1", "b2")),
  Row(null))
```
This example works fine, and you also need to update the previous line to:

```scala
checkAnswer(df.select("c1.c1-1"),
  Row(null) :: Row("b1") :: Row(null) :: Nil)
```

(This is just to illustrate the nested type, but I can remove it if you think it's confusing.)

The reason the nested types are ignored is the following check:

```scala
case (NumericType, dt) => dt.isInstanceOf[NumericType]
case (StringType, dt) => dt == StringType
case (BooleanType, dt) => dt == BooleanType
```

The datatype for the nested column that is resolved to an `Attribute` is `StructType`, so this will not be matched.
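For reference, the surrounding logic (a hedged paraphrase of `fillValue` from memory of the Spark source, not this PR's exact diff): a `targetType` is derived from the fill value, then matched against each candidate column's type.

```scala
// A string fill value maps to targetType = StringType; a numeric fill value
// maps to the NumericType abstract type, and so on. A StructType column
// matches none of the cases, so struct columns are left untouched.
val typeMatches = (targetType, col.dataType) match {
  case (NumericType, dt) => dt.isInstanceOf[NumericType]
  case (StringType, dt) => dt == StringType
  case (BooleanType, dt) => dt == BooleanType
  case _ => false // simplified fallback; my assumption, not the exact code
}
```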
Test build #114285 has finished for PR 26593 at commit

Test build #114376 has finished for PR 26593 at commit

retest this please

Test build #114388 has finished for PR 26593 at commit
thanks, merging to master!

Hi, All.

Sure. Do you want me to handle back-porting this to

Thank you, @imback82. It would be great.
We usually make another PR against

Also, cc @PavithraRamachandran

Got it. I will work on this today.
…cate columns (Backport of #26593)

### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.

```scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```

produces

```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```

The reason for the above failure is that columns are looked up with `Dataset.col()`, which tries to resolve a column by name and, if there are multiple columns with the same name, fails due to ambiguity.

This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it resolves the ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still fail due to ambiguity.)

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity, since it should still be able to apply `fill` to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, now the above example displays the following:

```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+
```

### How was this patch tested?

Added new unit tests.

Closes #27407 from imback82/backport-SPARK-29890.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
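Putting the thread together, the resulting shape of the fix (a hedged reconstruction from the discussion above, not the verbatim merged diff):

```scala
// With no columns given, fill works directly on the plan's output attributes,
// so duplicate names never go through name-based resolution:
def fill(value: String): DataFrame = fillValue(value, outputAttributes)

// With explicit names, each is resolved via df.col; only top-level attributes
// survive the collect, so nested fields are ignored (and ambiguous duplicate
// names still fail inside df.col):
def fill(value: String, cols: Seq[String]): DataFrame =
  fillValue(value, toAttributes(cols))

private def toAttributes(cols: Seq[String]): Seq[Attribute] =
  cols.map(name => df.col(name).expr).collect { case a: Attribute => a }
```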