[SPARK-28897][SQL] 'coalesce' error when executing dataframe.na.fill #27392
ok to test

Hi, @PavithraRamachandran. Thank you for making a PR. Could you open this to

@dongjoon-hyun This issue is not present in master. It was fixed by implementation changes made for https://issues.apache.org/jira/browse/SPARK-29890

Then, can we backport that? We want to minimize implementation differences.

Did you ping on the JIRA or that PR? You should do that first.

I pinged on the above JIRA and was working on it. The implementation change made to resolve SPARK-29890 also fixed this issue in master. I was not sure whether all of the changes made for SPARK-29890 are needed in Spark 2.4, so I raised my own fix. If we can backport SPARK-29890, then we can close this PR and also close the above JIRA through the backport.

Sorry, but are you sure? I cannot find your comment on https://issues.apache.org/jira/browse/SPARK-29890 .

First of all, we need to close SPARK-28897 as a duplicate of SPARK-29890. Then, we need to ask for a backport. That's the way.

I understand your feeling, but we prefer to have a consistent JIRA and patch for the same issue across the different branches. BTW, we don't backport everything. Since you asked, I'm pinging on the SPARK-28897 PR. Let's see.
What changes were proposed in this pull request?
Root Cause:
When a DataFrame is created from a select statement with spark.sql.parser.quotedRegexColumnNames=true and dataframe.na.fill is then called, fillCol in DataFrameNaFunctions explicitly wraps each column name in backticks (`). Because of the backticks, the column name is mistaken for a regex and is turned into an UnresolvedRegex, which causes the coalesce expression to fail to resolve.
Observation
When the DataFrame is created from a select statement that uses a regex, the regex filter has already been applied and valid column names are returned, so adding backticks to the column name in this flow is unnecessary. To check the impact, select statements with regexes were run; executing without the backticks had no impact.
After Fix
When the column name is passed to the DataFrame's column method, backticks (`) are no longer added, because the value received is a valid column name rather than a regular expression.
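The root cause and the fix can be illustrated with a small, self-contained Python model. This is only a sketch and not Spark's code: `resolve_column` and `fill_col` are hypothetical stand-ins for the DataFrame column lookup and for fillCol, the regex pattern is an assumed simplification of how a backtick-quoted name is recognized, and raising an error stands in for the coalesce expression failing to resolve.

```python
import re

# Assumed simplification: a name wrapped entirely in backticks is "quoted".
ESCAPED_IDENTIFIER = re.compile(r"`(.+)`")


def resolve_column(col_name, columns, quoted_regex_enabled):
    """Hypothetical stand-in for the DataFrame column lookup."""
    match = ESCAPED_IDENTIFIER.fullmatch(col_name)
    if quoted_regex_enabled and match:
        # With the quoted-regex option on, a backtick-quoted name is
        # treated as a regex rather than as a concrete column.
        return ("unresolved_regex", match.group(1))
    plain = match.group(1) if match else col_name
    if plain not in columns:
        raise KeyError(plain)
    return ("resolved_column", plain)


def fill_col(name, columns, quoted_regex_enabled, add_backticks):
    """Hypothetical stand-in for fillCol; add_backticks=True models the old behaviour."""
    ref = "`" + name + "`" if add_backticks else name
    kind, value = resolve_column(ref, columns, quoted_regex_enabled)
    if kind != "resolved_column":
        # Stand-in for coalesce failing to resolve over a regex reference.
        raise ValueError(f"cannot build coalesce over {kind} {value!r}")
    return f"coalesce({value}, <replacement>) AS {name}"


if __name__ == "__main__":
    cols = ["id", "name"]
    # Without added backticks the name resolves even with the option on.
    print(fill_col("id", cols, quoted_regex_enabled=True, add_backticks=False))
    try:
        # With added backticks and the option on, the name becomes a regex.
        fill_col("id", cols, quoted_regex_enabled=True, add_backticks=True)
    except ValueError as err:
        print("fails:", err)
```

In this model the failure appears only when the option is enabled and backticks are added, which matches the combination described in the root cause.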
Why are the changes needed?
With this change, the column name is no longer treated as a regex, so the proper Column function is used and the expression no longer fails to resolve.
Does this PR introduce any user-facing change?
NA
How was this patch tested?
unit test