[SPARK-30065][SQL][2.4] DataFrameNaFunctions.drop should handle duplicate columns #27411
(Backport of #26700)
What changes were proposed in this pull request?
`DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified: calling it on a DataFrame that contains duplicate column names fails with an ambiguity error.
The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity.
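A minimal sketch of a setup that exercises this path, assuming Spark 2.4 is on the classpath and a local master is acceptable. The data, the column names, and the object name are illustrative assumptions, not the example from the original description. The call is wrapped in `Try` so the snippet prints the outcome whether `drop` throws or succeeds.

```scala
import scala.util.Try

import org.apache.spark.sql.{DataFrame, SparkSession}

object NaDropDuplicateColumnsSketch {
  // Hypothetical data: both inputs have a "col2" column, so joining on "col1"
  // yields a DataFrame with two columns named "col2".
  def joinedWithDuplicateNames(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left = Seq(("1", Option.empty[String]), ("3", Option("4"))).toDF("col1", "col2")
    val right = Seq(("1", Option("2")), ("3", Option.empty[String])).toDF("col1", "col2")
    left.join(right, Seq("col1"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("na-drop-duplicate-columns")
      .getOrCreate()
    try {
      val df = joinedWithDuplicateNames(spark)
      df.printSchema()
      // No column names are passed, so drop has to consider every column,
      // including the two that share the name "col2".
      println(Try(df.na.drop("any").collect().toSeq))
    } finally {
      spark.stop()
    }
  }
}
```

In this data every joined row contains a null in one of the two `col2` columns, so the rows are candidates for removal once `drop` can address both columns.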
This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still fail due to ambiguity.)
Why are the changes needed?
If column names are not specified, `drop` should not fail due to ambiguity, since it should still be able to apply `drop` to the eligible columns.
Does this PR introduce any user-facing change?
Yes. In the failing case described above, `drop` now succeeds and removes all the rows that contain nulls.
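The difference in behaviour can be illustrated without Spark. The toy model below contrasts looking columns up by name, which cannot disambiguate duplicates, with addressing every column by position. The `Table` type and all three functions are hypothetical names introduced for this sketch; they are not Spark's implementation.

```scala
object ResolutionSketch {
  // Toy stand-in for a table: column names plus rows of optional values (None = null).
  final case class Table(names: Vector[String], rows: Vector[Vector[Option[String]]])

  // Name-based lookup: fails when the name matches zero or several columns.
  def indexByName(t: Table, name: String): Either[String, Int] =
    t.names.zipWithIndex.collect { case (n, i) if n == name => i } match {
      case Vector(i) => Right(i)
      case Vector()  => Left(s"Cannot resolve '$name'")
      case _         => Left(s"Reference '$name' is ambiguous")
    }

  // Drop rows with a null in any of the named columns; lookup errors are propagated.
  def dropAnyByName(t: Table, cols: Seq[String]): Either[String, Table] = {
    val resolved = cols.map(indexByName(t, _))
    resolved.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None =>
        val idx = resolved.collect { case Right(i) => i }
        Right(t.copy(rows = t.rows.filter(r => idx.forall(i => r(i).isDefined))))
    }
  }

  // Drop rows with a null in any column, addressing columns by position,
  // so duplicate names never need to be disambiguated.
  def dropAnyAllColumns(t: Table): Table =
    t.copy(rows = t.rows.filter(_.forall(_.isDefined)))

  def main(args: Array[String]): Unit = {
    val t = Table(
      Vector("col1", "col2", "col2"),
      Vector(
        Vector(Some("1"), None, Some("2")),
        Vector(Some("3"), Some("4"), Some("5"))))
    println(dropAnyByName(t, t.names)) // lookup by name hits the duplicated "col2"
    println(dropAnyAllColumns(t))      // positional filtering needs no name lookup
  }
}
```

Passing an explicit, duplicated name to `dropAnyByName` still returns an error, which mirrors the note that user-specified columns continue to fail due to ambiguity.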
How was this patch tested?
Added new unit tests.