@amaliujia
Contributor

What changes were proposed in this pull request?

Following up on #38276, this PR improves both the distinct() and dropDuplicates() DataFrame APIs in the Python client, both of which depend on the Deduplicate plan in the Connect proto.
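
To illustrate how the two APIs can share one plan node, here is a minimal, self-contained sketch. It is not the actual Spark Connect client code: `LocalRows`, the in-memory `execute()` methods, and the `column_names` field are assumptions made for illustration. Only `Deduplicate(child=..., all_columns_as_keys=True)` mirrors the call shown in this PR's diff.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


@dataclass
class LocalRows:
    """Hypothetical leaf plan that just holds rows in memory."""
    rows: List[Row]

    def execute(self) -> List[Row]:
        return list(self.rows)


@dataclass
class Deduplicate:
    """Simplified stand-in for the Deduplicate plan node."""
    child: Any
    all_columns_as_keys: bool = False
    column_names: Optional[Sequence[str]] = None  # assumed field name

    def execute(self) -> List[Row]:
        seen = set()
        result = []
        for row in self.child.execute():
            # Use every column as the key, or only the requested subset.
            keys = sorted(row) if self.all_columns_as_keys else list(self.column_names or [])
            key = tuple(row[k] for k in keys)
            if key not in seen:
                seen.add(key)
                result.append(row)  # this sketch keeps the first row seen per key
        return result


class DataFrame:
    """Toy DataFrame that only wraps a plan."""

    def __init__(self, plan: Any) -> None:
        self._plan = plan

    def distinct(self) -> "DataFrame":
        return DataFrame(Deduplicate(child=self._plan, all_columns_as_keys=True))

    def dropDuplicates(self, subset: Optional[Sequence[str]] = None) -> "DataFrame":
        if subset is None:
            # No subset: same plan as distinct().
            return DataFrame(Deduplicate(child=self._plan, all_columns_as_keys=True))
        return DataFrame(Deduplicate(child=self._plan, column_names=list(subset)))

    def collect(self) -> List[Row]:
        return self._plan.execute()


df = DataFrame(LocalRows([
    {"name": "a", "age": 1},
    {"name": "a", "age": 1},
    {"name": "a", "age": 2},
]))
print(df.distinct().collect())
print(df.dropDuplicates(["name"]).collect())
```

In this sketch, dropDuplicates() without a subset builds the same plan as distinct(), while a subset restricts the deduplication keys to the named columns. Which row survives for a given key is an artifact of the sketch, not a guarantee made by Spark.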

Why are the changes needed?

Improve API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

@amaliujia
Contributor Author

R: @zhengruifeng @HyukjinKwon

@amaliujia changed the title from "[SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve deduplicate in Python client" to "[SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve Deduplicate in Python client" on Oct 21, 2022
@AmplabJenkins

Can one of the admins verify this patch?

        """
        if subset is None:
            return DataFrame.withPlan(
                plan.Deduplicate(child=self._plan, all_columns_as_keys=True), session=self._session
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan closed this in 74c8264 on Oct 24, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…on client

### What changes were proposed in this pull request?

Following up on apache#38276, this PR improves both the `distinct()` and `dropDuplicates()` DataFrame APIs in the Python client, both of which depend on the `Deduplicate` plan in the Connect proto.

### Why are the changes needed?

Improve API coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes apache#38327 from amaliujia/python_deduplicate.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>