[SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve Deduplicate in Python client #38327

amaliujia · 2022-10-21T05:26:20Z

What changes were proposed in this pull request?

Following up on #38276, this PR improves both the distinct() and dropDuplicates() DataFrame APIs in the Python client, both of which depend on the Deduplicate plan in the Connect proto.
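
For illustration only, here is a minimal sketch of how both APIs can map onto a single Deduplicate plan node. It is not the code in this PR: the `Read` leaf node, the simplified `DataFrame` wrapper, and the `column_names` field are assumptions made for the example. Only `child` and `all_columns_as_keys` appear in the reviewed diff.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Read:
    """Hypothetical leaf plan node standing in for any child plan."""

    table: str


@dataclass(frozen=True)
class Deduplicate:
    """Toy stand-in for the Deduplicate plan node.

    `child` and `all_columns_as_keys` mirror the reviewed diff;
    `column_names` is an assumed field name for the subset case.
    """

    child: object
    all_columns_as_keys: bool = False
    column_names: Tuple[str, ...] = ()


class DataFrame:
    """Simplified wrapper that only records the logical plan."""

    def __init__(self, plan: object) -> None:
        self._plan = plan

    def distinct(self) -> "DataFrame":
        # Every column is part of the deduplication key.
        return DataFrame(Deduplicate(child=self._plan, all_columns_as_keys=True))

    def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
        if subset is None:
            # No subset given: same plan as distinct().
            return DataFrame(Deduplicate(child=self._plan, all_columns_as_keys=True))
        # Subset given: only the listed columns form the key.
        return DataFrame(Deduplicate(child=self._plan, column_names=tuple(subset)))


df = DataFrame(Read("t"))
same_plan = df.distinct()._plan == df.dropDuplicates()._plan
subset_plan = df.dropDuplicates(["a", "b"])._plan
```

In this sketch, `distinct()` and `dropDuplicates()` with no argument build identical plans, while `dropDuplicates(["a", "b"])` records only the named columns as keys.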

Why are the changes needed?

Improve API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

amaliujia · 2022-10-21T05:26:32Z

R: @zhengruifeng @HyukjinKwon

AmplabJenkins · 2022-10-21T15:03:35Z

Can one of the admins verify this patch?

HyukjinKwon · 2022-10-24T02:38:38Z

python/pyspark/sql/connect/dataframe.py

+        """
+        if subset is None:
+            return DataFrame.withPlan(
+                plan.Deduplicate(child=self._plan, all_columns_as_keys=True), session=self._session


cc @cloud-fan

cloud-fan · 2022-10-24T02:51:05Z

thanks, merging to master!

…on client

### What changes were proposed in this pull request?

Following up on apache#38276, this PR improves both the `distinct()` and `dropDuplicates()` DataFrame APIs in the Python client, both of which depend on the `Deduplicate` plan in the Connect proto.

### Why are the changes needed?

Improve API coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes apache#38327 from amaliujia/python_deduplicate.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

amaliujia changed the title ~~[SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve deduplicate in Python client~~ [SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve Deduplicate in Python client Oct 21, 2022

github-actions bot added CONNECT CORE PYTHON SQL labels Oct 21, 2022

[SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve deduplicate in Python client.

f218c70

amaliujia force-pushed the python_deduplicate branch from 10889b5 to f218c70 on October 21, 2022 19:02

HyukjinKwon approved these changes Oct 24, 2022

HyukjinKwon reviewed Oct 24, 2022

cloud-fan approved these changes Oct 24, 2022

cloud-fan closed this in 74c8264 Oct 24, 2022
