
Conversation

@xinrong-meng xinrong-meng commented Feb 21, 2023

What changes were proposed in this pull request?

Implement DataFrame.mapInPandas and enable parity tests against vanilla PySpark.

A proto message FrameMap is introduced for mapInPandas and mapInArrow (to be implemented next).

Why are the changes needed?

To reach parity with vanilla PySpark.

Does this PR introduce any user-facing change?

Yes. DataFrame.mapInPandas is supported. An example is shown below.

>>> df = spark.range(2)
>>> def filter_func(iterator):
...   for pdf in iterator:
...     yield pdf[pdf.id == 1]
... 
>>> df.mapInPandas(filter_func, df.schema)
DataFrame[id: bigint]
>>> df.mapInPandas(filter_func, df.schema).show()
+---+                                                                           
| id|
+---+
|  1|
+---+
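The function passed to mapInPandas receives an iterator of pandas DataFrames (one per batch) and yields transformed DataFrames, which Spark concatenates into the result. The iterator contract can be sketched locally with plain pandas, without a Spark session (a simulation for illustration only; `simulate_map_in_pandas` and `batch_size` are assumptions, not part of this PR or of PySpark):

```python
import pandas as pd

def filter_func(iterator):
    # Same function as in the example above: keep rows where id == 1.
    for pdf in iterator:
        yield pdf[pdf.id == 1]

def simulate_map_in_pandas(df, func, batch_size=1):
    # Hypothetical local stand-in for DataFrame.mapInPandas: split the
    # input into batches, pass them to func as an iterator, and
    # concatenate whatever the function yields.
    batches = (df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size))
    return pd.concat(func(batches), ignore_index=True)

df = pd.DataFrame({"id": [0, 1]})  # mirrors spark.range(2)
result = simulate_map_in_pandas(df, filter_func)
print(result)
```

Because the function consumes an iterator rather than a single DataFrame, it can process arbitrarily large inputs one batch at a time, which is what allows Spark to stream Arrow batches to the Python worker.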

How was this patch tested?

Unit tests.

SPARK-41661

@xinrong-meng xinrong-meng marked this pull request as ready for review February 23, 2023 09:39
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-42510][CONNECT][PYTHON] Implement DataFrame.mapInPandas [SPARK-42510][CONNECT][PYTHON] Implement DataFrame.mapInPandas Feb 23, 2023
@xinrong-meng
Member Author

May I get a review, please? @zhengruifeng @HyukjinKwon

Also cc @grundprinzip @ueshin

Contributor

@amaliujia amaliujia left a comment


LGTM!

xinrong-meng added a commit that referenced this pull request Feb 25, 2023
### What changes were proposed in this pull request?
Implement `DataFrame.mapInPandas` and enable parity tests against vanilla PySpark.

A proto message `FrameMap` is introduced for `mapInPandas` and `mapInArrow` (to be implemented next).

### Why are the changes needed?
To reach parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.mapInPandas` is supported. An example is shown below.

```py
>>> df = spark.range(2)
>>> def filter_func(iterator):
...   for pdf in iterator:
...     yield pdf[pdf.id == 1]
...
>>> df.mapInPandas(filter_func, df.schema)
DataFrame[id: bigint]
>>> df.mapInPandas(filter_func, df.schema).show()
+---+
| id|
+---+
|  1|
+---+
```

### How was this patch tested?
Unit tests.

Closes #40104 from xinrong-meng/mapInPandas.

Lead-authored-by: Xinrong Meng <xinrong@apache.org>
Co-authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
(cherry picked from commit 9abccad)
Signed-off-by: Xinrong Meng <xinrong@apache.org>
@xinrong-meng
Copy link
Member Author

Merged to master and branch-3.4, thanks all!

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
