
Conversation

@xinrong-meng xinrong-meng commented Feb 21, 2023

What changes were proposed in this pull request?

Implement DataFrame.mapInPandas and enable parity tests against vanilla PySpark.

A proto message FrameMap is introduced for mapInPandas and mapInArrow (to be implemented next).

Why are the changes needed?

To reach parity with vanilla PySpark.

Does this PR introduce any user-facing change?

Yes. DataFrame.mapInPandas is supported. An example is shown below.

>>> df = spark.range(2)
>>> def filter_func(iterator):
...   for pdf in iterator:
...     yield pdf[pdf.id == 1]
... 
>>> df.mapInPandas(filter_func, df.schema)
DataFrame[id: bigint]
>>> df.mapInPandas(filter_func, df.schema).show()
+---+                                                                           
| id|
+---+
|  1|
+---+
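The function passed to mapInPandas receives an iterator of pandas DataFrames (one per batch) and yields transformed DataFrames, which Spark concatenates into the result. The iterator contract can be sketched locally with plain pandas, without a Spark session (a simulation for illustration only; `simulate_map_in_pandas` and `batch_size` are assumptions, not part of this PR or of PySpark):

```python
import pandas as pd

def filter_func(iterator):
    # Same function as in the example above: keep rows where id == 1.
    for pdf in iterator:
        yield pdf[pdf.id == 1]

def simulate_map_in_pandas(df, func, batch_size=1):
    # Hypothetical local stand-in for DataFrame.mapInPandas: split the
    # input into batches, pass them to func as an iterator, and
    # concatenate whatever the function yields.
    batches = (df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size))
    return pd.concat(func(batches), ignore_index=True)

df = pd.DataFrame({"id": [0, 1]})  # mirrors spark.range(2)
result = simulate_map_in_pandas(df, filter_func)
print(result)
```

Because the function consumes an iterator rather than a single DataFrame, it can process arbitrarily large inputs one batch at a time, which is what allows Spark to stream Arrow batches to the Python worker.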

How was this patch tested?

Unit tests.

SPARK-41661

@xinrong-meng xinrong-meng marked this pull request as ready for review February 23, 2023 09:39
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-42510][CONNECT][PYTHON] Implement DataFrame.mapInPandas [SPARK-42510][CONNECT][PYTHON] Implement DataFrame.mapInPandas Feb 23, 2023
@xinrong-meng
Member Author

May I get a review, please? @zhengruifeng @HyukjinKwon

Also cc @grundprinzip @ueshin

Contributor

@amaliujia amaliujia left a comment


LGTM!

xinrong-meng added a commit that referenced this pull request Feb 25, 2023
### What changes were proposed in this pull request?
Implement `DataFrame.mapInPandas` and enable parity tests against vanilla PySpark.

A proto message `FrameMap` is introduced for `mapInPandas` and `mapInArrow` (to be implemented next).

### Why are the changes needed?
To reach parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.mapInPandas` is supported. An example is shown below.

```py
>>> df = spark.range(2)
>>> def filter_func(iterator):
...   for pdf in iterator:
...     yield pdf[pdf.id == 1]
...
>>> df.mapInPandas(filter_func, df.schema)
DataFrame[id: bigint]
>>> df.mapInPandas(filter_func, df.schema).show()
+---+
| id|
+---+
|  1|
+---+
```

### How was this patch tested?
Unit tests.

Closes #40104 from xinrong-meng/mapInPandas.

Lead-authored-by: Xinrong Meng <xinrong@apache.org>
Co-authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
(cherry picked from commit 9abccad)
Signed-off-by: Xinrong Meng <xinrong@apache.org>
@xinrong-meng
Copy link
Member Author

Merged to master and branch-3.4, thanks all!

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
