[SPARK-31186][PySpark][SQL] toPandas should not fail on duplicate column names #28025

viirya · 2020-03-25T20:40:39Z

What changes were proposed in this pull request?

When the toPandas API is called on a DataFrame with duplicate column names, such as one produced by a join, it fails with an error like:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This patch fixes the error in toPandas API.
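
The message above is the one pandas raises when a Series with more than one element is used in a boolean context. The following is a minimal sketch in plain pandas of one way that can happen with duplicate names. It assumes the failure comes from selecting a duplicated name, which the description does not spell out, and the frame and variable names are illustrative only.

```python
import pandas as pd

# Two columns share the name "v", as can happen after a join.
pdf = pd.DataFrame([[1, 2], [3, 4]], columns=["v", "v"])

# Selecting a duplicated name returns a DataFrame holding both columns,
# not a single Series.
selected = pdf["v"]

# isnull().any() therefore yields one boolean per column (a Series).
has_null = selected.isnull().any()

try:
    if has_null:  # truth-testing a multi-element Series raises ValueError
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Addressing columns by position rather than by name sidesteps this, since a position always identifies exactly one column.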

Why are the changes needed?

To make toPandas work on DataFrames with duplicate column names.

Does this PR introduce any user-facing change?

Yes. Previously, calling toPandas on a DataFrame with duplicate column names failed. After this patch, it produces the correct result.

How was this patch tested?

Unit test.

SparkQA · 2020-03-25T20:48:06Z

Test build #120374 has finished for PR 28025 at commit 60fbcf8.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-25T22:12:20Z

Test build #120375 has finished for PR 28025 at commit 6b9d6d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-03-25T22:20:54Z

cc @HyukjinKwon

python/pyspark/sql/pandas/conversion.py

HyukjinKwon · 2020-03-26T01:45:14Z

Looks good. A couple of questions.

SparkQA · 2020-03-26T05:44:26Z

Test build #120394 has finished for PR 28025 at commit 536107e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

LGTM cc @BryanCutler @ueshin

BryanCutler

Looks good, just had a couple questions

BryanCutler · 2020-03-26T16:51:30Z

python/pyspark/sql/pandas/conversion.py

-        dtype = {}
-        for field in self.schema:
+        dtype = [None] * len(self.schema)
+        for fieldIdx in range(len(self.schema)):


Would it be better to use enumerate here?
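
For illustration, here is a small pure-Python sketch of the two ideas in this hunk: why a dict keyed by column name loses entries when names repeat, and how enumerate can replace range(len(...)) when filling a positional list. The `schema` list of (name, type) pairs is a hypothetical stand-in for `self.schema`, not the real Spark object.

```python
# Hypothetical stand-in for a Spark schema: (column name, target dtype) pairs.
schema = [("id", "int64"), ("v", "float64"), ("v", "object")]

# Keyed by name: the second "v" silently overwrites the first.
dtype_by_name = {}
for name, t in schema:
    dtype_by_name[name] = t

# Keyed by position: every column keeps its own entry.
# enumerate yields the index and the field together.
dtype_by_pos = [None] * len(schema)
for idx, (name, t) in enumerate(schema):
    dtype_by_pos[idx] = t

print(dtype_by_name)
print(dtype_by_pos)
```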

BryanCutler · 2020-03-26T16:53:10Z

python/pyspark/sql/pandas/conversion.py

+                series = pdf.iloc[:, index].astype(t, copy=False)
+            else:
+                series = pdf.iloc[:, index]
+            df.insert(index, self.schema[index].name, series, allow_duplicates=True)


Does this make a copy of the data? Seems to go into a make_block method, but I can't tell for sure if that is doing an allocation

It looks like it does. insert calls _sanitize_column, which makes a copy of the data.

But pdf.iloc[:, index] = pdf.iloc[:, index].astype(t, copy=False) doesn't work, as I replied earlier to @HyukjinKwon. It looks like whether iloc returns a view or a copy may depend on the context.
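
As a rough sketch of the insert-based approach discussed in this thread, the snippet below rebuilds a frame column by column with `DataFrame.insert(..., allow_duplicates=True)`. It is not the code from conversion.py: the `target_types` list and the sample data are assumptions made for the example, and `astype` is called without the `copy` argument to keep the sketch independent of pandas version.

```python
import pandas as pd

# Source frame with a duplicated column name.
pdf = pd.DataFrame([[1, 2.5], [3, 4.5]], columns=["v", "v"])

# Hypothetical per-position target types; None means "leave as is".
target_types = ["int32", None]

df = pd.DataFrame()
for index, t in enumerate(target_types):
    # Positional access always returns exactly one column.
    series = pdf.iloc[:, index]
    if t is not None:
        series = series.astype(t)
    # allow_duplicates=True lets the same name be inserted more than once.
    df.insert(index, pdf.columns[index], series, allow_duplicates=True)

print(df.dtypes)
```

Whether each step copies the underlying data depends on pandas internals, as noted above, so the sketch makes no claim about allocation.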

SparkQA · 2020-03-27T01:33:29Z

Test build #120445 has finished for PR 28025 at commit 4270bcc.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-27T02:14:42Z

Test build #120446 has finished for PR 28025 at commit b8e69e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-27T02:43:49Z

Test build #120447 has finished for PR 28025 at commit 1cf1f12.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-03-27T03:10:32Z

Merged to master, and branch-3.0.

…umn names

### What changes were proposed in this pull request?

When `toPandas` API works on duplicate column names produced from operators like join, we see the error like:

```
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

This patch fixes the error in `toPandas` API.

### Why are the changes needed?

To make `toPandas` work on dataframe with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result.

### How was this patch tested?

Unit test.

Closes #28025 from viirya/SPARK-31186.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 559d3e4)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

viirya · 2020-03-27T04:30:34Z

Thanks @HyukjinKwon @BryanCutler


Deal with duplicate column names.

60fbcf8

Fix style.

6b9d6d6

dongjoon-hyun added PYSPARK SQL labels Mar 25, 2020

HyukjinKwon reviewed Mar 26, 2020


HyukjinKwon reviewed Mar 26, 2020


For comment.

536107e

HyukjinKwon approved these changes Mar 26, 2020

BryanCutler reviewed Mar 26, 2020

Avoid using insert for non-duplicate column names.

b8e69e0

viirya force-pushed the SPARK-31186 branch from 4270bcc to b8e69e0 Compare March 27, 2020 01:37

Add some comments to explain it.

1cf1f12

HyukjinKwon closed this in 559d3e4 Mar 27, 2020

gatorsmile mentioned this pull request Apr 19, 2020

[SPARK-31441] Support duplicated column names for toPandas with arrow execution. #28210

Closed

viirya deleted the SPARK-31186 branch December 27, 2023 18:23


5 participants
