Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Mar 25, 2020

What changes were proposed in this pull request?

When toPandas API works on duplicate column names produced from operators like join, we see the error like:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This patch fixes the error in toPandas API.

Why are the changes needed?

To make toPandas work on dataframe with duplicate column names.

Does this PR introduce any user-facing change?

Yes. Previously calling toPandas API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result.

How was this patch tested?

Unit test.

@SparkQA
Copy link

SparkQA commented Mar 25, 2020

Test build #120374 has finished for PR 28025 at commit 60fbcf8.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2020

Test build #120375 has finished for PR 28025 at commit 6b9d6d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 25, 2020

cc @HyukjinKwon

@HyukjinKwon
Copy link
Member

Looks good. a couple of questions.

@SparkQA
Copy link

SparkQA commented Mar 26, 2020

Test build #120394 has finished for PR 28025 at commit 536107e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just had a couple questions

dtype = {}
for field in self.schema:
dtype = [None] * len(self.schema)
for fieldIdx in range(len(self.schema)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use enumerate here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

series = pdf.iloc[:, index].astype(t, copy=False)
else:
series = pdf.iloc[:, index]
df.insert(index, self.schema[index].name, series, allow_duplicates=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make a copy of the data? Seems to go into a make_block method, but I can't tell for sure if that is doing an allocation

Copy link
Member Author

@viirya viirya Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like so. insert calls _sanitize_column which makes a copy of the data.

But pdf.iloc[:, index] = pdf.iloc[:, index].astype(t, copy=False) doesn't work as I replied earlier to @HyukjinKwon. Looks like whether iloc returns a view or a copy, may depend on the context.

@SparkQA
Copy link

SparkQA commented Mar 27, 2020

Test build #120445 has finished for PR 28025 at commit 4270bcc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2020

Test build #120446 has finished for PR 28025 at commit b8e69e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2020

Test build #120447 has finished for PR 28025 at commit 1cf1f12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master, and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Mar 27, 2020
…umn names

### What changes were proposed in this pull request?

When `toPandas` API works on duplicate column names produced from operators like join, we see the error like:

```
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

This patch fixes the error in `toPandas` API.

### Why are the changes needed?

To make `toPandas` work on dataframe with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result.

### How was this patch tested?

Unit test.

Closes #28025 from viirya/SPARK-31186.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 559d3e4)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@viirya
Copy link
Member Author

viirya commented Mar 27, 2020

Thanks @HyukjinKwon @BryanCutler

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…umn names

### What changes were proposed in this pull request?

When `toPandas` API works on duplicate column names produced from operators like join, we see the error like:

```
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

This patch fixes the error in `toPandas` API.

### Why are the changes needed?

To make `toPandas` work on dataframe with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result.

### How was this patch tested?

Unit test.

Closes apache#28025 from viirya/SPARK-31186.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@viirya viirya deleted the SPARK-31186 branch December 27, 2023 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants