[SPARK-31186][PySpark][SQL] toPandas should not fail on duplicate column names #28025
Test build #120374 has finished for PR 28025 at commit
Test build #120375 has finished for PR 28025 at commit
cc @HyukjinKwon
Looks good. A couple of questions.
Test build #120394 has finished for PR 28025 at commit
HyukjinKwon
left a comment
LGTM cc @BryanCutler @ueshin
BryanCutler
left a comment
Looks good, just had a couple questions
    - dtype = {}
    - for field in self.schema:
    + dtype = [None] * len(self.schema)
    + for fieldIdx in range(len(self.schema)):
better to use enumerate here?
ok.
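The change under review swaps a dict keyed by column name for a list indexed by position, and the reviewer suggests `enumerate` for the loop. A minimal sketch of why that matters with duplicate names, using plain Python only. The `schema` list of `(name, type)` pairs is a hypothetical stand-in for `self.schema`, not the actual Spark structure:

```python
# Hypothetical (name, target type) pairs standing in for self.schema
# after a join that produced two columns named "id".
schema = [("id", "int64"), ("id", "float64"), ("name", None)]

# Keyed by name: the second "id" silently overwrites the first.
by_name = {}
for name, target in schema:
    by_name[name] = target

# Indexed by position, using enumerate as suggested in the review:
# every column keeps its own entry even when names repeat.
by_position = [None] * len(schema)
for index, (name, target) in enumerate(schema):
    by_position[index] = target

print(by_name)
print(by_position)
```

The dict ends up with two entries for three columns, while the list keeps one entry per column.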
        series = pdf.iloc[:, index].astype(t, copy=False)
    else:
        series = pdf.iloc[:, index]
    df.insert(index, self.schema[index].name, series, allow_duplicates=True)
Does this make a copy of the data? Seems to go into a make_block method, but I can't tell for sure if that is doing an allocation
Looks like it does. `insert` calls `_sanitize_column`, which makes a copy of the data.
But `pdf.iloc[:, index] = pdf.iloc[:, index].astype(t, copy=False)` doesn't work, as I replied earlier to @HyukjinKwon. Whether `iloc` returns a view or a copy may depend on the context.
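The pattern discussed here, selecting each column by position and inserting it into a fresh frame with `allow_duplicates=True`, can be sketched in standalone pandas. This is an illustration under assumptions, not the patch itself: `pdf`, `target_types`, and `out` are hypothetical names, and the sketch makes no claim about whether pandas copies the data internally.

```python
import pandas as pd

# Hypothetical stand-in for the collected frame, with a duplicated column name.
pdf = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["v", "v"])

# Hypothetical per-position target types; None means "leave the column as is".
target_types = ["int64", None]

out = pd.DataFrame()
for index, target in enumerate(target_types):
    # Positional selection always yields a single Series, even with duplicate names.
    series = pdf.iloc[:, index]
    if target is not None:
        series = series.astype(target)
    # allow_duplicates=True lets the second "v" be inserted next to the first.
    out.insert(index, pdf.columns[index], series, allow_duplicates=True)

print(out.dtypes)
```

Selecting by position sidesteps the ambiguity of label-based access, since `pdf["v"]` would return both columns at once.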
Test build #120445 has finished for PR 28025 at commit
Test build #120446 has finished for PR 28025 at commit
Test build #120447 has finished for PR 28025 at commit
Merged to master and branch-3.0.
…umn names

### What changes were proposed in this pull request?

When `toPandas` API works on duplicate column names produced from operators like join, we see the error like:

```
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

This patch fixes the error in `toPandas` API.

### Why are the changes needed?

To make `toPandas` work on dataframe with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result.

### How was this patch tested?

Unit test.

Closes #28025 from viirya/SPARK-31186.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 559d3e4)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Thanks @HyukjinKwon @BryanCutler
### What changes were proposed in this pull request?

When the `toPandas` API works on duplicate column names produced by operators like join, we see an error like `ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` This patch fixes the error in the `toPandas` API.

### Why are the changes needed?

To make `toPandas` work on a dataframe with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously, calling the `toPandas` API on a dataframe with duplicate column names failed. After this patch, it produces the correct result.

### How was this patch tested?

Unit test.
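For readers wondering how duplicate column names can lead to this kind of `ValueError`, here is a small pandas-only sketch. It is an assumption about the mechanism, since the thread does not show the failing line: selecting a duplicated label returns a DataFrame rather than a Series, so a null check on it yields a Series, which cannot be used as a boolean. The frame and the `key` column name are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a duplicated column name, as a join might produce.
pdf = pd.DataFrame([[1, 2]], columns=["key", "key"])

# Label-based selection returns both columns, so this is a DataFrame.
selected = pdf["key"]

# One boolean per duplicated column: a Series, not a single bool.
has_null = selected.isnull().any()

try:
    if has_null:  # truth-testing a Series raises ValueError
        pass
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

# Positional selection yields a single Series, so the check is a plain bool.
first_has_null = bool(pdf.iloc[:, 0].isnull().any())
```

Positional access with `iloc` avoids the ambiguity, which matches the direction taken in the review discussion.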