[SPARK-31441] Support duplicated column names for toPandas with arrow execution. #28210
Conversation
Test build #121244 has finished for PR 28210 at commit
The last commit was a comment-only change. Merged to master and branch-3.0.

Should we backport this and SPARK-31186, @viirya and @ueshin?
Test build #121250 has finished for PR 28210 at commit
[SPARK-31441] Support duplicated column names for toPandas with arrow execution
### What changes were proposed in this pull request?
This PR adds support for duplicated column names in `toPandas` with Arrow execution.
### Why are the changes needed?
When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates.
```py
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
pdf = table.to_pandas()
File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
columns = _deserialize_column_index(table, all_columns, column_indexes)
File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
columns = _flatten_single_level_multiindex(columns)
File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```
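For context, the error originates in PyArrow's pandas conversion path, which rejects tables whose column names are not unique. A minimal standalone reproduction, independent of Spark (assuming a local `pyarrow` installation; the exact message can vary by version), looks like this:
```py
import pyarrow as pa

# Two fields with the same name are legal in an Arrow schema...
table = pa.table([pa.array([1]), pa.array([1])], names=["v", "v"])

# ...but the default pandas conversion refuses a non-unique column index.
try:
    table.to_pandas()
except ValueError as e:
    print(e)  # e.g. "Found non-unique column index"
```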
### Does this PR introduce any user-facing change?
Yes. Previously we would hit the error above, but after this PR we get the result:
```py
>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
```
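One way to sidestep PyArrow's restriction is to rename the Arrow columns to unique temporaries before conversion and restore the original names on the resulting pandas DataFrame. The sketch below uses a hypothetical helper name and placeholder column names (`arrow_table_to_pandas`, `col_0`, `col_1`, ...) and is not necessarily the exact code in this PR:
```py
import pandas as pd
import pyarrow as pa

def arrow_table_to_pandas(table: pa.Table) -> pd.DataFrame:
    # Rename to unique placeholder names so the pandas conversion
    # does not reject the duplicated column index.
    placeholders = ["col_{}".format(i) for i in range(table.num_columns)]
    pdf = table.rename_columns(placeholders).to_pandas()
    # Restore the original (possibly duplicated) column names.
    pdf.columns = table.column_names
    return pdf
```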
### How was this patch tested?
Added and modified related tests.
Closes #28210 from ueshin/issues/SPARK-31441/to_pandas.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 87be364)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon Backport to branch-2.4? Ok, sounds good to me. I will prepare a backport of SPARK-31186.

@HyukjinKwon Sure, I'll submit the backport PR after @viirya's is merged.
[SPARK-31441] Support duplicated column names for toPandas with arrow execution (backport to branch-2.4)
### What changes were proposed in this pull request?
This is to backport #28210, which adds support for duplicated column names in `toPandas` with Arrow execution. See #28210 above for the full motivation, traceback, and user-facing example.
### How was this patch tested?
Added and modified related tests.
Closes #28221 from ueshin/issues/SPARK-31441/2.4/to_pandas.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
For other reviewers, this PR is the follow-up of #28025.
[SPARK-42574] Fix DataFrame.toPandas to handle duplicated column names
### What changes were proposed in this pull request?
Fixes `DataFrame.toPandas` to handle duplicated column names.
### Why are the changes needed?
Currently
```py
spark.sql("select 1 v, 1 v").toPandas()
```
fails with the error:
```py
Traceback (most recent call last):
...
File ".../python/pyspark/sql/connect/dataframe.py", line 1335, in toPandas
return self._session.client.to_pandas(query)
File ".../python/pyspark/sql/connect/client.py", line 548, in to_pandas
pdf = table.to_pandas()
File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
columns = _deserialize_column_index(table, all_columns, column_indexes)
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
columns = _flatten_single_level_multiindex(columns)
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1186, in _flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```
Similar to #28210.
### Does this PR introduce _any_ user-facing change?
Duplicated column names will be available when calling `toPandas()`.
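As an illustrative aside (not part of this change): pandas keeps the duplicated labels in the converted DataFrame, so label-based selection returns every matching column and positional access is often the practical choice:
```py
import pandas as pd

# Stand-in for the DataFrame returned by toPandas() after this fix.
pdf = pd.DataFrame([[1, 1]], columns=["v", "v"])

print(pdf["v"])        # label selection returns a DataFrame with both "v" columns
print(pdf.iloc[:, 0])  # positional access picks a single column
```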
### How was this patch tested?
Enabled related tests.
Closes #40170 from ueshin/issues/SPARK-42574/toPandas.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>