[SPARK-42574][CONNECT][PYTHON] Fix toPandas to handle duplicated column names #40170

ueshin · 2023-02-25T01:59:48Z

What changes were proposed in this pull request?

Fixes DataFrame.toPandas to handle duplicated column names.

Why are the changes needed?

Currently

spark.sql("select 1 v, 1 v").toPandas()

fails with the error:

Traceback (most recent call last):
...
  File ".../python/pyspark/sql/connect/dataframe.py", line 1335, in toPandas
    return self._session.client.to_pandas(query)
  File ".../python/pyspark/sql/connect/client.py", line 548, in to_pandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1186, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index

Similar to #28210.

Does this PR introduce any user-facing change?

Yes. toPandas() no longer fails on duplicated column names; they are available in the resulting pandas DataFrame.
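
For readers less familiar with duplicated labels in pandas, here is a small sketch of how such a result behaves once it is a pandas DataFrame. The frame is built by hand, not through Spark, so the data is illustrative only.

```python
import pandas as pd

# Hand-built stand-in for the result of: select 1 v, 1 v
pdf = pd.DataFrame([[1, 1]], columns=["v", "v"])

# Selecting by a duplicated label returns every matching column as a DataFrame.
both = pdf["v"]

# Positional access picks exactly one of the duplicated columns.
first = pdf.iloc[:, 0]
```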

How was this patch tested?

Enabled related tests.

HyukjinKwon · 2023-02-27T00:22:44Z

Merged to master and branch-3.4.

…mn names

Closes #40170 from ueshin/issues/SPARK-42574/toPandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 89cf490)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

zhengruifeng · 2023-02-27T01:02:05Z

late LGTM, thanks!

Fix toPandas to handle duplicated column names.

4377e00

ueshin requested review from HyukjinKwon and zhengruifeng February 25, 2023 01:59

github-actions bot added CONNECT CORE PYTHON SQL labels Feb 25, 2023

HyukjinKwon approved these changes Feb 25, 2023

Update test_dataframe.py

1c65f60

HyukjinKwon closed this in 89cf490 Feb 27, 2023
