@ueshin ueshin commented Apr 14, 2020

What changes were proposed in this pull request?

This PR adds support for duplicated column names in toPandas with Arrow execution.

Why are the changes needed?

When we execute toPandas() with Arrow execution, it fails if the column names have duplicates.

>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index

Does this PR introduce any user-facing change?

Yes. Previously the call failed with the error above; after this PR, it returns the result:

>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
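
The change in behavior can be illustrated without Spark or Arrow. The sketch below is a pure-Python model, not the code from this PR: `strict_to_columns` is a hypothetical stand-in for a converter that rejects non-unique names (as the failing `to_pandas` call does in the traceback above), and `to_columns_allowing_duplicates` shows one common workaround, assumed here purely for illustration: convert under temporary unique names, then restore the original names by position.

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, List[int]]  # (column name, column values)


def strict_to_columns(names: Sequence[str], data: Sequence[List[int]]) -> List[Column]:
    """Hypothetical converter that, like the failing call, rejects duplicate names."""
    if len(set(names)) != len(names):
        raise ValueError("Found non-unique column index")
    return list(zip(names, data))


def to_columns_allowing_duplicates(
    names: Sequence[str], data: Sequence[List[int]]
) -> List[Column]:
    """Convert under temporary unique names, then restore the originals by position."""
    tmp_names = ["col_{}".format(i) for i in range(len(names))]  # always unique
    converted = strict_to_columns(tmp_names, data)
    # Positions are preserved, so the i-th original name belongs to the i-th column.
    return [(name, values) for name, (_, values) in zip(names, converted)]


names = ["v", "v"]  # same shape as: select 1 v, 1 v
data = [[1], [1]]

try:
    strict_to_columns(names, data)
    strict_failed = False
except ValueError:
    strict_failed = True

result = to_columns_allowing_duplicates(names, data)
```

Because the temporary names are derived from column positions, the mapping back to the original names is unambiguous even when every column shares the same name.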

How was this patch tested?

Added and modified related tests.

SparkQA commented Apr 14, 2020

Test build #121244 has finished for PR 28210 at commit 762d9dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

The last commit was a comment-only change. Merged to master and branch-3.0.

@HyukjinKwon

Should we backport this and SPARK-31186, @viirya and @ueshin?

SparkQA commented Apr 14, 2020

Test build #121250 has finished for PR 28210 at commit 8ecdf33.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Apr 14, 2020
… execution

Closes #28210 from ueshin/issues/SPARK-31441/to_pandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 87be364)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

viirya commented Apr 14, 2020

@HyukjinKwon Backport to branch-2.4? Ok, sounds good to me. I will prepare a backport of SPARK-31186.

ueshin commented Apr 14, 2020

@HyukjinKwon Sure, I'll submit the backport PR after @viirya's is merged.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… execution

Closes apache#28210 from ueshin/issues/SPARK-31441/to_pandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
ueshin added a commit that referenced this pull request Apr 15, 2020
…toPandas with arrow execution

### What changes were proposed in this pull request?

This is to backport #28210.

Closes #28221 from ueshin/issues/SPARK-31441/2.4/to_pandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
@gatorsmile

For other reviewers: this PR is a follow-up to #28025.

HyukjinKwon pushed a commit that referenced this pull request Feb 27, 2023
…mn names

### What changes were proposed in this pull request?

Fixes `DataFrame.toPandas` to handle duplicated column names.

### Why are the changes needed?

Currently

```py
spark.sql("select 1 v, 1 v").toPandas()
```

fails with the error:

```py
Traceback (most recent call last):
...
  File ".../python/pyspark/sql/connect/dataframe.py", line 1335, in toPandas
    return self._session.client.to_pandas(query)
  File ".../python/pyspark/sql/connect/client.py", line 548, in to_pandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1186, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```

Similar to #28210.
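
On a build without this fix, it can help to check for repeated names before calling `toPandas()`. A minimal sketch, assuming the column names are available as a plain list of strings (as `DataFrame.columns` is in PySpark); `duplicated_names` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from typing import List, Sequence


def duplicated_names(columns: Sequence[str]) -> List[str]:
    """Return each name that occurs more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


# Column names as produced by: select 1 v, 1 v
dupes = duplicated_names(["v", "v"])
no_dupes = duplicated_names(["a", "b"])
```

An empty list means the names are unique, so the non-unique column index error shown above should not be triggered by the column names.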

### Does this PR introduce _any_ user-facing change?

`toPandas()` now succeeds on DataFrames with duplicated column names and keeps those names in the result.

### How was this patch tested?

Enabled related tests.

Closes #40170 from ueshin/issues/SPARK-42574/toPandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 27, 2023
…mn names

Closes #40170 from ueshin/issues/SPARK-42574/toPandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 89cf490)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…mn names

Closes apache#40170 from ueshin/issues/SPARK-42574/toPandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 89cf490)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>