
@Matzz Matzz commented Jul 7, 2020

When I use `toPandas` on an RDD created from a PostgreSQL query, I get the following error:

```
pyspark/sql/dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
  local_connect_and_auth() takes 2 positional arguments but 3 were given
  warnings.warn(msg)

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "__main__.py", line 6, in <module>
    run(session)
  File "app.py", line 30, in run
    list = df.toPandas()
  File "pyspark/sql/dataframe.py", line 2121, in toPandas
    batches = self._collectAsArrow()
  File "pyspark/sql/dataframe.py", line 2179, in _collectAsArrow
    return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
  File "pyspark/rdd.py", line 144, in _load_from_socket
    (sockfile, sock) = local_connect_and_auth(*sock_info)
TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
```

The error disappears when I disable Arrow.

`sock_info` contains the following elements:

```
33719
aaf86d48dcee5958e0c4a34c858dc2d6a8bbadb1058b3ac260acaf9f2aa782ed
org.apache.spark.api.python.SocketFuncServer@19359bb7
```

so that is `port`, `auth_secret`, and a `SocketFuncServer`, which we don't need here. On the Spark v3 branch, only the first two elements of that tuple are used.
I backported the change from v3 here to get rid of that bug.
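
To illustrate the failure mode outside of Spark, here is a minimal sketch. The helper below is a stand-in for pyspark's two-argument `local_connect_and_auth`, not the actual implementation; the tuple values mirror the ones printed above:

```python
# Stand-in for pyspark's two-argument helper; illustrative only.
def local_connect_and_auth(port, auth_secret):
    return "connected to %d with secret %s" % (port, auth_secret)

# On the 2.4.x branch the JVM side hands back a 3-tuple:
# (port, auth_secret, server_handle)
sock_info = (33719, "aaf86d48...", "SocketFuncServer@19359bb7")

# Unpacking the whole tuple passes 3 arguments to a 2-argument function.
try:
    local_connect_and_auth(*sock_info)
except TypeError as e:
    print(e)  # local_connect_and_auth() takes 2 positional arguments but 3 were given

# Passing only the needed elements works, which is what the v3 branch does.
print(local_connect_and_auth(sock_info[0], sock_info[1]))
```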

### What changes were proposed in this pull request?

Use only the needed arguments from `sock_info` rather than unpacking the whole tuple.

### Why are the changes needed?

To fix the following error:

```
TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

I tested it locally. No new tests were added. The proposed change was backported from the v3 branch.

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

Hi, @Matzz . Could you file a JIRA issue for this?


SparkQA commented Jul 7, 2020

Test build #125228 has finished for PR 29023 at commit 2e2534f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @HyukjinKwon

@HyukjinKwon
Member

Thanks @dongjoon-hyun for cc'ing me. @Matzz, can you show how you tested it?

```diff
-    (sockfile, sock) = local_connect_and_auth(*sock_info)
+    port = sock_info[0]
+    auth_secret = sock_info[1]
+    (sockfile, sock) = local_connect_and_auth(port, auth_secret)
```

This was correctly ported back at #25593. There's no such code you pointed out in Spark 2.4.4: https://github.com/apache/spark/blob/v2.4.4/python/pyspark/sql/dataframe.py#L2182.

Are you using your own fork or mixing the Spark versions? Your error message seems to come from https://github.com/apache/spark/blob/v2.4.3/python/pyspark/sql/dataframe.py#L2179, which is Spark 2.4.3. Spark 2.4.3 does not have this change.


@HyukjinKwon You are linking to dataframe.py, but I patched rdd.py.
The code is there:
https://github.com/apache/spark/blob/v2.4.4/python/pyspark/rdd.py#L144

So `(*sock_info)` is used instead of passing individual tuple values.


Ok, I see now. It is possible that I messed something up while using pipenv.
Anyway, I feel that using `(*sock_info)` is inherently unsafe and prone to such errors.
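
A sketch of a more defensive pattern, assuming the helper only ever needs the first two fields: extended unpacking (PEP 3132) names the fields we use and tolerates any trailing elements the JVM side may append. The function below is hypothetical, not the actual pyspark code:

```python
# Hypothetical rewrite of the call site; names mirror the traceback above.
def extract_connection_info(sock_info):
    # Take the two fields we need and ignore any trailing elements
    # (e.g. the SocketFuncServer handle present on the 2.4.x branch).
    port, auth_secret, *_ = sock_info
    return port, auth_secret

# Works with both the 2-tuple and the 3-tuple shapes.
print(extract_connection_info((33719, "secret")))              # (33719, 'secret')
print(extract_connection_info((33719, "secret", "server@1")))  # (33719, 'secret')
```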

@HyukjinKwon HyukjinKwon Jul 9, 2020

Yeah, I actually already pointed out that it's error-prone in the previous PR. Feel free to open another PR to fix it.

@HyukjinKwon HyukjinKwon closed this Jul 8, 2020
