local_connect_and_auth takes 2 arguments, 3 given #29023

Conversation
When I use `toPandas` on an RDD created from a PostgreSQL query, I get the following error:
```
pyspark/sql/dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
local_connect_and_auth() takes 2 positional arguments but 3 were given
warnings.warn(msg)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "__main__.py", line 6, in <module>
run(session)
File "app.py", line 30, in run
list = df.toPandas()
File "pyspark/sql/dataframe.py", line 2121, in toPandas
batches = self._collectAsArrow()
File "pyspark/sql/dataframe.py", line 2179, in _collectAsArrow
return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
File "pyspark/rdd.py", line 144, in _load_from_socket
(sockfile, sock) = local_connect_and_auth(*sock_info)
TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
```
The error disappears when I disable Arrow.
`sock_info` contains the following elements:
```
33719
aaf86d48dcee5958e0c4a34c858dc2d6a8bbadb1058b3ac260acaf9f2aa782ed
org.apache.spark.api.python.SocketFuncServer@19359bb7
```
so it's `port`, `auth_secret`, and a `SocketFuncServer`, which we don't need here. On the Spark v3 branch only the first 2 elements of that tuple are used.
I backported the changes from v3 here to get rid of that bug.
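For illustration, the failure mode can be reproduced with plain Python, outside of Spark; the function name `connect` below is made up for this sketch and only mirrors the two-parameter signature of `local_connect_and_auth`:
```
# Illustrative sketch only: a two-parameter function, analogous to
# local_connect_and_auth(port, auth_secret), called with a three-element
# tuple unpacked via *.
def connect(port, auth_secret):
    return (port, auth_secret)

# Shape mirrors the sock_info shown above: (port, auth secret, server handle).
sock_info = (33719, "aaf86d48dcee5958...", "org.apache.spark.api.python.SocketFuncServer@19359bb7")

try:
    connect(*sock_info)        # unpacks into 3 positional arguments
except TypeError as e:
    print(e)                   # connect() takes 2 positional arguments but 3 were given

connect(*sock_info[:2])        # passing only the first two elements works
```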
ok to test

Hi, @Matzz. Could you file a JIRA issue for this?

Test build #125228 has finished for PR 29023 at commit

cc @HyukjinKwon

Thanks @dongjoon-hyun for cc'ing me. @Matzz, can you show how you tested it?
```
-        (sockfile, sock) = local_connect_and_auth(*sock_info)
+        port = sock_info[0]
+        auth_secret = sock_info[1]
+        (sockfile, sock) = local_connect_and_auth(port, auth_secret)
```
This was correctly ported back in #25593. The code you pointed out does not exist in Spark 2.4.4: https://github.com/apache/spark/blob/v2.4.4/python/pyspark/sql/dataframe.py#L2182.
Are you using your own fork or mixing Spark versions? Your error message seems to come from https://github.com/apache/spark/blob/v2.4.3/python/pyspark/sql/dataframe.py#L2179, which is Spark 2.4.3. Spark 2.4.3 does not have this change.
@HyukjinKwon You are linking to dataframe.py, but I patched rdd.py.
The code is there:
https://github.com/apache/spark/blob/v2.4.4/python/pyspark/rdd.py#L144
So `(*sock_info)` is used instead of passing the individual tuple values.
OK, I see now. It is possible that I messed something up while using pipenv.
Anyway, I feel that using `(*sock_info)` is inherently unsafe and prone to such errors.
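One way to make the call site more defensive (a sketch of an alternative, not the actual patch in this PR; it assumes the same `sock_info` tuple and `local_connect_and_auth` function that `_load_from_socket` in rdd.py already has in scope) would be to unpack only the values the function expects:
```
# Sketch only: take the first two elements (port, auth secret) and ignore any
# extra elements the JVM side may append, such as the server handle.
port, auth_secret, *_ = sock_info
(sockfile, sock) = local_connect_and_auth(port, auth_secret)
```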
Yeah, I actually already pointed out that it's error-prone in the previous PR. Feel free to open another PR to fix it.
What changes were proposed in this pull request?
Use only the needed arguments from `sock_info` rather than the whole tuple.

Why are the changes needed?
To fix the following error:
`TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given`

Does this PR introduce any user-facing change?
No
How was this patch tested?
I tested this locally. No new tests were added. The proposed change was backported from the v3 branch.
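For reference, a minimal local reproduction could look like the sketch below. This is an assumption on my part, not the author's actual test script: the JDBC URL, table name, and credentials are placeholders, and it assumes the PostgreSQL JDBC driver is on the Spark classpath.
```
# Minimal reproduction sketch: read a DataFrame from PostgreSQL over JDBC
# with Arrow enabled, then call toPandas().
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("toPandas-arrow-repro")
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder
    .option("dbtable", "my_table")                           # placeholder
    .option("user", "user")                                  # placeholder
    .option("password", "password")                          # placeholder
    .load()
)

# Before the fix this raised:
# TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
pdf = df.toPandas()
print(pdf.head())

spark.stop()
```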