Problem / symptom:
In the logs from controller-dask-gateway, we see this:
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]
Details of the problem:
We are running JupyterHub and DaskHub (through the official Helm charts) on AKS, using the Azure CNI plugin. We have some network policies implemented in the daskhub namespace.
While using the Jupyter notebooks (through a browser, of course), the following piece of code would wait for about a minute and then throw errors.
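The snippet itself is not reproduced in this extract. As a rough reconstruction, assuming the standard dask-gateway client that DaskHub provides in the notebook environment, the failing code would look something like this:

# Hypothetical reconstruction -- the original notebook code is not shown in this issue.
# Assumes the dask-gateway client configured by the DaskHub Helm chart defaults.
from dask_gateway import Gateway

# One of the calls below hung for about a minute and then raised errors.
gateway = Gateway()                     # connection details come from the DaskHub defaults
options = gateway.cluster_options()     # hits the gateway's /api/v1/options endpoint
cluster = gateway.new_cluster(options)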
The logs from controller-dask-gateway showed the following errors:
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]
From the error messages, it looks like the dask-gateway Kubernetes controller cannot talk to the Kubernetes API at 10.0.0.1:443.
At first glance it appears that some firewall or network policy is blocking communication from the Dask controller to the Kubernetes API.
This is not the case, though: inspecting the network policies shows that the egress traffic is allowed.
An SSL connection from inside the Dask controller to the Kubernetes API IP and port works at the TCP/IP level, so it is not a firewall or network-policy problem.
There seems to be an SSL handshake problem, either because of a certificate mismatch or because the CA certificate issued by Kubernetes is not trusted by the Dask controller, which causes this failure.
Below are logs from the controller-dask-gateway pod.
[kamran@kworkhorse ~]$ kubectl -n daskhub logs -f controller-dask-gateway-668498765b-67lmk
[I 2022-06-02 20:04:27.349 KubeController] Starting dask-gateway-kube-controller - version 2022.4.0
[I 2022-06-02 20:04:27.501 KubeController] dask-gateway-kube-controller started!
[I 2022-06-02 20:04:27.502 KubeController] API listening at http://:8000
[E 2022-06-05 01:02:20.171 KubeController] Error in endpoints informer, retrying...
Traceback (most recent call last):
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 986, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1089, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1119, in _create_connection_transport
await waiter
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/dask/.local/lib/python3.10/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 149, in run
initial = await method(**self.method_kwargs)
File "/home/dask/.local/lib/python3.10/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 47, in func
return await method(*args, **kwargs)
File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
response_data = await self.request(
File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
return (await self.request("GET", url,
File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
r = await self.pool_manager.request(**args)
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/client.py", line 535, in _request
conn = await self._connector.connect(
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 542, in connect
proto = await self._create_connection(req, traces, timeout)
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 907, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
raise last_exc
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]
[E 2022-06-05 10:24:44.857 KubeController] Error in endpoints informer, retrying...
Traceback (most recent call last):
File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 986, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1089, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1119, in _create_connection_transport
await waiter
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
. . .
^C
Check whether we can reach 10.0.0.1:443 from controller-dask-gateway:
Looks like we can reach 10.0.0.1:443, but there is a certificate error. So at least it is not a firewall/network-policy issue.
dask@controller-dask-gateway-668498765b-67lmk:~$ openssl s_client -connect 10.0.0.1:443
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = apiserver
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = apiserver
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = apiserver
verify return:1
---
Certificate chain
0 s:CN = apiserver
i:CN = ca
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIF+TCCA+GgAwIBAgIQcwW8FOcPZD0aaEJ8SB8aGTANBgkqhkiG9w0BAQsFADAN
MQswCQYDVQQDEwJjYTAeFw0yMjA2MDIxODE0MjBaFw0yNDA2MDIxODI0MjBaMBQx
EjAQBgNVBAMTCWFwaXNlcnZlcjCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoC
ggIBAJ48Qfk5HhAf71Cb9MzvY2hwb+tA3H022thdUiI3nxYhrkSUiXA+GzyZjb8t
ChV8Ecjxn1m/WeKfvuQ32T19PDmu2rlYhr2J1VPwd2r6ZTsJesi4R98EhxnxKZ7W
GGsjWu/E45yIuOFojIGBGDCEbKYAHe6U9xvEUUruGpY8gJQ8ms+sH6UBYz3aGqfv
oxaMMiuqC5FMgbnsle1JubpryyyaGwrk7m5OAn1aeB1qKfO85OhVl9oKXS3e2J2E
80uslbbqF/KP8zm1k5ilHEzwbP1eisqqWFcqWov0rZrfgWGrIYj2dNCVSAjl2iLM
VqKVFj7ki9uOhitCGInBQIfjvyzwtv1GrioZuAepL1/L1AJjfF4dsmcMCBX3WjzF
hwqjGaDk4/n4JoF8bYoXP1npfbtFWsqvDWAOwNUDSBvK4gePuBTjGyn0/YRS084F
OcG5npQyjD0aM/rQUv2pHA7esUQwQdMTUX4an6WBVJTyd/fRpvECvjtf3BNL/hUC
gUO3esC+K7KsJpWc0ZIOXIA3lmDN7vmivqvF2s4Xnd86QQhY7Txxr6aMfVby
-----END CERTIFICATE-----
subject=CN = apiserver
issuer=CN = ca
---
Acceptable client certificate CA names
CN = ca
CN = agg-ca
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2456 bytes and written 393 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
Protocol : TLSv1.3
Cipher : TLS_AES_256_GCM_SHA384
Session-ID: A592BA296F8ADD7A8F8E50D4C424A876A10964DCA7DAD7D591F5170147716A53
Session-ID-ctx:
Resumption PSK: C11FDA4709BFC86AC6E79F20274D63EAB7E356538FED72690D21859BD74600A6894FBEDA37E6EDBF4796C1D495C349CC
PSK identity: None
PSK identity hint: None
SRP username: None
TLS session ticket lifetime hint: 604800 (seconds)
TLS session ticket:
0000 - c3 31 95 4d c0 a4 8c 9c-87 dc dc 1f 36 78 0a 41 .1.M........6x.A
0010 - 15 2b b6 36 db 49 fa f3-8c 6a 0f af 74 54 18 e3 .+.6.I...j..tT..
0020 - 93 cc 2b 2b ec 1d 8b 90-70 c9 0b f4 8e 1d 64 f3 ..++....p.....d.
0060 - 74 9e e4 28 c1 7d b5 e3-4b 81 7e 59 a5 d2 f6 f3 t..(.}..K.~Y....
0070 - be b9 3e 72 b6 0e e6 59-8c 15 c4 03 65 cb bf 74 ..>r...Y....e..t
0080 - ad .
Start Time: 1654596393
Timeout : 7200 (sec)
Verify return code: 21 (unable to verify the first certificate)
Extended master secret: no
Max Early Data: 0
---
read R BLOCK
closed
dask@controller-dask-gateway-668498765b-67lmk:~$
Try accessing 10.0.0.1:443 by providing the CA certificate manually:
This works.
dask@controller-dask-gateway-668498765b-lgcks:~$ openssl s_client -connect 10.0.0.1:443 -CAfile /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
CONNECTED(00000003)
Can't use SSL_get_servername
depth=1 CN = ca
verify return:1
depth=0 CN = apiserver
verify return:1
---
Certificate chain
0 s:CN = apiserver
i:CN = ca
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIF+TCCA+GgAwIBAgIQcwW8FOcPZD0aaEJ8SB8aGTANBgkqhkiG9w0BAQsFADAN
MQswCQYDVQQDEwJjYTAeFw0yMjA2MDIxODE0MjBaFw0yNDA2MDIxODI0MjBaMBQx
EjAQBgNVBAMTCWFwaXNlcnZlcjCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoC
ggIBAJ48Qfk5HhAf71Cb9MzvY2hwb+tA3H022thdUiI3nxYhrkSUiXA+GzyZjb8t
ChV8Ecjxn1m/WeKfvuQ32T19PDmu2rlYhr2J1VPwd2r6ZTsJesi4R98EhxnxKZ7W
GGsjWu/E45yIuOFojIGBGDCEbKYAHe6U9xvEUUruGpY8gJQ8ms+sH6UBYz3aGqfv
dMY59wPo7gEh9PMN6+N/OioPxb6nCi8Bw7hPaDN8KLitHwwJwVklZgZTl8/DPZ2h
4YJIV/1k4BdVQ7rBCALbnAexreiHgiUbxaLFfyYujI3ITWG4zta4LC5JsZTajdaT
oxaMMiuqC5FMgbnsle1JubpryyyaGwrk7m5OAn1aeB1qKfO85OhVl9oKXS3e2J2E
80uslbbqF/KP8zm1k5ilHEzwbP1eisqqWFcqWov0rZrfgWGrIYj2dNCVSAjl2iLM
VqKVFj7ki9uOhitCGInBQIfjvyzwtv1GrioZuAepL1/L1AJjfF4dsmcMCBX3WjzF
hwqjGaDk4/n4JoF8bYoXP1npfbtFWsqvDWAOwNUDSBvK4gePuBTjGyn0/YRS084F
OcG5npQyjD0aM/rQUv2pHA7esUQwQdMTUX4an6WBVJTyd/fRpvECvjtf3BNL/hUC
gUO3esC+K7KsJpWc0ZIOXIA3lmDN7vmivqvF2s4Xnd86QQhY7Txxr6aMfVby
-----END CERTIFICATE-----
subject=CN = apiserver
issuer=CN = ca
---
Acceptable client certificate CA names
CN = ca
CN = agg-ca
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2456 bytes and written 393 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
Protocol : TLSv1.3
Cipher : TLS_AES_256_GCM_SHA384
Session-ID: 4ACA638F9D4AEB2BD704C6F6264EC3E381B936C5A4D5A1CA543856E23F5B6A49
Session-ID-ctx:
Resumption PSK: 4C3BEE30B6E4F0AE977B0B8426B157BCA50F0AD15562B670055970B0096EA244485B33E3225BBA1934AB6E0D2A73F5DF
PSK identity: None
PSK identity hint: None
SRP username: None
TLS session ticket lifetime hint: 604800 (seconds)
TLS session ticket:
0000 - c3 31 95 4d c0 a4 8c 9c-87 dc dc 1f 36 78 0a 41 .1.M........6x.A
0010 - 66 36 e6 2c 50 2c ea 57-ba b2 db 89 5b 16 ee 80 f6.,P,.W....[...
0020 - 34 60 67 54 80 39 c3 88-16 2f 1e 1e d4 2a fc b3 4`gT.9.../...*..
0060 - 04 55 ce 3b 9e e4 8d 83-0a 97 c7 13 b2 73 37 0a .U.;.........s7.
0070 - 56 dc 50 61 4a 0d 76 71-1d 9b 9c b9 2c aa 79 a3 V.PaJ.vq....,.y.
0080 - a6 .
Start Time: 1654599650
Timeout : 7200 (sec)
Verify return code: 0 (ok)
Extended master secret: no
Max Early Data: 0
---
read R BLOCK
closed
The above tests prove that controller-dask-gateway is in fact able to reach 10.0.0.1:443, and that no firewall or network policy is preventing this from happening.
We also noticed that there is no setting in the dask-gateway configuration to provide a CA file. We would assume that dask-gateway knows how to use the correct CA certificate (ca.crt) from the service account mounted/projected inside the pod, so this should not be the cause of the problem either.
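For reference, this is roughly how the in-cluster configuration is picked up by kubernetes_asyncio, the client library the dask-gateway controller builds on (an illustration of the library's usual behaviour, not dask-gateway's exact code path):

# Illustration only -- not dask-gateway's actual code.
# The in-cluster loader reads the token and ca.crt from
# /var/run/secrets/kubernetes.io/serviceaccount/ by itself,
# so no explicit CA-file setting is needed in the application config.
import asyncio
from kubernetes_asyncio import client, config

async def main():
    config.load_incluster_config()
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        # The controller's "endpoints informer" does a similar list/watch.
        eps = await v1.list_namespaced_endpoints("daskhub")
        print([e.metadata.name for e in eps.items])

asyncio.run(main())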
So, what is the problem then?
Errors from the traefik pod:
A separate error was being logged in the traefik pod, shown below.
time="2022-06-13T09:08:23Z" level=debug msg="vulcand/oxy/roundrobin/rr: Forwarding this request to URL" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"\",\"Opaque\":\"\",\"User\":null,\"Host\":\"\",\"Path\":\"/api/v1/options\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\",\"RawFragment\":\"\"},\"Proto\":\"HTTP/1.1\",\"ProtoMajor\":1,\"ProtoMinor\":1,\"Header\":{\"Accept\":[\"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9\"],\"Connection\":[\"close\"],\"Cookie\":[\"jupyterhub-services=2|1:0|10:1654628753|19:jupyterhub-services|0:|20e5bf6702fbf493a47a021c7d133f62fc9c215a403a938cc541da628eb86c55; _odp_apps_oauth_session=s%3Ad8efd49c-e99f-4936-b016-61baff34592c.Ylp1d%2Fkuuw6sZsuBOA9KrmZeN73TIBdOipBVSd3Ul%2BM; jupyterhub-session-id=fd3293661df24d4bbae9ac6904f4815b\"],\"Sec-Fetch-Dest\":[\"document\"],\"Sec-Fetch-Mode\":[\"navigate\"],\"Sec-Fetch-Site\":[\"none\"],\"Sec-Fetch-User\":[\"?1\"],\"Sec-Gpc\":[\"1\"],\"Upgrade-Insecure-Requests\":[\"1\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\"],\"X-Forwarded-Host\":[\"dask.dev.oceandata.xyz\"],\"X-Forwarded-Port\":[\"80\"],\"X-Forwarded-Prefix\":[\"/services/dask-gateway\"],\"X-Forwarded-Proto\":[\"http\"],\"X-Forwarded-Server\":[\"traefik-dask-gateway-5b64bff4f7-2qrh7\"],\"X-Real-Ip\":[\"10.240.0.96\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"dask.dev.oceandata.xyz\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"10.240.0.96:55150\",\"RequestURI\":\"/api/v1/options\",\"TLS\":null}" ForwardURL="http://10.240.0.192:8000"
time="2022-06-13T09:08:53Z" level=debug msg="'504 Gateway Timeout' caused by: dial tcp 10.240.0.192:8000: i/o timeout"
From the above log entry, it looks like traefik-dask-gateway is unable to access api-dask-gateway (10.240.0.192:8000).
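One quick way to confirm that kind of blockage is to try the same endpoint directly from the traefik pod. A sketch, with the pod name taken from the log above; it assumes the traefik image ships busybox wget (otherwise use an ephemeral debug container):

kubectl -n daskhub exec -it traefik-dask-gateway-5b64bff4f7-2qrh7 -- \
  wget -qO- -T 5 http://10.240.0.192:8000/api/v1/options

A timeout means the connection is blocked; any HTTP response (even an authentication error) means the api pod is reachable on port 8000.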
Solution:
This is a completely different problem, but we added an entry to the network policy of api-dask-gateway so that traefik can talk to api-dask-gateway on port 8000. As soon as we applied this network policy, the SSL errors in controller-dask-gateway stopped appearing.
As you can see, the error messages at the beginning of this issue were completely misleading. They pushed us towards fixing communication between controller-dask-gateway and the Kubernetes API server/endpoint, whereas the actual problem was traefik being unable to reach api-dask-gateway. This caused us a lot of frustration, not to mention the hair loss in the process! The whole thing is documented here so others can benefit from it, and hopefully the error messages in the dask-gateway components can be improved.
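The original post ends with the complete network policy file for api-dask-gateway, which is not reproduced in this extract. As a minimal sketch of the kind of ingress rule described above (the namespace and label selectors are assumptions; check them against the actual pod labels of your release, e.g. with kubectl -n daskhub get pods --show-labels):

# Sketch only -- name, namespace and label selectors are assumptions,
# not the actual policy from the issue.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-dask-gateway
  namespace: daskhub
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: gateway          # the api-dask-gateway pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: traefik  # allow traefik -> api
      ports:
        - protocol: TCP
          port: 8000                                # the port api-dask-gateway listens on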
Thanks for a detailed writeup! I'm on mobile and have only skimmed through the issue so far.
We have some network policies implemented in the daskhub namespace.
As soon as you let a networkpolicy target a pod, it becomes locked down to what is explicitly allowed by the networkpolicies targeting it. The dask-gateway helm chart does not bundle networkpolicies, and is therefore not locked down by default to what is needed for core functionality, so by adding a networkpolicy targeting the pods it works with, you could have caused this, and it would be expected.
In my mind, the dask-gateway helm chart should ideally bundle such networkpolicies, and it seems you have figured out a lot of the networking you had to allow for core functionality after you ended up locking the pods down. Nice!