Handler starts but never finishes #810

Closed
euan-tilley opened this issue Jul 26, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@euan-tilley

Long story short

I have recently deployed our operator to an AKS cluster (it has been running on EKS and on-prem clusters without any issue) and started noticing that it wasn't
handling changes to CRDs; restarting the container would trigger the handler successfully. At first I thought that we were missing events, so I explicitly added some API timeouts at startup.
However, this didn't make a difference and events still seemed to be missed. On further investigation (running with the --debug startup param), the event did seem to get picked up and the handler triggered, but it just seems to hang without ever finishing.

debug | 2021-07-26T17:14:00.672642+00:00 | [thor/thor-configs] Updating diff: (('change', ('spec', 'patch'), '2021-07-26T17:04:08Z', '2021-07-26T17:13:56Z'),)
debug | 2021-07-26T17:14:00.675029+00:00 | [thor/thor-configs] Handler 'config_update_handler' is invoked.

There was no Handler 'config_update_handler' succeeded. log line (I left it for over an hour); it seems it would stay in this state forever.

When restarting the container there are some log lines that talk about Unprocessed streams, but I haven't been able to figure out why.


debug | 2021-07-26T17:20:03.375018+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d2c6df0> starts SSL handshake
debug | 2021-07-26T17:20:03.394016+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d2c6df0>: SSL handshake took 18.8 ms
debug | 2021-07-26T17:20:03.394890+00:00 | <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.48', 57824), raddr=('10.1.0.1', 443)> connected to 10.1.0.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7fe41d4abfa0>, <aiohttp.client_proto.ResponseHandler object at 0x7fe41da1d760>)
debug | 2021-07-26T17:20:03.404854+00:00 | Keep-alive in 'default' cluster-wide: not found.
info | 2021-07-26T17:20:35.960403+00:00 | Signal SIGTERM is received. Operator is stopping.
debug | 2021-07-26T17:20:35.961961+00:00 | Stopping the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
debug | 2021-07-26T17:20:35.965829+00:00 | Namespace observer is cancelled.
debug | 2021-07-26T17:20:35.966790+00:00 | Credentials retriever is cancelled.
debug | 2021-07-26T17:20:35.968802+00:00 | Poster of events is cancelled.
debug | 2021-07-26T17:20:35.976232+00:00 | Stopping the watch-stream for clusterkopfpeerings.v1.kopf.dev cluster-wide.
debug | 2021-07-26T17:20:35.981848+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8c9af0>: SSL error in data received
debug | 2021-07-26T17:20:35.998038+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8ec430>: SSL error in data received
debug | 2021-07-26T17:20:36.002980+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8ec130>: SSL error in data received
debug | 2021-07-26T17:20:36.004243+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8c9c70>: SSL error in data received
debug | 2021-07-26T17:20:36.007808+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8ec1f0>: SSL error in data received
debug | 2021-07-26T17:20:36.008655+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8c9e80>: SSL error in data received
debug | 2021-07-26T17:20:36.010025+00:00 | Stopping the watch-stream for secrets.v1 cluster-wide.
debug | 2021-07-26T17:20:36.012810+00:00 | Stopping the watch-stream for configs.v1.company.com cluster-wide.
debug | 2021-07-26T17:20:36.014347+00:00 | Stopping the watch-stream for apiaccesscredentials.v1.company.com cluster-wide.
debug | 2021-07-26T17:20:36.015304+00:00 | Stopping the watch-stream for certificates.v1.company.com cluster-wide.
debug | 2021-07-26T17:20:36.020593+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d8ec040>: SSL error in data received
debug | 2021-07-26T17:20:36.038528+00:00 | [thor/certificate-server-tls-cert] Timer 'renew_tls_certificate' has exited gracefully.
debug | 2021-07-26T17:20:36.046421+00:00 | Daemon killer is cancelled.
debug | 2021-07-26T17:20:36.053462+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d4b80d0> starts SSL handshake
debug | 2021-07-26T17:20:36.055670+00:00 | Resource observer is cancelled.
debug | 2021-07-26T17:20:36.082348+00:00 | <asyncio.sslproto.SSLProtocol object at 0x7fe41d4b80d0>: SSL handshake took 28.7 ms
debug | 2021-07-26T17:20:36.083909+00:00 | <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.48', 58122), raddr=('10.1.0.1', 443)> connected to 10.1.0.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7fe41d9c5580>, <aiohttp.client_proto.ResponseHandler object at 0x7fe4232cf4c0>)
debug | 2021-07-26T17:20:36.092817+00:00 | Keep-alive in 'default' cluster-wide: not found.
warn | 2021-07-26T17:20:38.033284+00:00 | Unprocessed streams left for [(configs.v1.company.com, 'c1a72afd-d36c-45e3-8cf1-3e713f39ac79')].
debug | 2021-07-26T17:20:45.976898+00:00 | Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='watcher for configs.v1.company.com@None' coro=<guard() running at /usr/local/lib/python3.8/site-packages/kopf/utilities/aiotasks.py:69> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/local/lib/python3.8/asyncio/tasks.py:902, <TaskWakeupMethWrapper object at 0x7fe41d2c4a30>()] created at /usr/local/lib/python3.8/asyncio/base_events.py:422> created at /usr/local/lib/python3.8/asyncio/tasks.py:382>}
debug | 2021-07-26T17:20:55.980860+00:00 | Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='watcher for configs.v1.company.com@None' coro=<guard() running at /usr/local/lib/python3.8/site-packages/kopf/utilities/aiotasks.py:69> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/local/lib/python3.8/asyncio/tasks.py:902, <TaskWakeupMethWrapper object at 0x7fe41d2c4a30>()] created at /usr/local/lib/python3.8/asyncio/base_events.py:422> created at /usr/local/lib/python3.8/asyncio/tasks.py:382>}
parse error: Invalid numeric literal at line 556, column 4

Not sure whether this is related to #718, as it does seem similar. One thing to note is that the cluster is not running at scale: there are fewer than 5 resources being watched.

Kopf version

1.29.0

Kubernetes version

1.19.11

Python version

3.8

Code

No response

Logs

No response

Additional information

No response

@euan-tilley euan-tilley added the bug Something isn't working label Jul 26, 2021
@nolar
Owner

nolar commented Jul 26, 2021

Can you please show how the code looks in general (maybe with the details removed if they are secret)?

Also, can you add debug logs into the handler and see whether the handler is entered and exited? I would assume the issue is that some API operations inside the handler are broken.
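For example, a minimal sketch of what I mean (assuming config_update_handler is an update handler for configs.v1.company.com, as the logs suggest; the body is a placeholder for your own logic):

```python
import kopf

@kopf.on.update('company.com', 'v1', 'configs')
def config_update_handler(spec, name, logger, **kwargs):
    logger.debug("config_update_handler entered for %s", name)
    try:
        # ... the actual reconciliation / API calls go here ...
        pass
    finally:
        logger.debug("config_update_handler exited for %s", name)
```

If the "entered" line appears but the "exited" line never does, the handler is stuck inside one of those calls rather than in Kopf itself.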

@euan-tilley
Author

Yes, you are right. I believe we are hitting this issue: Azure/AKS#1052, so closing.

@nolar
Owner

nolar commented Aug 5, 2021

@euan-tilley Thanks for letting me know!

As a side note, keep in mind that there is a very similar issue known to happen with Kopf on any Kubernetes deployment (not only Azure), with any version of Kopf & Kubernetes, and with any client library (there was the official client, then pykube-ng, now aiohttp): the watch-streams sometimes silently go into idle mode after several minutes of inactivity and yield no events even when changes happen, but do not report the TCP disconnection to the client (Kopf in this case), so the client (Kopf) believes the connection is alive and that simply nothing is happening out there. The root cause is unknown.

To work around this issue, the settings.watching.server_timeout and settings.watching.client_timeout settings were introduced (one is enough; both work even better). The default is None (meaning: "rely on the server-side defaults"), as it is difficult to decide what a good default would be: it depends on the cluster and environment and can vary from 1-5 minutes to 1 hour.
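For example, a minimal sketch of setting both timeouts in a startup handler (the 10-minute values are only illustrative; tune them to your cluster):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **kwargs):
    # Ask the API server to close each watch request after 10 minutes ...
    settings.watching.server_timeout = 10 * 60
    # ... and give up on the client side slightly later if the server never does.
    settings.watching.client_timeout = 10 * 60 + 60
```

With these set, a silently frozen watch-stream is re-established after at most one timeout period instead of idling forever.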

This does not affect the connection pools used inside the handlers, as in this reported issue: Kopf uses its own aiohttp session with its own pool, and these timeouts are applied only there, in Kopf's core.

However, if your environment is prone to this kind of connection issue, it is worth setting these timeouts too.
