
catch errors in reflector.start #635

Merged
merged 1 commit into jupyterhub:main on Aug 29, 2022

Conversation

@minrk (Member) commented on Aug 26, 2022

ensures failure to start reflectors is fatal, rather than resulting in permanent hangs waiting for first_load_future

I believe this closes #627 (it at least explains one such event).

Relevant bits:

  • We already have code that makes repeated reflector update failures fatal to the Hub after a few retries
  • reflector.start is scheduled asynchronously, via ensure_future. Exceptions on this future are ignored.
  • methods like start/stop/poll await a first_load_future to ensure the reflector has loaded info at least once
  • first_load_future is only set on success, it's never set if the first load fails

All these combined mean that if the first load fails, the exception is ignored (it will be logged by asyncio with Task exception was never retrieved), and all subsequent calls wait forever instead of raising, never attempting to restart the reflector.
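The hang can be reproduced with a few lines of plain asyncio. The ToyReflector below is only an illustration of the pattern described above, not the real kubespawner code; the class and method names are made up for the example.

```python
# A minimal, self-contained sketch of the hang described above (toy code, not the
# actual kubespawner classes): an exception inside a task scheduled with
# asyncio.ensure_future is never retrieved, and a future that is only resolved on
# success leaves every awaiter stuck.
import asyncio


class ToyReflector:
    def __init__(self):
        # resolved only when the first successful load completes
        self.first_load_future = asyncio.Future()

    async def _list_and_update(self):
        # stand-in for the k8s API call that fails during an outage
        raise ConnectionError("cannot reach the k8s API")

    async def start(self):
        await self._list_and_update()
        # never reached on failure, so the future stays pending forever
        self.first_load_future.set_result(None)


async def main():
    reflector = ToyReflector()
    # exceptions raised in this task are swallowed; asyncio only logs
    # "Task exception was never retrieved" when the task is garbage collected
    asyncio.ensure_future(reflector.start())
    # this is effectively what start/stop/poll do; without the demo timeout
    # it would wait forever
    await asyncio.wait_for(reflector.first_load_future, timeout=2)


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except asyncio.TimeoutError:
        print("first load never resolved; real callers would hang forever")
```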

The events described in #627 are:

  • reflector starts having trouble talking to k8s (several aiohttp.client_exceptions.ClientConnectorError)
  • it keeps failing, so we reach the fatal "Pods reflector failed, halting Hub" error
  • Hub stops, restarts
  • On restart, if we are still in the disrupted state, the initial load fails again with the same aiohttp.client_exceptions.ClientConnectorError, but this time it's ignored (ERROR:asyncio:Task exception was never retrieved)
  • since first_load_future is never set, all start/stop/poll calls are just infinite waits

The fix here is to:

  1. catch (any) error in Reflector.start(),
  2. call the same fatal-cleanup on_reflector_failure function, and
  3. store the error in first_load_future so that anyone awaiting the first load gets the exception rather than a forever-incomplete Future (a sketch of this pattern follows the list)
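The shape of the fix, again as a hedged toy sketch rather than the merged diff (on_failure here stands in for the on_reflector_failure callback described above; the numbered comments map to the list items):

```python
import asyncio


class ToyReflector:
    def __init__(self, on_failure=None):
        self.first_load_future = asyncio.Future()
        # fatal-cleanup callback, standing in for on_reflector_failure
        self.on_failure = on_failure

    async def _list_and_update(self):
        raise ConnectionError("cannot reach the k8s API")

    async def start(self):
        try:
            # 1. catch any error raised while doing the first load
            await self._list_and_update()
        except Exception as e:
            # 2. run the same fatal cleanup used for later reflector failures
            if self.on_failure is not None:
                self.on_failure()
            # 3. hand the error to everyone awaiting the first load,
            #    instead of leaving the future pending forever
            if not self.first_load_future.done():
                self.first_load_future.set_exception(e)
            raise
        else:
            self.first_load_future.set_result(None)
```

With this change, awaiting first_load_future in start/stop/poll raises the original error immediately instead of blocking, so the failure surfaces where it can be acted on.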

The result should be that the Hub goes into CrashLoopBackoff while it can't talk to the k8s API, which I think is probably correct.

We could also give the first load more retry chances with exponential_backoff before it considers errors fatal, but I'm not sure if that's the right thing to do.
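If first-load retries were wanted, one way to approximate JupyterHub's exponential_backoff helper in plain asyncio is the retry_first_load sketch below. The function name and the default numbers are assumptions for the example, not part of this PR.

```python
import asyncio


async def retry_first_load(load, *, retries=5, start_wait=0.2, scale=2, max_wait=5):
    """Call the async `load` callable until it succeeds, backing off between tries.

    Re-raises the last error once all attempts fail, at which point the
    caller would treat the failure as fatal, exactly as in the fix above.
    """
    wait = start_wait
    for attempt in range(1, retries + 1):
        try:
            return await load()
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(wait)
            wait = min(wait * scale, max_wait)
```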

Commit message: ensures failure to start reflectors is fatal, rather than resulting in permanent hangs waiting for first_load_future
@consideRatio (Member) left a comment


Wieee nice work resolving this very tricky bug to catch! Thank you @minrk, @sgibson91, @abkfenris, @GeorgianaElena, and @gmanuch for working to resolve this and providing information leading to a resolution!

@consideRatio consideRatio merged commit 66086a9 into jupyterhub:main Aug 29, 2022
Development

Successfully merging this pull request may close these issues.

Resilience needed against k8s master outages