
catch errors in reflector.start #635

Merged
merged 1 commit into jupyterhub:main on Aug 29, 2022

Conversation

@minrk (Member) commented on Aug 26, 2022

ensures failure to start reflectors is fatal, rather than resulting in permanent hangs waiting for first_load_future

I believe this closes #627 (it at least explains one such event).

Relevant bits:

  • We already have code that makes repeated reflector update failures fatal to the Hub after a few retries
  • reflector.start is scheduled asynchronously, via ensure_future. Exceptions on this future are ignored.
  • methods like start/stop/poll await a first_load_future to ensure the reflector has loaded info at least once
  • first_load_future is only set on success, it's never set if the first load fails

All these combined mean that if the first load fails, the exception is ignored (it will be logged by asyncio with Task exception was never retrieved), and all subsequent calls wait forever instead of raising, never attempting to restart the reflector.
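The hang can be reproduced with a few lines of plain asyncio. The ToyReflector below is only an illustration of the pattern described above, not the real kubespawner code; the class and method names are made up for the example.

```python
# A minimal, self-contained sketch of the hang described above (toy code, not the
# actual kubespawner classes): an exception inside a task scheduled with
# asyncio.ensure_future is never retrieved, and a future that is only resolved on
# success leaves every awaiter stuck.
import asyncio


class ToyReflector:
    def __init__(self):
        # resolved only when the first successful load completes
        self.first_load_future = asyncio.Future()

    async def _list_and_update(self):
        # stand-in for the k8s API call that fails during an outage
        raise ConnectionError("cannot reach the k8s API")

    async def start(self):
        await self._list_and_update()
        # never reached on failure, so the future stays pending forever
        self.first_load_future.set_result(None)


async def main():
    reflector = ToyReflector()
    # exceptions raised in this task are swallowed; asyncio only logs
    # "Task exception was never retrieved" when the task is garbage collected
    asyncio.ensure_future(reflector.start())
    # this is effectively what start/stop/poll do; without the demo timeout
    # it would wait forever
    await asyncio.wait_for(reflector.first_load_future, timeout=2)


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except asyncio.TimeoutError:
        print("first load never resolved; real callers would hang forever")
```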

The events described in #627 are:

  • reflector starts having trouble talking to k8s (several aiohttp.client_exceptions.ClientConnectorError)
  • it keeps failing, so we reach the fatal "Pods reflector failed, halting Hub" error
  • Hub stops, restarts
  • On restart, if we are still in the disrupted state, the initial load fails again with the same aiohttp.client_exceptions.ClientConnectorError, but this time it's ignored (ERROR:asyncio:Task exception was never retrieved)
  • since first_load_future is never set, all start/stop/poll calls are just infinite waits

The fix here is to:

  1. catch (any) error in Reflector.start(),
  2. call the same fatal-cleanup on_reflector_failure function, and
  3. store the error in first_load_future so that anyone awaiting the first load gets the exception rather than a forever-incomplete Future (a sketch of this pattern follows the list)
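The shape of the fix, again as a hedged toy sketch rather than the merged diff (on_failure here stands in for the on_reflector_failure callback described above; the numbered comments map to the list items):

```python
import asyncio


class ToyReflector:
    def __init__(self, on_failure=None):
        self.first_load_future = asyncio.Future()
        # fatal-cleanup callback, standing in for on_reflector_failure
        self.on_failure = on_failure

    async def _list_and_update(self):
        raise ConnectionError("cannot reach the k8s API")

    async def start(self):
        try:
            # 1. catch any error raised while doing the first load
            await self._list_and_update()
        except Exception as e:
            # 2. run the same fatal cleanup used for later reflector failures
            if self.on_failure is not None:
                self.on_failure()
            # 3. hand the error to everyone awaiting the first load,
            #    instead of leaving the future pending forever
            if not self.first_load_future.done():
                self.first_load_future.set_exception(e)
            raise
        else:
            self.first_load_future.set_result(None)
```

With this change, awaiting first_load_future in start/stop/poll raises the original error immediately instead of blocking, so the failure surfaces where it can be acted on.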

The result should be that the Hub goes into CrashLoopBackoff while it can't talk to the k8s API, which I think is probably correct.

We could also give the first load more retry chances with exponential_backoff before it considers errors fatal, but I'm not sure if that's the right thing to do.
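If first-load retries were wanted, one way to approximate JupyterHub's exponential_backoff helper in plain asyncio is the retry_first_load sketch below. The function name and the default numbers are assumptions for the example, not part of this PR.

```python
import asyncio


async def retry_first_load(load, *, retries=5, start_wait=0.2, scale=2, max_wait=5):
    """Call the async `load` callable until it succeeds, backing off between tries.

    Re-raises the last error once all attempts fail, at which point the
    caller would treat the failure as fatal, exactly as in the fix above.
    """
    wait = start_wait
    for attempt in range(1, retries + 1):
        try:
            return await load()
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(wait)
            wait = min(wait * scale, max_wait)
```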

Commit message: ensures failure to start reflectors is fatal, rather than resulting in permanent hangs waiting for first_load_future
@consideRatio (Member) left a comment


Wieee nice work resolving this very tricky bug to catch! Thank you @minrk, @sgibson91, @abkfenris, @GeorgianaElena, and @gmanuch for working to resolve this and providing information leading to a resolution!

@consideRatio consideRatio merged commit 66086a9 into jupyterhub:main Aug 29, 2022
Development

Successfully merging this pull request may close these issues.

Resilience needed against k8s master outages