CustomResource Controllers stop receiving updates after watch reconnect #395

Closed · secondsun opened this issue Apr 16, 2021 · 2 comments

@secondsun (Contributor) commented Apr 16, 2021

Sometimes our controllers stop receiving updates about their custom resources. We've observed that a watch can become disconnected, and that the custom resource event source then tries to reconnect but fails. We believe this reconnect failure is the cause of the problem.

I've traced the reconnection and exceptions to here: https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/internal/CustomResourceEventSource.java#L157. I believe the registerWatch method is throwing an exception which isn't caught by the SDK. Because the exception is not caught, the watch stays dead and the event source no longer sends events. See my logs here, where I've isolated the failure I'm describing: https://gist.github.com/secondsun/8e31d2680ff689750c62ad6ce9f419c0
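
To illustrate, here's a minimal sketch of the pattern as I read it. This is not the actual SDK source; the isHttpGone() branch and the registerWatch() call are my assumptions based on reading the class:

```java
import io.fabric8.kubernetes.client.WatcherException;

// Simplified sketch of the failure mode, NOT the actual SDK source.
class WatchFailureSketch {

  public void onClose(WatcherException e) {
    if (e != null && e.isHttpGone()) {
      // If registerWatch itself throws here (e.g. the API server is briefly
      // unreachable), nothing above this callback catches the exception:
      // the watch stays dead and the event source silently stops
      // delivering events.
      registerWatch();
    }
  }

  private void registerWatch() {
    // Stand-in for the SDK's watch re-registration logic.
    throw new RuntimeException("connection refused");
  }
}
```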

To work around this for now, I am using reflection and CDI schedulers to get a reference to the CustomResourceEventSource and call onClose with a subclassed WatcherException (secondsun/app-services-operator@a25c4bd#diff-7a83aac91ab02c7b354a2f40496c5d38ca1c88d3f72a391e83b8e265eb341454R69); see the sketch below.
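
Roughly, the workaround looks like this. It is a sketch, not the exact code from that commit: the field name and the scheduler interval are assumptions, as is the premise that fabric8's WatcherException can be subclassed so isHttpGone() returns true and the reconnect branch runs:

```java
import io.fabric8.kubernetes.client.WatcherException;
import io.javaoperatorsdk.operator.processing.event.internal.CustomResourceEventSource;
import io.quarkus.scheduler.Scheduled;
import java.lang.reflect.Field;

public class WatchReviver {

  private final CustomResourceEventSource eventSource;

  public WatchReviver(Object eventSourceHolder) throws ReflectiveOperationException {
    // The field name here is an assumption; it depends on the SDK version in use.
    Field f = eventSourceHolder.getClass().getDeclaredField("customResourceEventSource");
    f.setAccessible(true);
    this.eventSource = (CustomResourceEventSource) f.get(eventSourceHolder);
  }

  @Scheduled(every = "10m")
  void forceReconnect() {
    // Drive the same code path as a real watch disconnect, so the event
    // source re-registers its watch and resumes delivering events.
    eventSource.onClose(new WatcherException("scheduled forced reconnect") {
      @Override
      public boolean isHttpGone() {
        return true; // force the reconnect branch
      }
    });
  }
}
```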

Clearly this is not an ideal solution; I'm looking for better workarounds, and for what the best fix at the SDK level could be.

@secondsun (Contributor, Author) commented

After some thought, I have come up with two "easy" solutions that would work for my use case.

  1. If registerWatch in the onClose(WatcherException) method throws an exception, call System.exit(1). This will restart the pod, which will reestablish all connections.
  2. Alternatively, retry registerWatch with an exponential backoff (see the sketch after this list).
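
Here's a sketch of option 2. The retry loop is the point here, not the exact names; registerWatch is stubbed as a Runnable because the real SDK-internal call has a different signature:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class WatchReconnector {

  private static final Logger log = LoggerFactory.getLogger(WatchReconnector.class);

  // Retry the (hypothetical) watch re-registration with exponential backoff
  // instead of letting a single failure kill the watch permanently.
  void reconnectWithBackoff(Runnable registerWatch) {
    long delayMs = 1_000;           // start at 1 second
    final long maxDelayMs = 60_000; // cap the delay at 60 seconds
    while (true) {
      try {
        registerWatch.run();
        return; // watch re-established
      } catch (RuntimeException e) {
        log.warn("Re-registering watch failed, retrying in {} ms", delayMs, e);
        try {
          Thread.sleep(delayMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return; // stop retrying if we are being shut down
        }
        delayMs = Math.min(delayMs * 2, maxDelayMs);
      }
    }
  }
}
```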

Harder solutions would include firing events when a resource's watch dies, so the controllers for those resources can be alerted, or exposing the state of the event sources to the application in some way; a hypothetical shape for this is sketched below.
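
Purely to illustrate the shape such an API could take (none of these types exist in the SDK today; the names are invented):

```java
// Hypothetical listener interface; names are invented for illustration.
public interface EventSourceHealthListener {

  // Called when a watch terminates and the automatic re-register fails,
  // so the controller can alert, self-heal, or surface the condition.
  void onWatchDead(String resourceClassName, Throwable cause);
}
```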

@wtrocki commented Apr 16, 2021

@secondsun I've done some reading on why connection issues might happen:

  1. the cluster is unhealthy
  2. the cluster is overloaded
  3. there are networking issues between nodes

The reason we're seeing this is that our OpenStack QA clusters can sometimes be flaky.

If we keep restarting the pod on an overloaded cluster we might contribute to the problem; however, in that case Kubernetes would do the hard work for us via crash-loop backoff.

Approach no. 2, with our own backoff, could be problematic: there would be no way to monitor it properly, and a dropped connection could be a one-time freak accident that leaves some CRs unprocessed while others keep working. Checking the operator logs would give us this info, of course.

Knowing the pros and cons, I think we should go with approach no. 1, as the other one would require better monitoring.
