CustomResource Controllers stop receiving updates after watch reconnect #395
Comments
After some thought, I have come up with two "easy" solutions that would work for my use case.
Harder solutions would include adding events that are fired if a resource's watch dies, so that controllers for those resources can be alerted, or having some way to expose or access the state of the event sources from the application.
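To make the "harder" idea more concrete, here is a minimal sketch of what a watch-died notification plus queryable state could look like. Everything in it (`WatchHealth`, `WatchClosedListener`, `watchDied`, `isHealthy`) is a hypothetical name invented for illustration; none of it exists in the SDK.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch, not SDK API: tracks watch state and notifies listeners. */
public class WatchHealth {

    /** Hypothetical callback a controller (or the application) could implement. */
    public interface WatchClosedListener {
        void onWatchClosed(String resourceKind, Exception cause);
    }

    private final List<WatchClosedListener> listeners = new CopyOnWriteArrayList<>();
    private volatile boolean healthy = true;

    public void addListener(WatchClosedListener listener) {
        listeners.add(listener);
    }

    /** An event source would call this when its watch could not be re-established. */
    public void watchDied(String resourceKind, Exception cause) {
        healthy = false;
        for (WatchClosedListener listener : listeners) {
            listener.onWatchClosed(resourceKind, cause);
        }
    }

    /** Lets the application query the state, e.g. from a liveness probe. */
    public boolean isHealthy() {
        return healthy;
    }
}
```

With something like this, an application could wire `isHealthy()` into a health endpoint, or register a listener that logs or reacts when a watch dies.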
@secondsun I did some reading on why the connection issue might happen:
The reason we have seen this is that our OpenStack QA clusters sometimes run into bugs. If we keep restarting the pod on overloaded clusters, we might contribute to the problem; however, in that case Kubernetes would do the hard work through its crash loop backoff. Approach nr 2, with our own backoff, could be problematic because there would be no way to monitor it properly, and a dropped connection could be a one-time freak accident that leads to some CRs not being processed while others are. Obviously, checking the operator logs would give us this info. Knowing the pros and cons, I think we should go with approach nr 1, as the other one would require better monitoring.
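For illustration, this is roughly how a fail-fast variant could look, assuming "approach nr 1" means letting the operator process terminate when the watch cannot be re-registered, so that Kubernetes restarts the pod with its crash loop backoff (the comment above does not spell this out). `FailFastReconnect` and its exit hook are hypothetical names, not SDK code.

```java
import java.util.function.IntConsumer;

/** Hypothetical sketch, not SDK API: exit the process if the watch cannot be re-registered. */
public class FailFastReconnect {

    private final Runnable registerWatch; // stands in for the SDK's re-registration step
    private final IntConsumer exit;       // System::exit in production, a stub in tests

    public FailFastReconnect(Runnable registerWatch, IntConsumer exit) {
        this.registerWatch = registerWatch;
        this.exit = exit;
    }

    /** Returns true if the watch was re-registered; otherwise calls the exit hook with status 1. */
    public boolean onWatchClosed() {
        try {
            registerWatch.run();
            return true;
        } catch (RuntimeException e) {
            System.err.println("Could not re-register watch, terminating: " + e);
            exit.accept(1);
            return false;
        }
    }
}
```

In a real operator the hook would be `System::exit`; injecting it keeps the behaviour testable and makes the restart decision explicit in one place.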
Sometimes our controllers stop receiving updates about their custom resources. We've seen that sometimes a watch will become disconnected, and the custom resource event source will try to reconnect but fail to do so. We think that this reconnect failure is causing this problem.
I've traced the reconnection and exceptions to here: https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/internal/CustomResourceEventSource.java#L157 . I believe that the `registerWatch` method is throwing an exception which isn't caught by the SDK. Because the exception is not caught by the SDK, the watch stays dead and the event source no longer sends events. See my logs here, where I've isolated the failure I'm describing: https://gist.github.com/secondsun/8e31d2680ff689750c62ad6ce9f419c0
To work around this for now, I am trying to use reflection and CDI schedulers to get a reference to the CustomResourceEventSource and call "onClose" with a subclassed watchexception (secondsun/app-services-operator@a25c4bd#diff-7a83aac91ab02c7b354a2f40496c5d38ca1c88d3f72a391e83b8e265eb341454R69).
Clearly this is not an ideal solution, and I'm looking for workarounds or what the best fix at the SDK level could be.
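As one possible shape for an SDK-level fix, the sketch below catches the exception thrown during re-registration and retries with an exponential backoff instead of letting the watch stay dead. It is a standalone illustration under stated assumptions: `WatchReconnect` and `reconnectWithBackoff` are hypothetical names, and the `Runnable` merely stands in for the `registerWatch` call mentioned above; no SDK or client types are used.

```java
/** Hypothetical sketch, not SDK API: retry a failing watch re-registration. */
public class WatchReconnect {

    /**
     * Calls registerWatch until it succeeds or maxAttempts is reached,
     * doubling the delay after each failure.
     *
     * @return the number of attempts used, or -1 if every attempt failed
     */
    public static int reconnectWithBackoff(Runnable registerWatch,
                                           int maxAttempts,
                                           long initialDelayMillis) throws InterruptedException {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                registerWatch.run();
                return attempt; // watch is alive again
            } catch (RuntimeException e) {
                // The key point: the failure is caught and logged, not propagated silently.
                System.err.println("Re-registering watch failed (attempt " + attempt + "): " + e);
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        return -1; // caller decides what to do next, e.g. alert or terminate
    }
}
```

A return value of -1 leaves the final decision to the caller, which is where alerting or terminating the process could be hooked in.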