-
-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix applier hangs which can happen with many watched objects #925
Conversation
…dren When the applier scheduler channel buffer gets full, it will block preventing any further progress by the applier. This can be triggered with many objects which have many children that are being watched as the buffer fills up and progress is no longer made and re-queues can get to a point where they are never processed once the buffer is full. Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>
3848060
to
0bf2f83
Compare
This comment was marked as off-topic.
This comment was marked as off-topic.
Oh wow. Thanks a lot for reporting this! Annoying case to have to reproduce I imagine, appreciate digging into it and coming up with a fix. As you say, we could potentially expose this as a configuration parameter down the line, but I think the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, that sucks! Thanks for finding and reporting this!
I disagree that this is "the" long-term solution. The buffer should be consumed relatively quickly and written into the scheduler. The root problem here seems to be a deadlock where we stop trying to progress the scheduler ingress while scheduling new jobs, ensuring that the ingress buffer stays full. The proper solution would then be to decouple those two jobs somehow, not to unbound the input buffer. But until then this is a huge improvement over the status quo. |
Do we want to do a 0.73.1 backport of this? The breakage in 0.74.0 honestly feels pretty trivial for getting this bugfix out... :/ |
I think we can rename the milestone to 0.73.1 actually yeah; not because it's not breaking, but because the thing the removal depends on was actually unusable in 0.73 (we had already removed the dependency on crd::v1beta1). |
Didn't the attribute use to still work as long as you set it to |
oh, hm... yes. damnit. |
Maybe it would make sense to just keep it around (but remove it from docs) for when v2 eventually shows up and we'll need to readd it anyway? |
Maybe, but I'd rather take a branch off 0.73 and cherry pick the two fixes for a 0.73.1 release. EDIT: it's also not a clean backout if we were to remove it, since there's lots of v1beta1 stuff handled there. |
That also works. Will you do it or should I? |
If you have time, i'd appreciate it, otherwise i can look at it in the evening. |
…#925) fix applier hangs which can happen with many watched objects and children When the applier scheduler channel buffer gets full, it will block preventing any further progress by the applier. This can be triggered with many objects which have many children that are being watched as the buffer fills up and progress is no longer made and re-queues can get to a point where they are never processed once the buffer is full. Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>
[0.73 backport] fix applier hangs which can happen with many watched objects (#925)
@moustafab @clux This is now released as 0.73.1, and will also be a part of 0.74.0 when that drops. Thanks again! |
Signed-off-by: Teo Klestrup Röijezon <teo@nullable.se>
Motivation
We have a controller that uses a number of arbitrary streams (including some watchers) to drive the reconciliation loop. In smaller scale environments with fewer objects, the controller behaves as expected. However in clusters with many objects (and children to those objects) the same controller hangs shortly startup and never reconciles again, nor does it handle any of the relatively fast re-queues from the initial reconciliation loop.
When the applier scheduler channel buffer gets full, it will block preventing any further progress by the applier. This can be triggered with many objects which have many children that are being watched as the buffer fills up and progress is no longer made and re-queues can get to a point where they are never processed once the buffer is full.
Solution
I went with an unbounded channel for the applier because it is the only way to ensure the applier doesn't block, however if it's necessary to use a bounded channel I suppose we could provide an interface for setting the buffer size, but I assume that'd be a breaking change.