
fix applier hangs which can happen with many watched objects #925

Merged: 1 commit merged into kube-rs:master from moustafab/applier-hang on Jun 3, 2022

Conversation

moustafab
Contributor

Motivation

We have a controller that uses a number of arbitrary streams (including some watchers) to drive the reconciliation loop. In smaller-scale environments with fewer objects, the controller behaves as expected. However, in clusters with many objects (and children of those objects), the same controller hangs shortly after startup and never reconciles again, nor does it process any of the relatively fast re-queues from the initial reconciliation loop.

When the applier's scheduler channel buffer fills up, sends into it block, which prevents any further progress by the applier. This is easy to trigger with many watched objects that have many watched children: the buffer fills faster than it is drained, progress stops, and once the buffer is full the re-queues are never processed.
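Not the applier's actual code, but a minimal sketch of the failure mode using a generic bounded futures mpsc channel (the buffer size and payload type here are arbitrary): once the buffer is full and nothing drains it, no further sends can complete.

```rust
// Minimal sketch of the backpressure problem, assuming the `futures` crate.
use futures::channel::mpsc;

fn main() {
    // Deliberately tiny buffer so the effect is easy to see.
    let (mut tx, _rx) = mpsc::channel::<u32>(2);

    for i in 0..10 {
        if tx.try_send(i).is_err() {
            // Once the buffer is full, try_send fails; the async `send` would
            // instead stay pending forever because nothing here drains `_rx`,
            // which is the hang described above.
            println!("buffer full after queuing {i} items; an async send would block here");
            break;
        }
    }
}
```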

Solution

I went with an unbounded channel for the applier because it is the only way to ensure the applier doesn't block. If a bounded channel turns out to be necessary, we could instead expose an interface for setting the buffer size, but I assume that would be a breaking change.
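For illustration, roughly what the unbounded variant looks like with a plain futures mpsc channel (again a generic sketch, not the applier itself): sends always succeed while the receiver is alive, with the trade-off that the backlog lives in memory until it is drained.

```rust
// Generic sketch of the unbounded approach, assuming the `futures` crate.
use futures::{channel::mpsc, executor::block_on, StreamExt};

fn main() {
    let (tx, mut rx) = mpsc::unbounded::<u32>();

    // unbounded_send never applies backpressure; it only fails if the
    // receiver has been dropped.
    for i in 0..10_000 {
        tx.unbounded_send(i).expect("receiver dropped");
    }
    drop(tx);

    // The cost: the whole backlog sits in memory until the consumer drains it.
    let drained = block_on(async {
        let mut n = 0u32;
        while rx.next().await.is_some() {
            n += 1;
        }
        n
    });
    println!("drained {drained} items");
}
```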

fix applier hangs which can happen with many watched objects and children

When the applier's scheduler channel buffer fills up, sends into it block, which prevents any further progress by the applier. This is easy to trigger with many watched objects that have many watched children: the buffer fills faster than it is drained, progress stops, and once the buffer is full the re-queues are never processed.

Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>

@clux clux added the changelog-fix (changelog fix category for prs) label on Jun 3, 2022
@clux clux added this to the 0.74.0 milestone Jun 3, 2022
@clux
Member

clux commented Jun 3, 2022

Oh wow. Thanks a lot for reporting this! An annoying case to have to reproduce, I imagine; I appreciate you digging into it and coming up with a fix.

As you say, we could potentially expose this as a configuration parameter down the line, but I think the unbounded channel is the more general solution here, one that is meant to work in any cluster, and it is a good Controller default. If that makes memory usage balloon more than expected on busy clusters, that's a separate problem we can look into addressing later.

Member

@nightkr nightkr left a comment


Oof, that sucks! Thanks for finding and reporting this!

@clux clux merged commit c9e0c97 into kube-rs:master Jun 3, 2022
@nightkr
Member

nightkr commented Jun 3, 2022

I disagree that this is "the" long-term solution. The buffer should be consumed relatively quickly and written into the scheduler. The root problem here seems to be a deadlock: we stop trying to progress the scheduler ingress while scheduling new jobs, which ensures that the ingress buffer stays full.

The proper solution would then be to decouple those two jobs somehow, not to unbound the input buffer.

But until then this is a huge improvement over the status quo.
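To illustrate the decoupling idea above (a generic sketch with made-up names, not the actual applier internals): keep moving requests out of the bounded ingress in the same loop that drives already-scheduled work, so in-flight jobs never leave the ingress buffer sitting full.

```rust
// Sketch only: a loop that always drains the bounded ingress before driving
// one unit of scheduled work. Assumes the `futures` crate; a real applier
// would poll asynchronously rather than spin.
use std::collections::VecDeque;
use futures::channel::mpsc;

fn main() {
    let (mut tx, mut ingress) = mpsc::channel::<u32>(8);
    for i in 0..5 {
        tx.try_send(i).unwrap();
    }
    drop(tx); // in a real controller, watcher streams would keep feeding this

    let mut scheduled: VecDeque<u32> = VecDeque::new();
    let mut ingress_open = true;

    while ingress_open || !scheduled.is_empty() {
        // 1. Always move whatever is ready out of the bounded ingress,
        //    freeing its capacity regardless of how much work is queued.
        while ingress_open {
            match ingress.try_next() {
                Ok(Some(req)) => scheduled.push_back(req),
                Ok(None) => ingress_open = false, // all senders are gone
                Err(_) => break,                  // nothing ready right now
            }
        }
        // 2. Independently drive one unit of already-scheduled work.
        if let Some(req) = scheduled.pop_front() {
            println!("reconciling object {req}");
        }
    }
}
```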

@nightkr nightkr added the bug (Something isn't working) and runtime (controller runtime related) labels on Jun 3, 2022
@nightkr
Member

nightkr commented Jun 3, 2022

Do we want to do a 0.73.1 backport of this? The breakage in 0.74.0 honestly feels pretty trivial for getting this bugfix out... :/

@clux
Member

clux commented Jun 3, 2022

Yeah, I think we can actually rename the milestone to 0.73.1; not because it's not breaking, but because the thing the removal depends on was actually unusable in 0.73 (we had already removed the dependency on crd::v1beta1).

@nightkr
Member

nightkr commented Jun 3, 2022

Didn't the attribute used to still work as long as you set it to v1?

@clux
Member

clux commented Jun 3, 2022

oh, hm... yes. damnit.

@nightkr
Member

nightkr commented Jun 3, 2022

Maybe it would make sense to just keep it around (but remove it from the docs) for when v2 eventually shows up and we'll need to re-add it anyway?

@clux
Member

clux commented Jun 3, 2022

Maybe, but I'd rather take a branch off 0.73 and cherry pick the two fixes for a 0.73.1 release.

EDIT: it's also not a clean backout if we were to remove it, since there's lots of v1beta1 stuff handled there.

@nightkr
Member

nightkr commented Jun 3, 2022

That also works. Will you do it or should I?

@clux
Member

clux commented Jun 3, 2022

If you have time, I'd appreciate it; otherwise I can look at it in the evening.

nightkr pushed a commit to nightkr/kube-rs that referenced this pull request on Jun 3, 2022

fix applier hangs which can happen with many watched objects and children (#925)


Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>
@moustafab
Contributor Author

@clux @teozkr Thanks so much for the quick turnaround!

nightkr added a commit that referenced this pull request Jun 3, 2022
[0.73 backport] fix applier hangs which can happen with many watched objects (#925)
@nightkr
Member

nightkr commented Jun 3, 2022

@moustafab @clux This is now released as 0.73.1, and will also be a part of 0.74.0 when that drops. Thanks again!

nightkr added a commit to nightkr/kube-rs that referenced this pull request Jun 8, 2022
nightkr added a commit to nightkr/kube-rs that referenced this pull request Jun 8, 2022
Signed-off-by: Teo Klestrup Röijezon <teo@nullable.se>
@moustafab moustafab deleted the moustafab/applier-hang branch May 9, 2023 23:15