
fix applier hangs which can happen with many watched objects #925

Merged: 1 commit merged into kube-rs:master from moustafab/applier-hang on Jun 3, 2022

Conversation

moustafab
Contributor

Motivation

We have a controller that uses a number of arbitrary streams (including some watchers) to drive the reconciliation loop. In smaller-scale environments with fewer objects, the controller behaves as expected. However, in clusters with many objects (and children of those objects), the same controller hangs shortly after startup and never reconciles again, nor does it process any of the relatively fast re-queues from the initial reconciliation loop.

When the applier's scheduler channel buffer fills up, sends into it block, which prevents any further progress by the applier. This is easy to trigger with many watched objects that have many watched children: the buffer fills faster than it is drained, progress stops, and once the buffer is full the re-queues are never processed.
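Not the applier's actual code, but a minimal sketch of the failure mode using a generic bounded futures mpsc channel (the buffer size and payload type here are arbitrary): once the buffer is full and nothing drains it, no further sends can complete.

```rust
// Minimal sketch of the backpressure problem, assuming the `futures` crate.
use futures::channel::mpsc;

fn main() {
    // Deliberately tiny buffer so the effect is easy to see.
    let (mut tx, _rx) = mpsc::channel::<u32>(2);

    for i in 0..10 {
        if tx.try_send(i).is_err() {
            // Once the buffer is full, try_send fails; the async `send` would
            // instead stay pending forever because nothing here drains `_rx`,
            // which is the hang described above.
            println!("buffer full after queuing {i} items; an async send would block here");
            break;
        }
    }
}
```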

Solution

I went with an unbounded channel for the applier because it is the only way to ensure the applier doesn't block. If a bounded channel turns out to be necessary, we could instead expose an interface for setting the buffer size, but I assume that would be a breaking change.
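For illustration, roughly what the unbounded variant looks like with a plain futures mpsc channel (again a generic sketch, not the applier itself): sends always succeed while the receiver is alive, with the trade-off that the backlog lives in memory until it is drained.

```rust
// Generic sketch of the unbounded approach, assuming the `futures` crate.
use futures::{channel::mpsc, executor::block_on, StreamExt};

fn main() {
    let (tx, mut rx) = mpsc::unbounded::<u32>();

    // unbounded_send never applies backpressure; it only fails if the
    // receiver has been dropped.
    for i in 0..10_000 {
        tx.unbounded_send(i).expect("receiver dropped");
    }
    drop(tx);

    // The cost: the whole backlog sits in memory until the consumer drains it.
    let drained = block_on(async {
        let mut n = 0u32;
        while rx.next().await.is_some() {
            n += 1;
        }
        n
    });
    println!("drained {drained} items");
}
```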

fix applier hangs which can happen with many watched objects and children

When the applier's scheduler channel buffer fills up, sends into it block, which prevents any further progress by the applier. This is easy to trigger with many watched objects that have many watched children: the buffer fills faster than it is drained, progress stops, and once the buffer is full the re-queues are never processed.

Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>

@clux clux added the changelog-fix (changelog fix category for prs) label on Jun 3, 2022
@clux clux added this to the 0.74.0 milestone Jun 3, 2022
@clux
Member

clux commented Jun 3, 2022

Oh wow. Thanks a lot for reporting this! An annoying case to have to reproduce, I imagine; I appreciate you digging into it and coming up with a fix.

As you say, we could potentially expose this as a configuration parameter down the line, but I think the unbounded channel is the more general solution here, one that is meant to work in any cluster, and it is a good Controller default. If that makes memory usage balloon more than expected on busy clusters, that's a separate problem we can look into addressing later.

Member

@nightkr nightkr left a comment


Oof, that sucks! Thanks for finding and reporting this!

@clux clux merged commit c9e0c97 into kube-rs:master Jun 3, 2022
@nightkr
Member

nightkr commented Jun 3, 2022

I disagree that this is "the" long-term solution. The buffer should be consumed relatively quickly and written into the scheduler. The root problem here seems to be a deadlock: we stop trying to progress the scheduler ingress while scheduling new jobs, which ensures that the ingress buffer stays full.

The proper solution would then be to decouple those two jobs somehow, not to unbound the input buffer.

But until then this is a huge improvement over the status quo.
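To illustrate the decoupling idea above (a generic sketch with made-up names, not the actual applier internals): keep moving requests out of the bounded ingress in the same loop that drives already-scheduled work, so in-flight jobs never leave the ingress buffer sitting full.

```rust
// Sketch only: a loop that always drains the bounded ingress before driving
// one unit of scheduled work. Assumes the `futures` crate; a real applier
// would poll asynchronously rather than spin.
use std::collections::VecDeque;
use futures::channel::mpsc;

fn main() {
    let (mut tx, mut ingress) = mpsc::channel::<u32>(8);
    for i in 0..5 {
        tx.try_send(i).unwrap();
    }
    drop(tx); // in a real controller, watcher streams would keep feeding this

    let mut scheduled: VecDeque<u32> = VecDeque::new();
    let mut ingress_open = true;

    while ingress_open || !scheduled.is_empty() {
        // 1. Always move whatever is ready out of the bounded ingress,
        //    freeing its capacity regardless of how much work is queued.
        while ingress_open {
            match ingress.try_next() {
                Ok(Some(req)) => scheduled.push_back(req),
                Ok(None) => ingress_open = false, // all senders are gone
                Err(_) => break,                  // nothing ready right now
            }
        }
        // 2. Independently drive one unit of already-scheduled work.
        if let Some(req) = scheduled.pop_front() {
            println!("reconciling object {req}");
        }
    }
}
```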

@nightkr nightkr added the bug (Something isn't working) and runtime (controller runtime related) labels on Jun 3, 2022
@nightkr
Member

nightkr commented Jun 3, 2022

Do we want to do a 0.73.1 backport of this? The breakage in 0.74.0 honestly feels pretty trivial for getting this bugfix out... :/

@clux
Member

clux commented Jun 3, 2022

Yeah, I think we can actually rename the milestone to 0.73.1; not because it's not breaking, but because the thing the removal depends on was actually unusable in 0.73 (we had already removed the dependency on crd::v1beta1).

@nightkr
Member

nightkr commented Jun 3, 2022

Didn't the attribute used to still work as long as you set it to v1?

@clux
Member

clux commented Jun 3, 2022

oh, hm... yes. damnit.

@nightkr
Member

nightkr commented Jun 3, 2022

Maybe it would make sense to just keep it around (but remove it from the docs) for when v2 eventually shows up and we'll need to re-add it anyway?

@clux
Member

clux commented Jun 3, 2022

Maybe, but I'd rather take a branch off 0.73 and cherry pick the two fixes for a 0.73.1 release.

EDIT: it's also not a clean backout if we were to remove it, since there's lots of v1beta1 stuff handled there.

@nightkr
Member

nightkr commented Jun 3, 2022

That also works. Will you do it or should I?

@clux
Member

clux commented Jun 3, 2022

If you have time, I'd appreciate it; otherwise I can look at it in the evening.

nightkr pushed a commit to nightkr/kube-rs that referenced this pull request on Jun 3, 2022

fix applier hangs which can happen with many watched objects and children (#925)


Signed-off-by: Moustafa Baiou <moustafab.ccit@gmail.com>
@moustafab
Contributor Author

@clux @teozkr Thanks so much for the quick turnaround!

nightkr added a commit that referenced this pull request Jun 3, 2022
[0.73 backport] fix applier hangs which can happen with many watched objects (#925)
@nightkr
Member

nightkr commented Jun 3, 2022

@moustafab @clux This is now released as 0.73.1, and will also be a part of 0.74.0 when that drops. Thanks again!

nightkr added a commit to nightkr/kube-rs that referenced this pull request Jun 8, 2022
nightkr added a commit to nightkr/kube-rs that referenced this pull request Jun 8, 2022
Signed-off-by: Teo Klestrup Röijezon <teo@nullable.se>
@moustafab moustafab deleted the moustafab/applier-hang branch May 9, 2023 23:15