reset `coalesceTimer` to nil as soon as the event is consumed #11924
a6f659a to c6c3784
Good find! I'm surprised this didn't come up during the scale tests where the issue from #9689 came up.
@@ -294,6 +294,8 @@ func (s *state) run(ctx context.Context, snap *ConfigSnapshot) {
    }

    case <-sendCh:
        // Allow the next change to trigger a send
        coalesceTimer = nil
If this is reset here, wouldn't the check in L322 always be true?
I'm not sure I understand what this timer is doing any more. The original intent was to prevent updates to multiple blocking queries that return at the same time from causing multiple reconfigurations of Envoy in quick succession.
For example, on startup a proxy might start watching 100 upstream services and config entries. Without coalescing here, we would end up delivering O(100) entire snapshot reconfigurations to Envoy in the space of a few hundred milliseconds as all the blocking queries return their initial results.
It's not clear to me if this is actually doing that any more or if immediately resetting it here is causing it to effectively be removed? I've not traced this code for long enough to fully remember how it works though.
The other unknown for me is whether this coalescing is even needed or useful anymore since we implemented Delta XDS. My understanding is that now, even if we deliver multiple snapshots, the delta part of Delta XDS will ensure we only send the changes in each one, so from Envoy's perspective it might not make any difference. It could still save some CPU cycles on the agent that is doing the delivering, though, which could be significant as we move towards Servers doing that. It's not clear that they will even use this package as it is, so I perhaps wouldn't optimize too much for that.
I'd be interested to know at any rate whether this change has a material impact on the number of reconfigurations sent to Envoy when there are a lot of upstreams, and if not, whether we should just remove all the coalesce code entirely and simplify this loop.
Never mind. I get it now. I missed the main place the timer is reset, outside of the `select` in the for loop way below (not in the default branch). Originally, before the last deadlock fix, it was always reset in this branch anyway: https://github.com/hashicorp/consul/blame/cf9a14ab6ad25e8932eb2b07616c0c33ddc54b12/agent/proxycfg/state.go#L614-L631
I think Freddy might be right that the conditional in the default case on 322 is redundant, as it will now always be true in that case. But this doesn't break all of coalescing, because there is still the conditional outside the select, which is the main place we throttle updates being delivered to `sendCh`.
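The interaction discussed in this thread can be modeled without real timers. The sketch below is only one reading of the conversation, not the actual `state.go` logic: `loopState`, `timerSet`, `resetOnFire`, and the method names are all hypothetical. `timerSet` stands in for `coalesceTimer != nil`, and `resetOnFire` toggles the behavior this PR introduces.

```go
package main

import "fmt"

// loopState is a hypothetical, timer-free model of the run loop discussed above.
type loopState struct {
	timerSet    bool // stands in for coalesceTimer != nil
	resetOnFire bool // true models resetting coalesceTimer when sendCh is consumed
	armCount    int  // how many times a timer was armed
}

// arm models the "if coalesceTimer == nil { start timer }" guard.
func (s *loopState) arm() {
	if !s.timerSet {
		s.timerSet = true
		s.armCount++
	}
}

// onChange models the conditional outside the select after a config change.
func (s *loopState) onChange() { s.arm() }

// onTimerFired models the `case <-sendCh:` branch. delivered reports whether
// the non-blocking send to the snapshot channel succeeded.
func (s *loopState) onTimerFired(delivered bool) {
	if s.resetOnFire {
		s.timerSet = false // allow the next change to trigger a send
	}
	if delivered {
		s.timerSet = false
		return
	}
	s.arm() // default branch: the consumer is busy, so schedule a retry
}

func main() {
	for _, reset := range []bool{false, true} {
		s := &loopState{resetOnFire: reset}
		s.onChange()          // first change arms the timer
		s.onTimerFired(false) // snapshot channel is full
		s.onChange()          // a later change arrives
		fmt.Printf("resetOnFire=%v armCount=%d\n", reset, s.armCount)
	}
	// Output:
	// resetOnFire=false armCount=1
	// resetOnFire=true armCount=2
}
```

In this model, without the reset the fired timer still looks "set", so neither the retry nor later changes arm a new one and no further send is scheduled. With the reset, the guard inside the default branch always passes, which is why that check was called redundant.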
Yes, that's right, the check is not needed.
The use case where I found the bug has a lot of updates happening (especially for an ingress Kind). I'm not sure why, but I saw many updates per second for the same ingress service, so I would guess that `coalesceTimeout` is useful for this type of behavior.

// send two snapshots back to back without reading them to overflow the snapshot channel and get to the default use case
for i := 0; i < 2; i++ {
    time.Sleep(250 * time.Millisecond)
What's this sleep used for? Could it be removed to reduce the running time for the test?
This sleep ensures that we saturate the snap channel and reach the `default` case. Because of the check in http://github.com/hashicorp/consul/blob/f68b1a4a77741ae4c293571efd32b943971ec9ab/agent/proxycfg/state.go#L357-L357, if we don't wait between updates we only send once, which does not fill `snapCh` and so never gets us to the `default` case.
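For readers unfamiliar with the `default` case being exercised here, this is a minimal sketch of a non-blocking send in Go. The `trySend` helper is hypothetical, and the channel capacity of 1 is an assumption for illustration rather than the actual size of `snapCh` in Consul.

```go
package main

import "fmt"

// trySend is a hypothetical helper that attempts a non-blocking send and
// reports whether the value was accepted by the channel.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default: // taken when the channel buffer is full and nobody is receiving
		return false
	}
}

func main() {
	snapCh := make(chan int, 1) // capacity 1 is an assumption

	fmt.Println(trySend(snapCh, 1)) // true: the buffer had room
	fmt.Println(trySend(snapCh, 2)) // false: unread value fills the buffer, default is taken
	<-snapCh                        // a consumer reads the pending value
	fmt.Println(trySend(snapCh, 3)) // true: the buffer has room again
}
```

A test that wants to hit the `default` branch therefore needs at least one delivered-but-unread value sitting in the channel before the next send is attempted, which is what spacing out the two updates achieves.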
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/540755.
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/540783.
🍒✅ Cherry pick of commit e653f81 onto
* reset `coalesceTimer` to nil as soon as the event is consumed
* add change log
* refactor to add relevant test.
* fix linter
* Apply suggestions from code review
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* remove non needed check
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
This fixes #11923