reset coalesceTimer to nil as soon as the event is consumed (#11924)
@@ -0,0 +1,3 @@
```release-note:bug
xds: fix a deadlock when the snapshot channel already has a snapshot to be consumed.
```
```diff
@@ -294,6 +294,8 @@ func (s *state) run(ctx context.Context, snap *ConfigSnapshot) {
 		}

 	case <-sendCh:
+		// Allow the next change to trigger a send
+		coalesceTimer = nil

 		// Make a deep copy of snap so we don't mutate any of the embedded structs
 		// etc on future updates.
 		snapCopy, err := snap.Clone()
```

Review comment: If this is reset here, wouldn't the check in L322 always be true?

Review comment: I'm not sure I understand what this timer is doing any more. The original intent was to prevent updates to multiple blocking queries that return at the same time from causing multiple reconfigurations of Envoy in quick succession. For example, on startup a proxy might start watching 100 upstream services and config entries; without coalescing here, we would end up delivering O(100) entire snapshot reconfigurations to Envoy in the space of a few hundred milliseconds as all the blocking queries return their initial results. It's not clear to me whether this is still doing that, or whether immediately resetting it here effectively removes the coalescing. I've not traced this code for long enough to fully remember how it works, though.

The other unknown for me is whether this coalescing is even needed or useful any more since we implemented Delta xDS. My understanding is that now, even if we deliver multiple snapshots, the delta part of Delta xDS will ensure we only send the changes in each one down, so from Envoy's perspective it might not make any difference. It could still save some CPU cycles on the agent doing the delivering, which could be significant as we move towards servers doing that, but it's not clear they will even use this package as it is, so I perhaps wouldn't optimize too much for that.

I'd be interested to know, at any rate, whether this change has a material impact on the number of reconfigurations sent to Envoy when there are a lot of upstreams, and if not, whether we should just remove all the coalesce code entirely and simplify this loop.

Reply: Never mind, I get it now. I missed the main place the timer is reset, outside the select in the for loop way below (not in the default branch). Originally, before the last deadlock fix, it was always reset in this branch anyway: https://github.com/hashicorp/consul/blame/cf9a14ab6ad25e8932eb2b07616c0c33ddc54b12/agent/proxycfg/state.go#L614-L631

I think Freddy might be right that the conditional in the default case on line 322 is redundant, as it will always be true again now in that case, but this doesn't break all of coalescing, because there is still the conditional outside the select that is the main place we throttle updates being delivered.

Reply: Yes, that's right, the check is not needed. The use case where I found the bug has a lot of updates happening (especially for an ingress Kind); not sure why, but I saw a lot of updates happening per second for the same ingress service. I would guess that the…
```diff
@@ -307,9 +309,6 @@ func (s *state) run(ctx context.Context, snap *ConfigSnapshot) {
 	case s.snapCh <- *snapCopy:
 		s.logger.Trace("Delivered new snapshot to proxy config watchers")

-		// Allow the next change to trigger a send
-		coalesceTimer = nil
-
 		// Skip rest of loop - there is nothing to send since nothing changed on
 		// this iteration
 		continue
```
```diff
@@ -320,11 +319,9 @@ func (s *state) run(ctx context.Context, snap *ConfigSnapshot) {
 		s.logger.Trace("Failed to deliver new snapshot to proxy config watchers")

 		// Reset the timer to retry later. This is to ensure we attempt to redeliver the updated snapshot shortly.
-		if coalesceTimer == nil {
-			coalesceTimer = time.AfterFunc(coalesceTimeout, func() {
-				sendCh <- struct{}{}
-			})
-		}
+		coalesceTimer = time.AfterFunc(coalesceTimeout, func() {
+			sendCh <- struct{}{}
+		})

 		// Do not reset coalesceTimer since we just queued a timer-based refresh
 		continue
```
Review comment: What's this sleep used for? Could it be removed to reduce the running time of the test?

Reply: This sleep is to ensure that we saturate the snap channel and get to the `default` case. Because of the check in http://github.com/hashicorp/consul/blob/f68b1a4a77741ae4c293571efd32b943971ec9ab/agent/proxycfg/state.go#L357-L357, if we don't wait between updates we only send once, which does not fill `snapCh` and get us to the `default` case.