Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reset
coalesceTimer
to nil as soon as the event is consumed #11924reset
coalesceTimer
to nil as soon as the event is consumed #11924Changes from 4 commits
3d43c08
b046d8c
c6c3784
f68b1a4
473cc8a
b1ce653
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this sleep used for? Could it be removed to reduce the running time for the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sleep is to ensure that we saturate the snap channel and get to the default. because of the check in http://github.com/hashicorp/consul/blob/f68b1a4a77741ae4c293571efd32b943971ec9ab/agent/proxycfg/state.go#L357-L357 if we don't wait between updates we only send once which do not fill
snapCh
and get us to thedefault
case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is reset here, wouldn't the check in L322 always be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what this timer is doing any more. The original intent was to prevent updates to multiple blocking queries that return at the same time from causing multiple reconfigurations of Envoy in quick succession.
For example on startup, a proxy might start watching 100 upstream services and config entries, without coalescing here, we would end up delivering O(100) entire snapshot reconfigurations to Envoy in the space of a few hundred milliseconds as all the blocking queries return their initial results.
It's not clear to me if this is actually doing that any more or if immediately resetting it here is causing it to effectively be removed? I've not traced this code for long enough to fully remember how it works though.
The other unknown for me is whether this coalescing is even needed/useful anymore since we implemented Delta XDS? My understanding is that now, even if we deliver multiple snapshots, the delta part of delta XDS will ensure we only send the changes in each one down so from Envoy's perspective it might not make any difference. It still could save some CPU cycles on the agent that is doing the delivering though which as we move towards Servers doing that could be significant. It's not clear that they will even use this package as it is though so I perhaps wouldn't optimize too much for that.
I'd be interested to know at any rate whether this change has a material impact on the amount of reconfigurations sent to envoy when there are a lot of upstreams, and if not, whether we should just remove all the coallesce code entirely and simplify this loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. I get it now. I missed the main place the timer is reset outside of the Select in the for loop way below (not in the default branch). Originally before the last deadlock fix it was always reset in this branch anyway: https://github.com/hashicorp/consul/blame/cf9a14ab6ad25e8932eb2b07616c0c33ddc54b12/agent/proxycfg/state.go#L614-L631
I think Freddy might be right that the conditional in the default case on 322 is redundant as it will always be true again now in that case, but this doesn't break all of coalescing because there is still the conditional outside the select that is the main place we throttle updates being delivered to
sendCh
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right the check is not needed
The use case I found the bug from have a lot of updates happening (specially for an ingress Kind) not sure why but saw a lot of updates happening per second for the same ingress service, I would guess that the
coalesceTimeout
is useful for this type of behaviours.