Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reset coalesceTimer to nil as soon as the event is consumed #11924

Merged
merged 6 commits into from
Jan 5, 2022

Conversation

dhiaayachi
Copy link
Contributor

This is fix #11923

@vercel vercel bot temporarily deployed to Preview – consul December 29, 2021 04:27 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 29, 2021 04:27 Inactive
@dhiaayachi dhiaayachi requested review from freddygv and a team December 29, 2021 04:33
@vercel vercel bot temporarily deployed to Preview – consul January 4, 2022 16:42 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 4, 2022 16:42 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 4, 2022 16:43 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 4, 2022 16:58 Inactive
@vercel vercel bot temporarily deployed to Preview – consul January 4, 2022 16:58 Inactive
Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find! I'm surprised this didn't come up during the scale tests where the issue from #9689 came up.

@@ -294,6 +294,8 @@ func (s *state) run(ctx context.Context, snap *ConfigSnapshot) {
}

case <-sendCh:
// Allow the next change to trigger a send
coalesceTimer = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is reset here, wouldn't the check in L322 always be true?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this timer is doing any more. The original intent was to prevent updates to multiple blocking queries that return at the same time from causing multiple reconfigurations of Envoy in quick succession.

For example on startup, a proxy might start watching 100 upstream services and config entries, without coalescing here, we would end up delivering O(100) entire snapshot reconfigurations to Envoy in the space of a few hundred milliseconds as all the blocking queries return their initial results.

It's not clear to me if this is actually doing that any more or if immediately resetting it here is causing it to effectively be removed? I've not traced this code for long enough to fully remember how it works though.

The other unknown for me is whether this coalescing is even needed/useful anymore since we implemented Delta XDS? My understanding is that now, even if we deliver multiple snapshots, the delta part of delta XDS will ensure we only send the changes in each one down so from Envoy's perspective it might not make any difference. It still could save some CPU cycles on the agent that is doing the delivering though which as we move towards Servers doing that could be significant. It's not clear that they will even use this package as it is though so I perhaps wouldn't optimize too much for that.

I'd be interested to know at any rate whether this change has a material impact on the amount of reconfigurations sent to envoy when there are a lot of upstreams, and if not, whether we should just remove all the coallesce code entirely and simplify this loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. I get it now. I missed the main place the timer is reset outside of the Select in the for loop way below (not in the default branch). Originally before the last deadlock fix it was always reset in this branch anyway: https://github.com/hashicorp/consul/blame/cf9a14ab6ad25e8932eb2b07616c0c33ddc54b12/agent/proxycfg/state.go#L614-L631

I think Freddy might be right that the conditional in the default case on 322 is redundant as it will always be true again now in that case, but this doesn't break all of coalescing because there is still the conditional outside the select that is the main place we throttle updates being delivered to sendCh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right the check is not needed

The use case I found the bug from have a lot of updates happening (specially for an ingress Kind) not sure why but saw a lot of updates happening per second for the same ingress service, I would guess that the coalesceTimeout is useful for this type of behaviours.

agent/proxycfg/manager_test.go Outdated Show resolved Hide resolved
agent/proxycfg/manager_test.go Outdated Show resolved Hide resolved
agent/proxycfg/manager_test.go Outdated Show resolved Hide resolved

// send two snapshots back to back without reading them to overflow the snapshot channel and get to the default use case
for i := 0; i < 2; i++ {
time.Sleep(250 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this sleep used for? Could it be removed to reduce the running time for the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sleep is to ensure that we saturate the snap channel and get to the default. because of the check in http://github.com/hashicorp/consul/blob/f68b1a4a77741ae4c293571efd32b943971ec9ab/agent/proxycfg/state.go#L357-L357 if we don't wait between updates we only send once which do not fill snapCh and get us to the default case.

Co-authored-by: Freddy <freddygv@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 5, 2022 15:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul January 5, 2022 15:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 5, 2022 15:29 Inactive
@vercel vercel bot temporarily deployed to Preview – consul January 5, 2022 15:29 Inactive
@dhiaayachi dhiaayachi merged commit e653f81 into main Jan 5, 2022
@dhiaayachi dhiaayachi deleted the xds_Timer_deadlock branch January 5, 2022 17:17
@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/540755.

@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/540783.

@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit e653f81 onto release/1.11.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jan 5, 2022
* reset `coalesceTimer` to nil as soon as the event is consumed

* add change log

* refactor to add relevant test.

* fix linter

* Apply suggestions from code review

Co-authored-by: Freddy <freddygv@users.noreply.github.com>

* remove non needed check

Co-authored-by: Freddy <freddygv@users.noreply.github.com>
@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit e653f81 onto release/1.10.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jan 5, 2022
* reset `coalesceTimer` to nil as soon as the event is consumed

* add change log

* refactor to add relevant test.

* fix linter

* Apply suggestions from code review

Co-authored-by: Freddy <freddygv@users.noreply.github.com>

* remove non needed check

Co-authored-by: Freddy <freddygv@users.noreply.github.com>
@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit e653f81 onto release/1.9.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jan 5, 2022
* reset `coalesceTimer` to nil as soon as the event is consumed

* add change log

* refactor to add relevant test.

* fix linter

* Apply suggestions from code review

Co-authored-by: Freddy <freddygv@users.noreply.github.com>

* remove non needed check

Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XDS snapshot updates are not correctly delivered after a first retry when certain conditions are met.
4 participants