Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xDS deadlock due to syncLoop termination. #20867

Merged
merged 2 commits into from
Mar 15, 2024

Conversation

hashi-derek
Copy link
Member

This fixes an issue where agentless xDS streams can deadlock permanently until a server is restarted. When this issue occurs, no new proxies are able to successfully connect to the server.

Effectively, the trigger for this deadlock stems from the following return statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire syncLoop() terminates and stops consuming from the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

Which results in the ConfigSource.cleanup() function never receiving a response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to process new xDS streams.


The fix to this issue involves removing the chan chan struct{} used like an RPC-over-channels pattern and replacing it with two distinct channels:

  • stopSyncLoopCh - indicates that the syncLoop() should terminate soon. + syncLoopDoneCh - indicates that the syncLoop() has terminated.

Splitting these two concepts out and deferring a close(syncLoopDoneCh) in the syncLoop() function ensures that the deadlock above should no longer occur.

We also now evict xDS connections of all proxies for the corresponding syncLoop() whenever it encounters an irrecoverable error. This is done by hoisting the new syncLoopDoneCh upwards so that it's visible to the xDS delta processing. Prior to this fix, the behavior was to simply orphan them so they would never receive catalog-registration or service-defaults updates.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.  +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
@hashi-derek hashi-derek requested a review from rboyer March 15, 2024 17:44
@github-actions github-actions bot added the theme/envoy/xds Related to Envoy support label Mar 15, 2024
@hashi-derek hashi-derek merged commit 0ac8ae6 into main Mar 15, 2024
85 checks passed
@hashi-derek hashi-derek deleted the derekm/NET-7834/fix-xds-deadlock branch March 15, 2024 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-backport theme/envoy/xds Related to Envoy support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants