Fix xDS deadlock due to syncLoop termination. #20867
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This fixes an issue where agentless xDS streams can deadlock permanently until a server is restarted. When this issue occurs, no new proxies are able to successfully connect to the server.
Effectively, the trigger for this deadlock stems from the following return statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202
When this happens, the entire
syncLoop()
terminates and stops consuming from the following channel:https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192
Which results in the
ConfigSource.cleanup()
function never receiving a response and holding a mutex indefinitely:https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247
Because this mutex is shared, it effectively deadlocks the server's ability to process new xDS streams.
The fix to this issue involves removing the
chan chan struct{}
used like an RPC-over-channels pattern and replacing it with two distinct channels:stopSyncLoopCh
- indicates that thesyncLoop()
should terminate soon. +syncLoopDoneCh
- indicates that thesyncLoop()
has terminated.Splitting these two concepts out and deferring a
close(syncLoopDoneCh)
in thesyncLoop()
function ensures that the deadlock above should no longer occur.We also now evict xDS connections of all proxies for the corresponding
syncLoop()
whenever it encounters an irrecoverable error. This is done by hoisting the newsyncLoopDoneCh
upwards so that it's visible to the xDS delta processing. Prior to this fix, the behavior was to simply orphan them so they would never receive catalog-registration or service-defaults updates.