Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition adding/updating listener #11833

Closed
lkysow opened this issue Dec 14, 2021 · 6 comments
Closed

Race condition adding/updating listener #11833

lkysow opened this issue Dec 14, 2021 · 6 comments
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected

Comments

@lkysow
Copy link
Member

lkysow commented Dec 14, 2021

Overview of the Issue

I can't reproduce this but creating a ticket to see if other people have run into the same issue.

One day I applied a proxy defaults config and my proxy went into this super tight loop where it was spamming errors and Consul was spamming these errors. Thousands of these messages every couple of seconds:

[ERROR] agent.envoy.xds: got error response from envoy proxy: service_id=frontend-sidecar-proxy \
typeUrl=type.googleapis.com/envoy.config.listener.v3.Listener xdsVersion=v3 nonce=000510fe \
error="rpc error: code = Internal desc = Error adding/updating listener(s) backend:127.0.0.1:6001: \
route: unknown cluster 'backend.default.dc1.internal.48b8d386-6855-4145-3bf2-90dd6fb29e96.consul

(newlines added for readability)

It looks like a race condition occurred where we were trying to XDS add a listener and expected a cluster to be there when it wasn't.

To fix it I had to restart the proxy.

Consul info for both Client and Server

Consul 1.10.4

@lkysow lkysow added theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected labels Dec 14, 2021
@fredwangwang
Copy link
Contributor

Hi @lkysow

we came across the same issue with Consul 1.10.3

We use nomad consul integration, after one particular job deployment (replace one service with another service), the ingress gateway started to spamming the error message:

{"@level":"error","@message":"got error response from envoy proxy","@module":"agent.envoy.xds.ingress_gateway","@timestamp":"2022-01-12T05:45:05.408164Z","error":"rpc error: code = Internal desc = route: unknown cluster '<new-service-name>.default.<dc-name>.internal.fe0699d0-86e3-6d4f-91d4-cefd95609e5e.consul'","nonce":"02a94654","service_id":"_nomad-task-08fc9334-9d1a-33db-784f-c53ee867bdd2-group-ingress-group-ingress-inbound","typeUrl":"type.googleapis.com/envoy.config.route.v3.RouteConfiguration","xdsVersion":"v3"}

Had to restart all ingress allocations after the issue happened.

@fredwangwang
Copy link
Contributor

@lkysow I was able to reliably reproduce:

  1. deploy ingress gateway (envoy sends initial cluster xds req, consul registers cluster handler as wildcard mode )
  2. restart consul agent
    1. this effectively disconnect envoy from consul xds, and when the consul comes back online again, envoy will reestablish xds request with non empty InitialResourceVersions and ResourceNamesSubscribe
    {"@level":"trace","@message":"Incremental xDS v3","@module":"agent.envoy.xds","@timestamp":"2022-01-18T17:00:03.024490Z","direction":"request","protobuf":"{\n  \"typeUrl\": \"type.googleapis.com/envoy.config.cluster.v3.Cluster\",\n  \"resourceNamesSubscribe\": [ NON-EMPTY-LIST-HERE ],\n  \"initialResourceVersions\": {\n    \"SOME-SERVICE.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul\": \"27bc809c5711d1b4c5a09958ae76d8268505827bb9cdc99ed094f6142a171700\" }\n}","xdsVersion":"v3"}
    
    1. consul wrongly? registers cluster handler as subscriptions (non wildcard mode) due to non empty ResourceNamesSubscribe
  3. deploy a new service that exposes through ingress gateway.
  4. Error: got error response from envoy proxy.... unknown cluster.
    1. this is due to the registered cluster handler for ingress is subscription based instead of wildcard, so it will not get any cluster information related to the newly added service.

It seems to me the all clusters (or endpoints, listeners..) updates info from cfg snap should make their way into the xds updates to envoy, since the cfg snap is already scoped to a particular sidecar/gateway. That would effectively mean using wildcard always for handlers.

I might be missing something, was there any reason it was not in such way?

@lkysow
Copy link
Member Author

lkysow commented Jan 19, 2022

@fredwangwang would you be able to provide a more detailed step by step reproduction with config files etc? I'm unable to reproduce this following your steps locally.

@fredwangwang
Copy link
Contributor

fredwangwang commented Jan 19, 2022

Hi @lkysow
here is a test case demostrate the failure:

func TestServer_DeltaAggregatedResources_v3_GetAllClusterAfterConsulRestarted(t *testing.T) {
	// This illustrates a scenario related to https://github.com/hashicorp/consul/issues/11833

	aclResolve := func(id string) (acl.Authorizer, error) {
		// Allow all
		return acl.RootAuthorizer("manage"), nil
	}
	scenario := newTestServerDeltaScenario(t, aclResolve, "web-sidecar-proxy", "", 0)
	_, mgr, errCh, envoy := scenario.server, scenario.mgr, scenario.errCh, scenario.envoy

	sid := structs.NewServiceID("web-sidecar-proxy", nil)

	// Register the proxy to create state needed to Watch() on
	mgr.RegisterProxy(t, sid)

	var snap *proxycfg.ConfigSnapshot
	runStep(t, "get into state after consul restarted", func(t *testing.T) {
		snap = newTestSnapshot(t, nil, "")

		// Send initial cluster discover.
		// This is to simulate the discovery request call from envoy after disconnected from consul ads stream.
		envoy.SendDeltaReq(t, ClusterType, &envoy_discovery_v3.DeltaDiscoveryRequest{
			ResourceNamesSubscribe: []string{
				"local_app",
				"db.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul",
			},
			InitialResourceVersions: map[string]string{
				"local_app": "a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447",
				"db.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul": "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
			},
		})

		// Check no response sent yet
		assertDeltaChanBlocked(t, envoy.deltaStream.sendCh)

		requireProtocolVersionGauge(t, scenario, "v3", 1)

		// Deliver a new snapshot
		// the config contains 3 clusters: local_app, db, geo-cache.
		// this is to simulate the fact that there is one additional (geo-cache upstream) cluster gets added to the sidecar service
		// during the time xds disconnected (consul restarted).
		mgr.DeliverConfig(t, sid, snap)

		assertDeltaResponseSent(t, envoy.deltaStream.sendCh, &envoy_discovery_v3.DeltaDiscoveryResponse{
			TypeUrl: ClusterType,
			Nonce:   hexString(1),
			Resources: makeTestResources(t,
				makeTestCluster(t, snap, "tcp:local_app"),
				makeTestCluster(t, snap, "tcp:db"),
				makeTestCluster(t, snap, "tcp:geo-cache"),
			),
		})
	})

	envoy.Close()
	select {
	case err := <-errCh:
		require.NoError(t, err)
	case <-time.After(50 * time.Millisecond):
		t.Fatalf("timed out waiting for handler to finish")
	}
}

Since we used nomad & consul integration, our setup involves deploying nomad jobs, which is too lengthy to provide in this issue.

If you would like to see the issue in action, I think we can arrange some time with @YuZhao-Zhang 's help. Let me know.

rboyer added a commit that referenced this issue Jan 25, 2022
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream,
prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field
with the full list of currently subscribed items, instead of simply omitting it
to infer that it wanted everything (which is what wildcard mode means).

This upstream issue was filed in envoyproxy/envoy#16063 and fixed in
envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later
versions (later refactored in envoyproxy/envoy#16855).

This PR conditionally forces LDS/CDS to be wildcard-only even when the
connected Envoy requests a non-wildcard subscription, but only does so on
versions prior to `1.19.0`, as we should not need to do this on later versions.

This fixes the failure case as described here: #11833 (comment)

Co-authored-by: Huan Wang <fredwanghuan@gmail.com>
hc-github-team-consul-core pushed a commit that referenced this issue Jan 25, 2022
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream,
prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field
with the full list of currently subscribed items, instead of simply omitting it
to infer that it wanted everything (which is what wildcard mode means).

This upstream issue was filed in envoyproxy/envoy#16063 and fixed in
envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later
versions (later refactored in envoyproxy/envoy#16855).

This PR conditionally forces LDS/CDS to be wildcard-only even when the
connected Envoy requests a non-wildcard subscription, but only does so on
versions prior to `1.19.0`, as we should not need to do this on later versions.

This fixes the failure case as described here: #11833 (comment)

Co-authored-by: Huan Wang <fredwanghuan@gmail.com>
rboyer added a commit that referenced this issue Jan 25, 2022
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream,
prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field
with the full list of currently subscribed items, instead of simply omitting it
to infer that it wanted everything (which is what wildcard mode means).

This upstream issue was filed in envoyproxy/envoy#16063 and fixed in
envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later
versions (later refactored in envoyproxy/envoy#16855).

This PR conditionally forces LDS/CDS to be wildcard-only even when the
connected Envoy requests a non-wildcard subscription, but only does so on
versions prior to `1.19.0`, as we should not need to do this on later versions.

This fixes the failure case as described here: #11833 (comment)

Co-authored-by: Huan Wang <fredwanghuan@gmail.com>
rboyer added a commit that referenced this issue Jan 25, 2022
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream,
prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field
with the full list of currently subscribed items, instead of simply omitting it
to infer that it wanted everything (which is what wildcard mode means).

This upstream issue was filed in envoyproxy/envoy#16063 and fixed in
envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later
versions (later refactored in envoyproxy/envoy#16855).

This PR conditionally forces LDS/CDS to be wildcard-only even when the
connected Envoy requests a non-wildcard subscription, but only does so on
versions prior to `1.19.0`, as we should not need to do this on later versions.

This fixes the failure case as described here: #11833 (comment)

Co-authored-by: Huan Wang <fredwanghuan@gmail.com>

Co-authored-by: Huan Wang <fredwanghuan@gmail.com>
@rboyer
Copy link
Member

rboyer commented Jan 25, 2022

A version of the PR by @fredwangwang was adapted and merged for release in the next set of patch releases of 1.11.3 and 1.10.8: #12174

@fredwangwang
Copy link
Contributor

@rboyer thanks for getting the fix ready for the next release!

@blake blake closed this as completed Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

4 participants