
Segfault in 1.8.x and 1.7.5 with Consul Connect #8430

Closed
hamishforbes opened this issue Aug 3, 2020 · 3 comments · Fixed by #8440
Labels
crash theme/envoy/xds Related to Envoy support

Comments

hamishforbes (Contributor) commented Aug 3, 2020

I've just upgraded a test Kubernetes cluster from 1.7.1 to 1.8.1, and my agent instances are crashing over and over with a segfault.

I'm not sure which call is triggering this. There are a number of services in the cluster using Consul Connect, but without the consul agents running it's hard to isolate the cause: everything is crashing and restarting because no agent is available.

    2020-08-03T10:24:21.119Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x24b8dfc]

goroutine 131 [running]:
github.com/hashicorp/consul/agent/xds.(*Server).process(0xc00012a840, 0x37e3240, 0xc000656b70, 0xc000840660, 0x0, 0x0)
	/home/circleci/project/consul/agent/xds/server.go:334 +0x65c
github.com/hashicorp/consul/agent/xds.(*Server).StreamAggregatedResources(0xc00012a840, 0x37e3240, 0xc000656b70, 0x3781900, 0xc00012a840)
	/home/circleci/project/consul/agent/xds/server.go:171 +0xc5
github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2._AggregatedDiscoveryService_StreamAggregatedResources_Handler(0x30f5040, 0xc00012a840, 0x37d91c0, 0xc0008a2180, 0x53c8e90, 0xc000733a00)
	/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.9.5/envoy/service/discovery/v2/ads.pb.go:197 +0xad
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00, 0xc000535e00, 0x5365e80, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:1211 +0xd1e
google.golang.org/grpc.(*Server).handleStream(0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:1291 +0xcd6
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000857240, 0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:722 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:720 +0xa1

edit:

Managed to pin this down a bit more.
I can reliably reproduce the crash when any Consul Connect enabled service is running (e.g. an Envoy proxy operating as a sidecar, pulling config from the local agent via xDS) and the agent restarts.
I'm using consul-helm to set up my agents and the connect injector to set up the Envoy sidecars.

  • Start consul agent
  • Start consul connect enabled service
  • Everything is fine
  • Restart consul agent
  • Agent segfaults when Envoy makes a gRPC call

I've tested this with Consul 1.7.1, 1.7.3, 1.7.4, 1.7.5, and 1.8.1, each with the latest supported Envoy version (so 1.13.1 and 1.14.2, specifically the envoyproxy/envoy-alpine image variant).

Only 1.7.5 and 1.8.1 crash when the agent is restarted; 1.7.4 is fine.

The only commit between 1.7.4 and 1.7.5 that looks relevant is #8266.

On Consul versions that don't crash, and on versions that do crash but were already running when the Connect service started, I see the log entry below. Not sure if this is relevant, though; the consul-helm changelog has a note about an incorrect health check that will fail on 1.8.0, but I'm using consul-k8s:0.18.0 for all of these tests.

2020-08-03T16:30:50.972Z [ERROR] agent.proxycfg: watch error: id=service-http-checks:api-gateway-7f4f97b79f-dhsqm-api-gateway error="invalid type for service checks response: cache.FetchResult, want: []structs.CheckType"
@ghost added the crash label Aug 3, 2020
hamishforbes (Contributor, Author) commented Aug 3, 2020

edit: disregard! edited into main comment

@hamishforbes hamishforbes changed the title Segfault in 1.8.1 Segfault in 1.8.x and 1.7.5 with Consul Connect Aug 3, 2020
@dnephin dnephin added the theme/envoy/xds Related to Envoy support label Aug 5, 2020
dnephin (Contributor) commented Aug 5, 2020

The strange log message is reported as a bug in #7512, and fixed in master. I'll get the fix backported into a 1.8.x release.

The panic looks like it is on dereferencing req.Node, but I'm not sure how Node can be nil in that case. It'll require more investigation.
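The failure mode dnephin describes can be illustrated with a minimal sketch. The types below are hypothetical stand-ins, not the real go-control-plane structs; the point is that reading `req.Node` fields without a guard panics with exactly this kind of nil pointer dereference when Envoy omits `Node` on the first message of a new stream.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the xDS protobuf types; the real
// DiscoveryRequest comes from go-control-plane.
type Node struct{ ID string }
type DiscoveryRequest struct{ Node *Node }

// nodeID sketches the guard the server would need: dereferencing
// req.Node.ID directly panics when Node is nil, which is the
// restart scenario described in this issue.
func nodeID(req *DiscoveryRequest) (string, error) {
	if req.Node == nil {
		return "", errors.New("discovery request has no Node; cannot identify the proxy")
	}
	return req.Node.ID, nil
}

func main() {
	// First request on a fresh stream after an agent restart:
	// per envoyproxy/envoy#9682, Envoy never resends Node.
	if _, err := nodeID(&DiscoveryRequest{}); err != nil {
		fmt.Println("rejected:", err)
	}

	// A well-formed first request carries the Node identity.
	id, _ := nodeID(&DiscoveryRequest{Node: &Node{ID: "web-sidecar-proxy"}})
	fmt.Println("node:", id)
}
```

With the guard in place the server can reject or re-identify the stream instead of crashing the whole agent.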

rboyer (Member) commented Aug 5, 2020

From what I can tell it's related to this bug in envoy: envoyproxy/envoy#9682

We started setting the set_node_on_first_message_only flag to true in the bootstrap config generated with consul connect envoy in Consul 1.8.1 and 1.7.5. From what I can tell the following happens:

  1. consul (pid A) gets an xDS DiscoveryRequest from envoy with the Node field populated.
  2. consul (pid A) gets N>=0 more DiscoveryRequests from envoy with Node NOT set (the desired behavior when set_node_on_first_message_only=true).
  3. consul (pid A) is restarted and comes back as (pid B).
  4. consul (pid B) gets its first xDS DiscoveryRequest from envoy with the Node field not populated (not how set_node_on_first_message_only=true is supposed to work).

Until envoyproxy/envoy#9682 is fixed we'll have to go back to the less efficient configuration where we omitted the set_node_on_first_message_only flag and let it transmit the full Node structure on every request.
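For context, the flag in question lives in the ads_config section of the generated Envoy bootstrap. A rough sketch of the relevant fragment (field names follow Envoy's ApiConfigSource; the cluster name is a placeholder, not necessarily what `consul connect envoy` emits):

```yaml
dynamic_resources:
  ads_config:
    api_type: GRPC
    # The optimization being reverted: with this set, Envoy should send
    # Node only on the first DiscoveryRequest of each stream, but per
    # envoyproxy/envoy#9682 it never resends Node to a restarted control
    # plane. The workaround is to omit this flag entirely, so the full
    # Node structure is transmitted on every request.
    set_node_on_first_message_only: true
    grpc_services:
      - envoy_grpc:
          cluster_name: local_agent
```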

rboyer added a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process due to this
bug: envoyproxy/envoy#9682

Fixes #8430
rboyer added a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)

hashicorp-ci pushed a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)
