
Segfault in 1.8.x and 1.7.5 with Consul Connect #8430

Closed
hamishforbes opened this issue Aug 3, 2020 · 3 comments · Fixed by #8440
Labels
crash theme/envoy/xds Related to Envoy support

Comments

hamishforbes (Contributor) commented Aug 3, 2020

I've just upgraded a test Kubernetes cluster from 1.7.1 to 1.8.1, and my agent instances are crashing over and over with a segfault.

I'm not sure which call is triggering this. There are a number of services in the cluster using Consul Connect, but without the consul agents running it's hard to isolate the cause: everything is crashing and restarting because no agent is available.

    2020-08-03T10:24:21.119Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x24b8dfc]

goroutine 131 [running]:
github.com/hashicorp/consul/agent/xds.(*Server).process(0xc00012a840, 0x37e3240, 0xc000656b70, 0xc000840660, 0x0, 0x0)
	/home/circleci/project/consul/agent/xds/server.go:334 +0x65c
github.com/hashicorp/consul/agent/xds.(*Server).StreamAggregatedResources(0xc00012a840, 0x37e3240, 0xc000656b70, 0x3781900, 0xc00012a840)
	/home/circleci/project/consul/agent/xds/server.go:171 +0xc5
github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2._AggregatedDiscoveryService_StreamAggregatedResources_Handler(0x30f5040, 0xc00012a840, 0x37d91c0, 0xc0008a2180, 0x53c8e90, 0xc000733a00)
	/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.9.5/envoy/service/discovery/v2/ads.pb.go:197 +0xad
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00, 0xc000535e00, 0x5365e80, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:1211 +0xd1e
google.golang.org/grpc.(*Server).handleStream(0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:1291 +0xcd6
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000857240, 0xc000375380, 0x37e9240, 0xc000afa600, 0xc000733a00)
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:722 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.25.1/server.go:720 +0xa1

edit:

Managed to pin this down a bit more.
I can reliably reproduce the crash when any Consul Connect enabled service is running (e.g. an Envoy proxy operating as a sidecar, pulling config from the local agent via xDS) and the agent restarts.
I'm using consul-helm to set up my agents and the connect injector to set up the Envoy sidecars.

  • Start consul agent
  • Start consul connect enabled service
  • Everything is fine
  • Restart consul agent
  • Agent segfaults when Envoy makes a gRPC call

I've tested this with Consul 1.7.1, 1.7.3, 1.7.4, 1.7.5, and 1.8.1, each with the latest supported Envoy version (so 1.13.1 and 1.14.2, specifically the envoyproxy/envoy-alpine image variant).

Only 1.7.5 and 1.8.1 crash when the agent is restarted; 1.7.4 is fine.

The only commit between 1.7.4 and 1.7.5 that looks relevant is #8266.

On Consul versions that don't crash, and on versions that do crash but were already running when the Connect service started, I see the log entry below. Not sure if this is relevant, though; the consul-helm changelog has a note about an incorrect health check that will fail on 1.8.0, but I'm using consul-k8s:0.18.0 for all of these tests.

2020-08-03T16:30:50.972Z [ERROR] agent.proxycfg: watch error: id=service-http-checks:api-gateway-7f4f97b79f-dhsqm-api-gateway error="invalid type for service checks response: cache.FetchResult, want: []structs.CheckType"
@ghost added the crash label Aug 3, 2020
hamishforbes (Contributor, Author) commented Aug 3, 2020

edit: disregard! edited into main comment

@hamishforbes hamishforbes changed the title Segfault in 1.8.1 Segfault in 1.8.x and 1.7.5 with Consul Connect Aug 3, 2020
@dnephin dnephin added the theme/envoy/xds Related to Envoy support label Aug 5, 2020
dnephin (Contributor) commented Aug 5, 2020

The strange log message is reported as a bug in #7512, and fixed in master. I'll get the fix backported into a 1.8.x release.

The panic looks like it is on dereferencing req.Node, but I'm not sure how Node can be nil in that case. It'll require more investigation.
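The failure mode dnephin describes can be illustrated with a minimal sketch. The types below are hypothetical stand-ins, not the real go-control-plane structs; the point is that reading `req.Node` fields without a guard panics with exactly this kind of nil pointer dereference when Envoy omits `Node` on the first message of a new stream.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the xDS protobuf types; the real
// DiscoveryRequest comes from go-control-plane.
type Node struct{ ID string }
type DiscoveryRequest struct{ Node *Node }

// nodeID sketches the guard the server would need: dereferencing
// req.Node.ID directly panics when Node is nil, which is the
// restart scenario described in this issue.
func nodeID(req *DiscoveryRequest) (string, error) {
	if req.Node == nil {
		return "", errors.New("discovery request has no Node; cannot identify the proxy")
	}
	return req.Node.ID, nil
}

func main() {
	// First request on a fresh stream after an agent restart:
	// per envoyproxy/envoy#9682, Envoy never resends Node.
	if _, err := nodeID(&DiscoveryRequest{}); err != nil {
		fmt.Println("rejected:", err)
	}

	// A well-formed first request carries the Node identity.
	id, _ := nodeID(&DiscoveryRequest{Node: &Node{ID: "web-sidecar-proxy"}})
	fmt.Println("node:", id)
}
```

With the guard in place the server can reject or re-identify the stream instead of crashing the whole agent.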

rboyer (Member) commented Aug 5, 2020

From what I can tell it's related to this bug in envoy: envoyproxy/envoy#9682

We started setting the set_node_on_first_message_only flag to true in the bootstrap config generated with consul connect envoy in Consul 1.8.1 and 1.7.5. From what I can tell the following happens:

  1. consul (pid A) gets an xDS DiscoveryRequest from envoy with the Node field populated.
  2. consul (pid A) gets N>=0 more DiscoveryRequests from envoy with Node NOT set (the desired behavior when set_node_on_first_message_only=true).
  3. consul (pid A) is restarted and comes back as (pid B).
  4. consul (pid B) gets its first xDS DiscoveryRequest from envoy with the Node field not populated (not how set_node_on_first_message_only=true is supposed to work).

Until envoyproxy/envoy#9682 is fixed we'll have to go back to the less efficient configuration where we omitted the set_node_on_first_message_only flag and let it transmit the full Node structure on every request.
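For context, the flag in question lives in the ads_config section of the generated Envoy bootstrap. A rough sketch of the relevant fragment (field names follow Envoy's ApiConfigSource; the cluster name is a placeholder, not necessarily what `consul connect envoy` emits):

```yaml
dynamic_resources:
  ads_config:
    api_type: GRPC
    # The optimization being reverted: with this set, Envoy should send
    # Node only on the first DiscoveryRequest of each stream, but per
    # envoyproxy/envoy#9682 it never resends Node to a restarted control
    # plane. The workaround is to omit this flag entirely, so the full
    # Node structure is transmitted on every request.
    set_node_on_first_message_only: true
    grpc_services:
      - envoy_grpc:
          cluster_name: local_agent
```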

rboyer added a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process due to this
bug: envoyproxy/envoy#9682

Fixes #8430
rboyer added a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)

hashicorp-ci pushed a commit that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)
