
hashicorp-consul-server-0 pod is dying with SIGSEGV #9600

Closed
aluchko opened this issue Jan 20, 2021 · 1 comment


aluchko commented Jan 20, 2021

I've encountered this issue a few times. I have Consul deployed on my minikube cluster using the Helm chart listed in the tutorial (though I also enabled syncCatalog).
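For reference, the syncCatalog toggle lives in the Helm values file passed to `helm install`. A minimal sketch matching the settings visible in the logs below (single server, datacenter `minidc`); the exact file contents here are an assumption, not the reporter's actual values file:

```yaml
# helm-consul-values.yaml (minimal sketch, assumed contents)
global:
  datacenter: minidc
server:
  replicas: 1
  bootstrapExpect: 1
syncCatalog:
  enabled: true
```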

Things seem to work well for a while, but at some point, often after restarting minikube, the Consul server gets into a crash loop:

kubectl logs hashicorp-consul-server-0
==> Starting Consul agent...
           Version: '1.9.1'
           Node ID: '2b24045a-08bd-14cb-669c-4f6e4177a10d'
         Node name: 'hashicorp-consul-server-0'
        Datacenter: 'minidc' (Segment: '<all>')
            Server: true (Bootstrap: true)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 172.17.0.5 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2021-01-20T17:17:22.452Z [WARN]  agent: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
    2021-01-20T17:17:22.452Z [WARN]  agent: bootstrap = true: do not enable unless necessary
    2021-01-20T17:17:22.558Z [WARN]  agent.auto_config: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
    2021-01-20T17:17:22.558Z [WARN]  agent.auto_config: bootstrap = true: do not enable unless necessary
    2021-01-20T17:17:22.665Z [INFO]  agent.server.raft: restored from snapshot: id=8-16385-1611097981735
    2021-01-20T17:17:22.955Z [INFO]  agent.server.raft: initial configuration: index=18522 servers="[{Suffrage:Voter ID:2b24045a-08bd-14cb-669c-4f6e4177a10d Address:172.17.0.8:8300}]"
    2021-01-20T17:17:22.955Z [INFO]  agent.server.raft: entering follower state: follower="Node at 172.17.0.5:8300 [Follower]" leader=
    2021-01-20T17:17:22.956Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: hashicorp-consul-server-0.minidc 172.17.0.5
    2021-01-20T17:17:22.956Z [WARN]  agent.server.serf.wan: serf: Failed to re-join any previously known node
    2021-01-20T17:17:22.957Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: hashicorp-consul-server-0 172.17.0.5
    2021-01-20T17:17:22.957Z [INFO]  agent.router: Initializing LAN area manager
    2021-01-20T17:17:22.957Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: minikube: 172.17.0.4:8301
    2021-01-20T17:17:22.957Z [INFO]  agent.server: Handled event for server in area: event=member-join server=hashicorp-consul-server-0.minidc area=wan
    2021-01-20T17:17:22.957Z [INFO]  agent.server: Adding LAN server: server="hashicorp-consul-server-0 (Addr: tcp/172.17.0.5:8300) (DC: minidc)"
    2021-01-20T17:17:22.957Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
    2021-01-20T17:17:22.957Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
    2021-01-20T17:17:23.051Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: minikube 172.17.0.4
    2021-01-20T17:17:23.051Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
    2021-01-20T17:17:23.150Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
    2021-01-20T17:17:23.052Z [INFO]  agent.server.serf.lan: serf: Re-joined to previously known node: minikube: 172.17.0.4:8301
    2021-01-20T17:17:23.150Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
    2021-01-20T17:17:23.250Z [INFO]  agent: Joining cluster...: cluster=LAN
    2021-01-20T17:17:23.250Z [INFO]  agent: (LAN) joining: lan_addresses=[hashicorp-consul-server-0.hashicorp-consul-server.default.svc:8301]
    2021-01-20T17:17:23.250Z [INFO]  agent: started state syncer
==> Consul agent running!
    2021-01-20T17:17:23.350Z [INFO]  agent: (LAN) joined: number_of_nodes=1
    2021-01-20T17:17:23.350Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=1
    2021-01-20T17:17:30.356Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
    2021-01-20T17:17:30.684Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
    2021-01-20T17:17:30.684Z [INFO]  agent.server.raft: entering candidate state: node="Node at 172.17.0.5:8300 [Candidate]" term=51
    2021-01-20T17:17:30.704Z [INFO]  agent.server.raft: election won: tally=1
    2021-01-20T17:17:30.704Z [INFO]  agent.server.raft: entering leader state: leader="Node at 172.17.0.5:8300 [Leader]"
    2021-01-20T17:17:30.704Z [INFO]  agent.server: cluster leadership acquired
    2021-01-20T17:17:30.705Z [INFO]  agent.server: New leader elected: payload=hashicorp-consul-server-0
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7ad532]

goroutine 109 [running]:
github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0xc000f73820, 0xc000f72fa0, 0x0, 0x0, 0xc000908fd0, 0x0, 0xffffffffffffffff)
	/go/pkg/mod/github.com/hashicorp/go-immutable-radix@v1.3.0/iter.go:178 +0xb2
github.com/hashicorp/go-memdb.(*radixIterator).Next(0xc000908d50, 0xc0006b4e20, 0x485b)
	/go/pkg/mod/github.com/hashicorp/go-memdb@v1.3.0/txn.go:895 +0x2e
github.com/hashicorp/consul/agent/consul/state.cleanupMeshTopology(0x38f5800, 0xc0006b4e20, 0x485b, 0xc000603b00, 0x485b, 0xc000603c70)
	/home/circleci/project/consul/agent/consul/state/catalog.go:3271 +0x36c
github.com/hashicorp/consul/agent/consul/state.(*Store).deleteServiceTxn(0xc00089e6c0, 0x38f5800, 0xc0006b4e20, 0x485b, 0xc000e75728, 0x8, 0xc00020c5a0, 0x49, 0xc000603c70, 0x0, ...)
	/home/circleci/project/consul/agent/consul/state/catalog.go:1542 +0x8c5
github.com/hashicorp/consul/agent/consul/state.(*Store).deleteNodeTxn(0xc00089e6c0, 0x38f5800, 0xc0006b4e20, 0x485b, 0xc000e75728, 0x8, 0xb25ddc, 0xc000f9b860)
	/home/circleci/project/consul/agent/consul/state/catalog.go:715 +0x62d
github.com/hashicorp/consul/agent/consul/state.(*Store).DeleteNode(0xc00089e6c0, 0x485b, 0xc000e75728, 0x8, 0x0, 0x0)
	/home/circleci/project/consul/agent/consul/state/catalog.go:648 +0xbb
github.com/hashicorp/consul/agent/consul/fsm.(*FSM).applyDeregister(0xc000c879e0, 0xc000979641, 0x3c, 0x3c, 0x485b, 0x0, 0x0)
	/home/circleci/project/consul/agent/consul/fsm/commands_oss.go:171 +0x41a
github.com/hashicorp/consul/agent/consul/fsm.NewFromDeps.func1(0xc000979641, 0x3c, 0x3c, 0x485b, 0xc0001376d0, 0xc000999c80)
	/home/circleci/project/consul/agent/consul/fsm/fsm.go:99 +0x56
github.com/hashicorp/consul/agent/consul/fsm.(*FSM).Apply(0xc000c879e0, 0xc000b98aa0, 0x0, 0x0)
	/home/circleci/project/consul/agent/consul/fsm/fsm.go:133 +0x1b6
github.com/hashicorp/go-raftchunking.(*ChunkingFSM).Apply(0xc000c8b740, 0xc000b98aa0, 0x5191aa0, 0xbffa374b43333586)
	/go/pkg/mod/github.com/hashicorp/go-raftchunking@v0.6.1/fsm.go:66 +0x5b
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000963050)
	/go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000b96000, 0x40, 0x40)
	/go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc00025ec00)
	/go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc00025ec00, 0xc000da3c80)
	/go/pkg/mod/github.com/hashicorp/raft@v1.2.0/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	/go/pkg/mod/github.com/hashicorp/raft@v1.2.0/state.go:144 +0x66
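For readers unfamiliar with this panic class: in Go, dereferencing a nil pointer inside a method raises exactly this recoverable runtime error, which crashes the process with SIGSEGV when nothing recovers it. A minimal, hypothetical sketch of the pattern (these types are illustrative only, not Consul's or go-immutable-radix's code):

```go
package main

import "fmt"

type node struct{ leaf *string }

type iterator struct{ current *node }

// Next dereferences the internal node pointer without a nil check,
// mirroring the failure mode in the trace above.
func (it *iterator) Next() (string, bool) {
	return *it.current.leaf, true // panics when it.current is nil
}

// advance recovers the panic and returns its message for inspection.
func advance(it *iterator) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	it.Next()
	return ""
}

func main() {
	it := &iterator{} // current left nil, like a stale iterator over freed state
	fmt.Println(advance(it))
	// → runtime error: invalid memory address or nil pointer dereference
}
```

Without the `recover`, the runtime prints the goroutine dump seen above and the pod exits, which is why Kubernetes reports a crash loop.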

The only way I've found to fix this is to uninstall Consul via Helm, delete the persistent volume, and then reinstall Consul into the cluster:

$ helm uninstall hashicorp
$ kubectl delete -n default persistentvolumeclaim data-default-hashicorp-consul-server-0
persistentvolumeclaim "data-default-hashicorp-consul-server-0" deleted
$ kubectl delete persistentvolume pvc-066ace39-2807-4c53-8b1f-91cc6e9a5f51
persistentvolume "pvc-066ace39-2807-4c53-8b1f-91cc6e9a5f51" deleted
$ helm install hashicorp hashicorp/consul -f helm-consul-values.yaml 

I saved the problematic hostpath-provisioner directory, but I'm hesitant to upload it since I don't know what data it contains.

@ghost ghost added the crash label Jan 20, 2021
dnephin (Contributor) commented Jan 20, 2021

Thank you for the bug report! This looks like the same issue as #9566. We just finished fixing this bug, and a release with the fix will be going out soon.
