
Segfault when consul transitions from one leader to another #6104

Closed · tantra35 opened this issue Jul 10, 2019 · 3 comments · Fixed by #6115
Labels: needs-investigation (The issue described is detailed and complex.)

@tantra35 commented Jul 10, 2019

Consul 1.5.2

After memory pressure problems we upgraded the instance type to one with more memory capacity. After that, when Consul began transitioning to a new leader, we got a segfault with the following stack trace:

    2019/07/10 10:25:46 [WARN]  raft: Failed to contact bd865752-edb2-9f57-4510-80a658384162 in 500.209509ms
    2019/07/10 10:25:46 [WARN]  raft: Failed to contact 946d1af0-8ea7-52ce-adfa-7ac172945eec in 509.403689ms
    2019/07/10 10:25:46 [WARN]  raft: Failed to contact bd865752-edb2-9f57-4510-80a658384162 in 510.463603ms
    2019/07/10 10:25:46 [WARN]  raft: Failed to contact quorum of nodes, stepping down
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14cc742]
    goroutine 261984 [running]:
    github.com/hashicorp/consul/agent/consul.(*Replicator).Stop(0xc000102d90)
            /consul/agent/consul/replication.go:152 +0xf2
    github.com/hashicorp/consul/agent/consul.(*Server).stopConfigReplication(...)
            /consul/agent/consul/leader.go:899
    github.com/hashicorp/consul/agent/consul.(*Server).revokeLeadership(0xc0003ebc00)
            /consul/agent/consul/leader.go:348 +0x67
    github.com/hashicorp/consul/agent/consul.(*Server).leaderLoop(0xc0003ebc00, 0xc0008ab560)
            /consul/agent/consul/leader.go:223 +0x80a
    github.com/hashicorp/consul/agent/consul.(*Server).monitorLeadership.func1(0xc0003b0260, 0xc0003ebc00, 0xc0008ab560)
            /consul/agent/consul/leader.go:80 +0x5b
    created by github.com/hashicorp/consul/agent/consul.(*Server).monitorLeadership
            /consul/agent/consul/leader.go:78 +0x414

@pearkes added the needs-investigation label Jul 10, 2019
@tantra35 (Author) commented Jul 11, 2019

@pearkes We think this happens due to flapping between Start/Stop in the Replicator (our log verbosity is WARN, so we missed log messages like [INFO] replication: started and [INFO] replication: stopped).

So assigning cancel and ctx at Replicator object creation is wrong:
https://github.com/hashicorp/consul/blob/master/agent/consul/replication.go#L64

The ctx must be initialized in Start, not at Replicator object creation, so a quick fix may look like this:

diff --git a/agent/consul/replication.go b/agent/consul/replication.go
index 789df61e6..75e03b4d1 100644
--- a/agent/consul/replication.go
+++ b/agent/consul/replication.go
@@ -61,7 +61,7 @@ func NewReplicator(config *ReplicatorConfig) (*Replicator, error) {
        if config.Logger == nil {
                config.Logger = log.New(os.Stderr, "", log.LstdFlags)
        }
-       ctx, cancel := context.WithCancel(context.Background())
+
        limiter := rate.NewLimiter(rate.Limit(config.Rate), config.Burst)

        maxWait := config.MaxRetryWait
@@ -77,8 +77,6 @@ func NewReplicator(config *ReplicatorConfig) (*Replicator, error) {
        return &Replicator{
                name:      config.Name,
                running:   false,
-               cancel:    cancel,
-               ctx:       ctx,
                limiter:   limiter,
                waiter:    waiter,
                replicate: config.ReplicateFn,
@@ -94,6 +92,12 @@ func (r *Replicator) Start() {
                return
        }

+       if r.cancel == nil {
+               ctx, cancel := context.WithCancel(context.Background())
+               r.cancel = cancel
+               r.ctx = ctx
+       }
+
        go r.run()

        r.running = true
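
For illustration, a minimal self-contained repro of that lifecycle bug could look like the sketch below. This is a toy, not the actual Consul code; the names mirror the Replicator, and it assumes Stop() nils out cancel after cancelling, which is what the trace above points at. Running it panics with the same nil pointer dereference:

package main

import "context"

// Stripped-down Replicator lifecycle. The ctx/cancel pair is created
// exactly once, at construction time -- the bug described above.
type replicator struct {
	running bool
	ctx     context.Context
	cancel  context.CancelFunc
}

func newReplicator() *replicator {
	ctx, cancel := context.WithCancel(context.Background())
	return &replicator{ctx: ctx, cancel: cancel}
}

func (r *replicator) Start() {
	if r.running {
		return
	}
	// Nothing recreates r.cancel here, so a Start after a Stop leaves
	// the replicator "running" with a nil cancel func.
	r.running = true
}

func (r *replicator) Stop() {
	if !r.running {
		return
	}
	r.cancel() // nil on the second Stop -> the SIGSEGV in the trace above
	r.cancel = nil
	r.running = false
}

func main() {
	r := newReplicator()
	r.Start() // gained leadership
	r.Stop()  // lost leadership: cancel() runs, then cancel is set to nil
	r.Start() // regained leadership: cancel stays nil
	r.Stop()  // lost leadership again: panics here
}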

@mkeeler (Member) commented Jul 11, 2019

@tantra35 You are exactly right. We aren't recreating the replicator after losing leadership, so we are leaving the nil cancel func around.
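
A slightly more defensive variant of the quick fix above (a hypothetical sketch in the same stripped-down form as the repro, not necessarily what #6115 ships) recreates the context unconditionally on every Start, so a restart never depends on Stop having nilled out cancel:

package main

import (
	"context"
	"fmt"
)

// Same stripped-down lifecycle as the repro above, but Start rebuilds
// the ctx/cancel pair every time.
type replicator struct {
	running bool
	ctx     context.Context
	cancel  context.CancelFunc
}

func (r *replicator) Start() {
	if r.running {
		return
	}
	// After a Stop, the old context is cancelled and the old cancel func
	// may be nil; neither is safe to reuse, so rebuild both on every Start.
	r.ctx, r.cancel = context.WithCancel(context.Background())
	r.running = true
}

func (r *replicator) Stop() {
	if !r.running {
		return
	}
	r.cancel()
	r.cancel = nil
	r.running = false
}

func main() {
	r := &replicator{}
	// The flapping sequence from the issue now completes without panicking.
	r.Start()
	r.Stop()
	r.Start()
	r.Stop()
	fmt.Println("no panic: contexts are rebuilt on every Start")
}

The "if r.cancel == nil" guard in the diff above works because Stop nils out cancel; rebuilding unconditionally also covers a Stop that cancels without nilling, at the cost of one extra allocation per leadership gain.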

@mkeeler (Member) commented Jul 15, 2019

This made it into the second beta of 1.6.0 and will also be included in the final 1.6.0 release.
