Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: panic after raft applied incoming snapshot #4527

Closed
gyuho opened this issue Feb 15, 2016 · 6 comments
Closed

etcdserver: panic after raft applied incoming snapshot #4527

gyuho opened this issue Feb 15, 2016 · 6 comments

Comments

@gyuho
Copy link
Contributor

gyuho commented Feb 15, 2016

Mentioned at #4517.

tester:

2016/02/14 18:32:38 etcd-tester: [round#772 case#4] start failure kill one member for long time and expect it to recover from incoming snapshot
2016/02/14 18:32:38 etcd-tester: [round#772 case#4] start injecting failure..
2016/02/14 18:32:40 etcd-tester: [round#772 case#4] injected failure
2016/02/14 18:32:40 etcd-tester: [round#772 case#4] start recovering failure...
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] cleaning up...

agent2:

2016-02-14 18:32:42.349362 I | etcdserver: raft applied incoming snapshot at index 314332
unexpected fault address 0x7f502600900a
fatal error: fault
[signal 0x7 code=0x2 addr=0x7f502600900a pc=0x86b037]

goroutine 57 [running]:
runtime.throw(0xe8ba28, 0x5)
    /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc824c9d2e0 sp=0xc824c9d2c8
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:21 +0x10c fp=0xc824c9d330 sp=0xc824c9d2e0
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.(*freelist).read(0xc82504a090, 0x7f5026009000)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/freelist.go:162 +0x27 fp=0xc824c9d3d0 sp=0xc824c9d330
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.Open(0xc821be71e0, 0x19, 0x180, 0x1368500, 0xc821be7220, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/db.go:216 +0x5eb fp=0xc824c9d518 sp=0xc824c9d3d0
github.com/coreos/etcd/storage/backend.newBackend(0xc821be71e0, 0x19, 0x5f5e100, 0x2710, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:84 +0x73 fp=0xc824c9d5e8 sp=0xc824c9d518
github.com/coreos/etcd/storage/backend.NewDefaultBackend(0xc821be71e0, 0x19, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:80 +0x53 fp=0xc824c9d620 sp=0xc824c9d5e8
github.com/coreos/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc82007a000, 0xc82038aa80, 0xc8200a6e70)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:590 +0x597 fp=0xc824c9d8c8 sp=0xc824c9d620
github.com/coreos/etcd/etcdserver.(*EtcdServer).applyAll(0xc82007a000, 0xc82038aa80, 0xc8200a6e70)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:551 +0x4d fp=0xc824c9de40 sp=0xc824c9d8c8
github.com/coreos/etcd/etcdserver.(*EtcdServer).run.func2(0x7f5536a2d5e0, 0xc82038aa40)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:532 +0x32 fp=0xc824c9de60 sp=0xc824c9de40
github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc820228ba0)
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0x322 fp=0xc824c9df98 sp=0xc824c9de60
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1 fp=0xc824c9dfa0 sp=0xc824c9df98
created by github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x275

goroutine 1 [chan receive]:
github.com/coreos/etcd/etcdmain.Main()
    /home/gyuho/go/src/github.com/coreos/etcd/etcdmain/etcd.go:192 +0x195b
main.main()
    /home/gyuho/go/src/github.com/coreos/etcd/main.go:29 +0x14

Full logs: https://storage.googleapis.com/failure-archive/agent2/2016-02-14T18%3A34%3A44Z/etcd.log

  1. agent1 log: https://storage.googleapis.com/failure-archive/agent1/2016-02-14T18%3A34%3A44Z/etcd.log
  2. agent3 log: https://storage.googleapis.com/failure-archive/agent3/2016-02-14T18%3A34%3A44Z/etcd.log
@gyuho
Copy link
Contributor Author

gyuho commented Feb 16, 2016

@xiang90 There were other cases with the same panics. Here's another example with data-dir (timestamp is 2016-02-15T21:23:31Z):

  1. https://storage.googleapis.com/issue4527/agent1/etcd.log
  2. https://storage.googleapis.com/issue4527/agent2/etcd.log
  3. https://storage.googleapis.com/issue4527/agent3/etcd.log (this is the one that got panic)

@xiang90 xiang90 self-assigned this Feb 16, 2016
@xiang90
Copy link
Contributor

xiang90 commented Feb 16, 2016

@gyuho Thanks!

@gyuho
Copy link
Contributor Author

gyuho commented Feb 16, 2016

Related events (leader failure was happening):

2016/02/15 21:21:20 etcd-tester: [round#1107 case#0] start failure kill the leader for long time and expect it to recover from incoming snapshot
2016/02/15 21:21:20 etcd-tester: [round#1107 case#0] start injecting failure...
2016/02/15 21:21:20 etcd-tester: stopping 10.240.0.4:2378

@xiang90
Copy link
Contributor

xiang90 commented Feb 16, 2016

@gyuho I am pretty sure boltdb data is corrupted for some reason. I will open an issue in boltdb.

@gyuho
Copy link
Contributor Author

gyuho commented Feb 16, 2016

Xiang filed an upstream issue here boltdb/bolt#513.

@gyuho
Copy link
Contributor Author

gyuho commented Feb 17, 2016

Got one more panic overnight:

tester log (killing the leader agent2):

2016/02/17 08:14:14 etcd-tester: [round#587 case#0] start failure kill the leader for long time and expect it to recover from incoming snapshot
2016/02/17 08:14:14 etcd-tester: [round#587 case#0] start injecting failure...
2016/02/17 08:14:14 etcd-tester: stopping leader 10.240.0.3:2378
2016/02/17 08:14:20 etcd-tester: [round#587 case#0] injected failure
2016/02/17 08:14:20 etcd-tester: [round#587 case#0] start recovering failure...
2016/02/17 08:16:21 etcd-tester: [round#587 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/17 08:16:21 etcd-tester: [round#587 case#0] cleaning up...

agent2 log:

2016-02-17 08:14:23.514270 I | raft: 984a40a518039d1a [term: 7] received a MsgHeartbeat message with higher term from 4c527f0852849c1a [term: 8]
2016-02-17 08:14:23.514298 I | raft: 984a40a518039d1a became follower at term 8
2016-02-17 08:14:23.514347 I | raft: raft.node: 984a40a518039d1a elected leader 4c527f0852849c1a at term 8
2016-02-17 08:14:23.514755 I | rafthttp: the connection with 80069f82adac4c07 became active
2016-02-17 08:14:23.524155 I | raft: found conflict at index 1147474 [existing term: 7, conflicting term: 8]
2016-02-17 08:14:23.524198 I | raft: replace the unstable entries from index 1147474
2016-02-17 08:14:23.617058 I | rafthttp: receiving database snapshot [index:1166483, from 4c527f0852849c1a] ...
2016-02-17 08:14:24.137291 I | etcdserver: start to snapshot (applied: 1155241, lastsnap: 1139683)
2016-02-17 08:14:24.147715 I | etcdserver: saved snapshot at index 1155241
2016-02-17 08:14:24.147910 I | etcdserver: compacted raft log at 1150241
2016-02-17 08:14:25.322312 I | snap: saved database snapshot to disk [total bytes: 289439744]
2016-02-17 08:14:25.322361 I | rafthttp: received and saved database snapshot [index: 1166483, from: 4c527f0852849c1a] successfully
2016-02-17 08:14:25.322516 I | raft: 984a40a518039d1a [commit: 1155241, lastindex: 1155241, lastterm: 8] starts to restore snapshot [index: 1166483, term: 8]
2016-02-17 08:14:25.322571 I | raft: log [committed=1155241, applied=1155241, unstable.offset=1155242, len(unstable.Entries)=0] starts to restore snapshot [index: 1166483, term: 8]
2016-02-17 08:14:25.322599 I | raft: 984a40a518039d1a restored progress of 4c527f0852849c1a [next = 1166484, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-02-17 08:14:25.322620 I | raft: 984a40a518039d1a restored progress of 80069f82adac4c07 [next = 1166484, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-02-17 08:14:25.322634 I | raft: 984a40a518039d1a restored progress of 984a40a518039d1a [next = 1166484, match = 1166483, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-02-17 08:14:25.322647 I | raft: 984a40a518039d1a [commit: 1166483] restored snapshot [index: 1166483, term: 8]
2016-02-17 08:14:25.324838 I | etcdserver: raft applied incoming snapshot at index 1166483
unexpected fault address 0x7fa1c140a00a
fatal error: fault
[signal 0x7 code=0x2 addr=0x7fa1c140a00a pc=0x86b297]

goroutine 101 [running]:
runtime.throw(0xe8ba28, 0x5)
    /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc827fb12e0 sp=0xc827fb12c8
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:21 +0x10c fp=0xc827fb1330 sp=0xc827fb12e0
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.(*freelist).read(0xc826750840, 0x7fa1c140a000)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/freelist.go:162 +0x27 fp=0xc827fb13d0 sp=0xc827fb1330
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.Open(0xc826d2bc20, 0x19, 0x180, 0x1368500, 0xc826d2bc40, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/db.go:216 +0x5eb fp=0xc827fb1518 sp=0xc827fb13d0
github.com/coreos/etcd/storage/backend.newBackend(0xc826d2bc20, 0x19, 0x5f5e100, 0x2710, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:84 +0x73 fp=0xc827fb15e8 sp=0xc827fb1518
github.com/coreos/etcd/storage/backend.NewDefaultBackend(0xc826d2bc20, 0x19, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:80 +0x53 fp=0xc827fb1620 sp=0xc827fb15e8
github.com/coreos/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc820198000, 0xc825d32b80, 0xc820094bb0)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:590 +0x597 fp=0xc827fb18c8 sp=0xc827fb1620
github.com/coreos/etcd/etcdserver.(*EtcdServer).applyAll(0xc820198000, 0xc825d32b80, 0xc820094bb0)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:551 +0x4d fp=0xc827fb1e40 sp=0xc827fb18c8
github.com/coreos/etcd/etcdserver.(*EtcdServer).run.func2(0x7fa6c55493b8, 0xc825d32b40)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:532 +0x32 fp=0xc827fb1e60 sp=0xc827fb1e40
github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc820016de0)
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0x322 fp=0xc827fb1f98 sp=0xc827fb1e60
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1 fp=0xc827fb1fa0 sp=0xc827fb1f98
created by github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x275

goroutine 1 [chan receive]:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants