Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nil pointer dereference in bbolt 1.4.0-alpha.0 detected by robustness tests #715

Closed
serathius opened this issue Apr 5, 2024 · 3 comments

Comments

@serathius
Copy link
Member

https://github.com/etcd-io/etcd/actions/runs/8552395882/job/23433409009

024-04-04T10:09:46.3430327Z: panic: runtime error: invalid memory address or nil pointer dereference
2024-04-04T10:09:46.3432462Z: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2093]
2024-04-04T10:09:46.3435140Z: 
2024-04-04T10:09:46.3438693Z: goroutine 1566 [running]:
2024-04-04T10:09:46.3442546Z: go.etcd.io/bbolt.(*Tx).Bucket(...)
2024-04-04T10:09:46.3445048Z: 	go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
2024-04-04T10:09:46.3448782Z: go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00150e690, {0x12b7700, 0x1a723c0}, {0xc000478060, 0x11, 0x12}, {0x0, 0x0, 0x0}, 0x0)
2024-04-04T10:09:46.3452428Z: 	go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
2024-04-04T10:09:46.3455617Z: go.etcd.io/etcd/server/v3/storage/mvcc.(*storeTxnCommon).rangeKeys(0xc0011662c0, {0x12b6ee8, 0xc000fce240}, {0xc001114040, 0x4, 0x8}, {0x0, 0x0, 0x0}, 0x27c, ...)
2024-04-04T10:09:46.3458120Z: 	go.etcd.io/etcd/server/v3/storage/mvcc/kvstore_txn.go:108 +0x6c6
2024-04-04T10:09:46.3460119Z: go.etcd.io/etcd/server/v3/storage/mvcc.(*storeTxnCommon).Range(...)
2024-04-04T10:09:46.3462441Z: 	go.etcd.io/etcd/server/v3/storage/mvcc/kvstore_txn.go:69
2024-04-04T10:09:46.3466458Z: go.etcd.io/etcd/server/v3/storage/mvcc.(*metricsTxnWrite).Range(0x10?, {0x12b6ee8?, 0xc000fce240?}, {0xc001114040?, 0xc0004a4300?, 0xc0009130d8?}, {0x0?, 0x30?, 0x10?}, {0x0, ...})
2024-04-04T10:09:46.3470534Z: 	go.etcd.io/etcd/server/v3/storage/mvcc/metrics_txn.go:41 +0x65
2024-04-04T10:09:46.3474341Z: go.etcd.io/etcd/server/v3/etcdserver/txn.executeRange({0x12b6ee8, 0xc000fce240}, 0xc00033ea00, {0x12b7658, 0xc00116a780}, 0xc000c04000)
2024-04-04T10:09:46.3478295Z: 	go.etcd.io/etcd/server/v3/etcdserver/txn/txn.go:171 +0x171
2024-04-04T10:09:46.3482021Z: go.etcd.io/etcd/server/v3/etcdserver/txn.Range({0x12b6ee8, 0xc000fce240}, 0xc00033ea00, {0x12c0aa0, 0xc000328be0}, 0xc000c04000)
2024-04-04T10:09:46.3485589Z: 	go.etcd.io/etcd/server/v3/etcdserver/txn/txn.go:143 +0x1ee
2024-04-04T10:09:46.3488734Z: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Range.func3()
2024-04-04T10:09:46.3491860Z: 	go.etcd.io/etcd/server/v3/etcdserver/v3_server.go:134 +0x92
2024-04-04T10:09:46.3495617Z: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc00038e308, {0x12b6ee8, 0xc000fce240}, 0xc0009134f0, 0xc0009134b8)
2024-04-04T10:09:46.3499253Z: 	go.etcd.io/etcd/server/v3/etcdserver/v3_server.go:728 +0x20e
2024-04-04T10:09:46.3502700Z: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Range(0xc00038e308, {0x12b6ee8, 0xc000fce1e0}, 0xc000c04000)
2024-04-04T10:09:46.3506107Z: 	go.etcd.io/etcd/server/v3/etcdserver/v3_server.go:135 +0x5ce
2024-04-04T10:09:46.3509795Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Range(0xc0002403c0, {0x12b6ee8?, 0xc000fce1e0?}, 0xc000c04000?)
2024-04-04T10:09:46.3513443Z: 	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/key.go:46 +0xfe
2024-04-04T10:09:46.3517326Z: go.etcd.io/etcd/api/v3/etcdserverpb._KV_Range_Handler.func1({0x12b6ee8?, 0xc000fce1e0?}, {0x10d0920?, 0xc000c04000?})
2024-04-04T10:09:46.3521243Z: 	go.etcd.io/etcd/api/v3@v3.6.0-alpha.0/etcdserverpb/rpc.pb.go:6620 +0xcb
2024-04-04T10:09:46.3525639Z: github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func2({0x12b6ee8, 0xc000fce1e0}, {0x10d0920, 0xc000c04000}, 0x110b61f?, 0xc001502150)
2024-04-04T10:09:46.3529856Z: 	github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0x7d
2024-04-04T10:09:46.3534041Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainUnaryServer.func5.1.1({0x12b6ee8?, 0xc000fce1e0?}, {0x10d0920?, 0xc000c04000?})
2024-04-04T10:09:46.3537912Z: 	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x34
2024-04-04T10:09:46.3541860Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newUnaryInterceptor.func4({0x12b6ee8, 0xc000fce1e0}, {0x10d0920, 0xc000c04000}, 0xc000bb17a0?, 0xc00092e060)
2024-04-04T10:09:46.3544201Z: 	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:74 +0x1f1
2024-04-04T10:09:46.3547410Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainUnaryServer.func5.1.1({0x12b6ee8?, 0xc000fce1e0?}, {0x10d0920?, 0xc000c04000?})
2024-04-04T10:09:46.3551301Z: 	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x34
2024-04-04T10:09:46.3555547Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newLogUnaryInterceptor.func3({0x12b6ee8, 0xc000fce1e0}, {0x10d0920, 0xc000c04000}, 0xc00092e040, 0xc00092e080)
2024-04-04T10:09:46.3559914Z: 	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:81 +0xb2
2024-04-04T10:09:46.3564002Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainUnaryServer.func5.1.1({0x12b6ee8?, 0xc000fce1e0?}, {0x10d0920?, 0xc000c04000?})
2024-04-04T10:09:46.3567948Z: 	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x34
2024-04-04T10:09:46.3572230Z: go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainUnaryServer.func5({0x12b6ee8, 0xc000fce1e0}, {0x10d0920, 0xc000c04000}, 0x4120e5?, 0x80?)
2024-04-04T10:09:46.3576286Z: 	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xb8
2024-04-04T10:09:46.3580119Z: go.etcd.io/etcd/api/v3/etcdserverpb._KV_Range_Handler({0xffd1e0, 0xc000240400}, {0x12b6ee8, 0xc000fce1e0}, 0xc000172180, 0xc000242780)
2024-04-04T10:09:46.3583732Z: 	go.etcd.io/etcd/api/v3@v3.6.0-alpha.0/etcdserverpb/rpc.pb.go:6622 +0x143
2024-04-04T10:09:46.3586189Z: google.golang.org/grpc.(*Server).processUnaryRPC(0xc000183400, {0x12b6ee8, 0xc000fce150}, {0x12bf7e0, 0xc00027cb60}, 0xc000ba8000, 0xc000242900, 0x1a7cce0, 0x0)
2024-04-04T10:09:46.3588935Z: 	google.golang.org/grpc@v1.62.1/server.go:1386 +0xdf8
2024-04-04T10:09:46.3590811Z: google.golang.org/grpc.(*Server).handleStream(0xc000183400, {0x12bf7e0, 0xc00027cb60}, 0xc000ba8000)
2024-04-04T10:09:46.3592710Z: 	google.golang.org/grpc@v1.62.1/server.go:1797 +0xe87
2024-04-04T10:09:46.3594369Z: google.golang.org/grpc.(*Server).serveStreams.func2.1()
2024-04-04T10:09:46.3596008Z: 	google.golang.org/grpc@v1.62.1/server.go:1027 +0x8b
2024-04-04T10:09:46.3598177Z: created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 296
2024-04-04T10:09:46.3599927Z: 	google.golang.org/grpc@v1.62.1/server.go:1038 +0x125

cc @ahrtr

@serathius
Copy link
Member Author

Might be an etcd issue with how bbolt is used, haven't investigated yet.

@ahrtr
Copy link
Member

ahrtr commented Apr 5, 2024

It's similar to etcd-io/etcd#17223.

During etcd shutting down, etcd may panic due to race condition (e.g. the backend transaction has already been closed) if

High level solution/idea... when terminating an etcd instance,

  • Firstly stop receiving any client & peer requests (e.g stop all the gRPC servers and http servers)
  • Wait for all the existing requests (either from client or from peer) to finish. We may also need a timeout, e.g. 30s.
  • Shutdown all internal services, e.g. the backend.

@serathius
Copy link
Member Author

Right, looks like duplicate of etcd-io/etcd#17223

fuweid added a commit to fuweid/etcd that referenced this issue Apr 9, 2024
We should exit as soon as possible if there is panic during defrag.
Because that tx might be closed. The inflight request might use invalid
tx and then panic as well. However, the real panic might be shadowed by
new panic. It's related to goroutine schedule. So, we should fatal here.

How to reproduce bbolt#715:

```diff
diff --git a/server/etcdserver/api/v3rpc/maintenance.go b/server/etcdserver/api/v3rpc/maintenance.go
index 4f55c7c74..0a91b4225 100644
--- a/server/etcdserver/api/v3rpc/maintenance.go
+++ b/server/etcdserver/api/v3rpc/maintenance.go
@@ -89,7 +89,13 @@ func NewMaintenanceServer(s *etcdserver.EtcdServer, healthNotifier notifier) pb.
 func (ms *maintenanceServer) Defragment(ctx context.Context, sr *pb.DefragmentRequest) (*pb.DefragmentResponse, error) {
        ms.lg.Info("starting defragment")
        ms.healthNotifier.defragStarted()
-       defer ms.healthNotifier.defragFinished()
+       defer func() {
+               ms.healthNotifier.defragFinished()
+               if rerr := recover(); rerr != nil {
+                       time.Sleep(30 * time.Second)
+                       panic(rerr)
+               }
+       }()
        err := ms.bg.Backend().Defrag()
        if err != nil {
                ms.lg.Warn("failed to defragment", zap.Error(err))
```

```bash
$ make gofail-enable
$ make
$ GOFAIL_HTTP="127.0.0.1:1234" bin/etcd

$ New Terminal
$ curl http://127.0.0.1:1234/defragBeforeRename -XPUT -d'panic()'
$ bin/etcdctl defrag

$ Old Terminal
{"level":"info","ts":"2024-04-09T23:28:54.084434+0800","caller":"v3rpc/maintenance.go:90","msg":"starting defragment"}
{"level":"info","ts":"2024-04-09T23:28:54.087363+0800","caller":"backend/backend.go:509","msg":"defragmenting","path":"default.etcd/member/snap/db","current-db-size-bytes":20480,"current-db-size":"20 kB","current-db-size-in-use-bytes":16384,"current-db-size-in-use":"16 kB"}
{"level":"info","ts":"2024-04-09T23:28:54.091229+0800","caller":"v3rpc/health.go:62","msg":"grpc service status changed","service":"","status":"SERVING"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2553]

goroutine 156 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00061ac80, {0x12b7780, 0x1a73420}, {0x1a113e8, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7f6280613618?, 0xc00061ac80?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000334b80, {0x7f6280613618, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000334b80, {0x12b77f0, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xf2
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2216 +0x105
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0x0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00002df70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2259 +0xa8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2440 +0x59
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2438 +0xf9
```

REF: etcd-io/bbolt#715

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/etcd that referenced this issue Nov 6, 2024
We should exit as soon as possible if there is panic during defrag.
Because that tx might be closed. The inflight request might use invalid
tx and then panic as well. However, the real panic might be shadowed by
new panic. It's related to goroutine schedule. So, we should fatal here.

How to reproduce bbolt#715:

```diff
diff --git a/server/etcdserver/api/v3rpc/maintenance.go b/server/etcdserver/api/v3rpc/maintenance.go
index 4f55c7c74..0a91b4225 100644
--- a/server/etcdserver/api/v3rpc/maintenance.go
+++ b/server/etcdserver/api/v3rpc/maintenance.go
@@ -89,7 +89,13 @@ func NewMaintenanceServer(s *etcdserver.EtcdServer, healthNotifier notifier) pb.
 func (ms *maintenanceServer) Defragment(ctx context.Context, sr *pb.DefragmentRequest) (*pb.DefragmentResponse, error) {
        ms.lg.Info("starting defragment")
        ms.healthNotifier.defragStarted()
-       defer ms.healthNotifier.defragFinished()
+       defer func() {
+               ms.healthNotifier.defragFinished()
+               if rerr := recover(); rerr != nil {
+                       time.Sleep(30 * time.Second)
+                       panic(rerr)
+               }
+       }()
        err := ms.bg.Backend().Defrag()
        if err != nil {
                ms.lg.Warn("failed to defragment", zap.Error(err))
```

```bash
$ make gofail-enable
$ make
$ GOFAIL_HTTP="127.0.0.1:1234" bin/etcd

$ New Terminal
$ curl http://127.0.0.1:1234/defragBeforeRename -XPUT -d'panic()'
$ bin/etcdctl defrag

$ Old Terminal
{"level":"info","ts":"2024-04-09T23:28:54.084434+0800","caller":"v3rpc/maintenance.go:90","msg":"starting defragment"}
{"level":"info","ts":"2024-04-09T23:28:54.087363+0800","caller":"backend/backend.go:509","msg":"defragmenting","path":"default.etcd/member/snap/db","current-db-size-bytes":20480,"current-db-size":"20 kB","current-db-size-in-use-bytes":16384,"current-db-size-in-use":"16 kB"}
{"level":"info","ts":"2024-04-09T23:28:54.091229+0800","caller":"v3rpc/health.go:62","msg":"grpc service status changed","service":"","status":"SERVING"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2553]

goroutine 156 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00061ac80, {0x12b7780, 0x1a73420}, {0x1a113e8, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7f6280613618?, 0xc00061ac80?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000334b80, {0x7f6280613618, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000334b80, {0x12b77f0, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xf2
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2216 +0x105
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0x0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00002df70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2259 +0xa8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2440 +0x59
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2438 +0xf9
```

REF: etcd-io/bbolt#715

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants