
storage/backend: fatal if there is panic during defrag #17757

Merged: 1 commit into etcd-io:main on Nov 22, 2024

Conversation


@fuweid fuweid commented Apr 9, 2024

We should exit as soon as possible if there is a panic during defrag, because the tx might already be closed. Inflight requests could then use the invalid tx and panic as well, and the real panic might be shadowed by the new one, depending on goroutine scheduling. So we should fatal here.

How to reproduce bbolt#715:

```diff
diff --git a/server/etcdserver/api/v3rpc/maintenance.go b/server/etcdserver/api/v3rpc/maintenance.go
index 4f55c7c74..0a91b4225 100644
--- a/server/etcdserver/api/v3rpc/maintenance.go
+++ b/server/etcdserver/api/v3rpc/maintenance.go
@@ -89,7 +89,13 @@ func NewMaintenanceServer(s *etcdserver.EtcdServer, healthNotifier notifier) pb.
 func (ms *maintenanceServer) Defragment(ctx context.Context, sr *pb.DefragmentRequest) (*pb.DefragmentResponse, error) {
        ms.lg.Info("starting defragment")
        ms.healthNotifier.defragStarted()
-       defer ms.healthNotifier.defragFinished()
+       defer func() {
+               ms.healthNotifier.defragFinished()
+               if rerr := recover(); rerr != nil {
+                       time.Sleep(30 * time.Second)
+                       panic(rerr)
+               }
+       }()
        err := ms.bg.Backend().Defrag()
        if err != nil {
                ms.lg.Warn("failed to defragment", zap.Error(err))
```
```bash
$ make gofail-enable
$ make
$ GOFAIL_HTTP="127.0.0.1:1234" bin/etcd

# New terminal
$ curl http://127.0.0.1:1234/defragBeforeRename -XPUT -d'panic()'
$ bin/etcdctl defrag

# Old terminal (etcd output)
{"level":"info","ts":"2024-04-09T23:28:54.084434+0800","caller":"v3rpc/maintenance.go:90","msg":"starting defragment"}
{"level":"info","ts":"2024-04-09T23:28:54.087363+0800","caller":"backend/backend.go:509","msg":"defragmenting","path":"default.etcd/member/snap/db","current-db-size-bytes":20480,"current-db-size":"20 kB","current-db-size-in-use-bytes":16384,"current-db-size-in-use":"16 kB"}
{"level":"info","ts":"2024-04-09T23:28:54.091229+0800","caller":"v3rpc/health.go:62","msg":"grpc service status changed","service":"","status":"SERVING"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2553]

goroutine 156 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00061ac80, {0x12b7780, 0x1a73420}, {0x1a113e8, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7f6280613618?, 0xc00061ac80?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000334b80, {0x7f6280613618, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000334b80, {0x12b77f0, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xf2
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2216 +0x105
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0x0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00002df70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2259 +0xa8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2440 +0x59
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2438 +0xf9
```

REF: etcd-io/bbolt#715

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.


fuweid commented Apr 12, 2024

Ping @ahrtr @serathius @jmhbnz @siyuanfoundation, requesting a review on this. I hope it can help with failpoint tests.

Comment on lines 480 to 488
```go
	defer func() {
		// NOTE: We should exit as soon as possible because that tx
		// might be closed. The inflight request might use invalid
		// tx and then panic as well. The real panic reason might be
		// shadowed by new panic. So, we should fatal here with lock.
		if rerr := recover(); rerr != nil {
			b.lg.Fatal("unexpected panic during defrag", zap.Any("panic", rerr))
		}
	}()
```
Member


During defragmentation, we acquire all locks, so any TXs (including both batchTx and readTx) will never be closed by other operations, because all other operations will be blocked. The only exception is that there may be some inflight concurrent read operations, but those will never close any TX either.

Also, etcdserver waits for all read TXs to finish when closing the bbolt db, so defrag will not impact any read TXs either.
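To make that invariant concrete, here is a toy model (illustrative only; the toyBackend type and its RWMutex are stand-ins, not etcd's actual backend code):

```go
package main

import (
	"fmt"
	"sync"
)

// toyBackend models the claim: readers hold a read lock while they use
// the tx, and defrag takes the exclusive lock, so it can only swap the
// tx once no reader is still observing it.
type toyBackend struct {
	mu sync.RWMutex
	tx *struct{ id int }
}

func (b *toyBackend) read() {
	b.mu.RLock()
	defer b.mu.RUnlock()
	fmt.Println("reading with tx", b.tx.id) // tx cannot be swapped mid-read
}

func (b *toyBackend) defrag() {
	b.mu.Lock() // waits for inflight readers; blocks new operations
	defer b.mu.Unlock()
	b.tx = &struct{ id int }{id: 2} // safe: exclusive access
}

func main() {
	b := &toyBackend{tx: &struct{ id int }{id: 1}}
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.read()
		}()
	}
	b.defrag()
	wg.Wait()
}
```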

I agree that printing the real panic is a little better, so that it won't be shadowed by any other new panic. But I would suggest that we have an overall solution as mentioned in etcd-io/bbolt#715 (comment).

Member Author


> But I would suggest that we have an overall solution as mentioned

I do have a plan. I'm still working on a POC and will file a pull request as an RFC when it's ready.

> I agree that printing the real panic is a little better, so that it won't be shadowed by any other new panic.

Side note: the original issue came from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.

Member


> Side note: the original issue came from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.

In a real production environment, the only possible reason for a panic during defragmentation is a system error, e.g. power-off, OOM, I/O issues, etc.


@fuweid fuweid Apr 14, 2024


Yes. So in this patch, calling fatal while holding the lock is a kind of fail-fast. :) Even if we had a better drain solution, it would still be possible to hit a nil pointer while handling other inflight requests.

Member


I suggest not to use b.lg.Fatal:

  • It calls os.Exit(1) to terminate the process immediately, so deferred functions and any cleanup tasks won't get executed.
  • As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.
  • Also, I don't see any relationship between this PR and the issue "UnsafeRange panicking during shutdown" #17223.

Member Author


Fatal calls os.Exit(1) after flushing pending logs. I think that's better than a bare os.Exit(1) here.
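A minimal sketch of that difference, assuming zap's public API (the error value here is invented purely for illustration):

```go
package main

import (
	"errors"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	err := errors.New("simulated defrag failure") // hypothetical error for the sketch
	// zap's Fatal writes the entry, syncs the underlying output, and
	// then calls os.Exit(1), so the final message is flushed before
	// the process exits. A bare os.Exit(1) would skip the write and
	// could drop buffered log output entirely.
	logger.Fatal("unexpected panic during defrag", zap.Error(err))
}
```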

> Also, I don't see any relationship between this PR and the issue #17223

Yes, it's about bbolt#715. I was thinking it related to the termination phase, so I linked it. Sorry for the confusion.

> As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.

Yes. However, if there is a panic, all the deferred functions will be called and the lock will be released. Any pending lock acquisition can then succeed and run a query. It depends on goroutine scheduling: before the exit caused by the panic, other goroutines still have a chance to execute, depending on the kernel and the Go runtime.

Based on the bbolt#715 comment, the failpoint leaves the original db reference closed. The unsafe query then panics as well and shadows the original panic.
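A self-contained sketch of that window (illustrative only; mu, current, and the goroutines stand in for etcd's backend lock, tx, and inflight requests):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type tx struct{ data string }

var (
	mu      sync.Mutex
	current = &tx{data: "live"}
)

func defrag() {
	// Mimics the failpoint repro: recover, linger, then re-panic.
	defer func() {
		if r := recover(); r != nil {
			time.Sleep(100 * time.Millisecond) // window for other goroutines to run
			panic(r)
		}
	}()
	mu.Lock()
	defer mu.Unlock() // runs during unwinding, releasing the lock early
	current = nil
	panic("defrag failed") // the "real" panic
}

func main() {
	go func() {
		time.Sleep(10 * time.Millisecond) // let defrag take the lock first
		mu.Lock()
		defer mu.Unlock()
		fmt.Println(current.data) // nil dereference: this panic shadows the original
	}()
	defrag()
}
```

Running this typically crashes with the nil-pointer panic from the reader goroutine rather than with "defrag failed", which is exactly the shadowing described above.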

@serathius serathius self-requested a review April 19, 2024 11:23

codecov-commenter commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.76%. Comparing base (75274b2) to head (7cf70c0).
Report is 78 commits behind head on main.

Current head 7cf70c0 differs from pull request most recent head c438fcb

Please upload reports for the commit c438fcb to get more accurate results.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/storage/backend/backend.go | 33.33% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files:

| Files with missing lines | Coverage Δ |
| --- | --- |
| server/storage/backend/backend.go | 82.37% <33.33%> (-0.46%) ⬇️ |

... and 21 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #17757      +/-   ##
==========================================
- Coverage   68.80%   68.76%   -0.04%     
==========================================
  Files         420      420              
  Lines       35524    35527       +3     
==========================================
- Hits        24441    24430      -11     
- Misses       9656     9665       +9     
- Partials     1427     1432       +5     
```


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data



fuweid commented Nov 7, 2024

Kindly ping @ahrtr @serathius. Updated, PTAL. Thanks!


fuweid commented Nov 19, 2024

Ping @serathius, would you please take a look at this? I think it can reduce confusion about the panic information.

@serathius
Member

Is this still needed after #18822? What do we get from recovering from the panic?


fuweid commented Nov 21, 2024

> Is this still needed after #18822?

Yes. L532 (in the following code) can panic, either from bbolt or from a failpoint. If it panics, the tx will be nil. Any goroutines still using tx will then panic as well, but that panic information differs from the defrag one; we might only see the nil-pointer message. So this change makes etcd exit as soon as possible with the correct panic information.

```go
	b.batchTx.unsafeCommit(true)
	b.batchTx.tx = nil
	// gofail: var defragBeforeCopy struct{}
	err = defragdb(b.db, tmpdb, defragLimit)
	if err != nil {
		tmpdb.Close()
```

> What do we get from recovering from the panic?

There are several cases here:

  • The bbolt file is corrupted and the bbolt package panics.
  • A failpoint injects panic information.
  • ...

This change doesn't alter the original panic information; it just fails fast and exits.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius
Member

Thanks @fuweid for the detailed explanation offline. A panic executes all defers while not immediately stopping other goroutines from running. This allows a window where we unlock the mutex while the txn is nil, and other goroutines then panic.

The proposed mitigation is simple and direct enough.

@serathius serathius merged commit f77fa90 into etcd-io:main Nov 22, 2024
34 checks passed