
storage/backend: fatal if there is panic during defrag #17757

Merged: 1 commit into etcd-io:main on Nov 22, 2024

Conversation


@fuweid fuweid commented Apr 9, 2024

We should exit as soon as possible if there is a panic during defrag, because the tx might already be closed. Inflight requests could then use the invalid tx and panic as well, and the real panic might be shadowed by the new one, depending on goroutine scheduling. So we should fatal here.

How to reproduce bbolt#715:

```diff
diff --git a/server/etcdserver/api/v3rpc/maintenance.go b/server/etcdserver/api/v3rpc/maintenance.go
index 4f55c7c74..0a91b4225 100644
--- a/server/etcdserver/api/v3rpc/maintenance.go
+++ b/server/etcdserver/api/v3rpc/maintenance.go
@@ -89,7 +89,13 @@ func NewMaintenanceServer(s *etcdserver.EtcdServer, healthNotifier notifier) pb.
 func (ms *maintenanceServer) Defragment(ctx context.Context, sr *pb.DefragmentRequest) (*pb.DefragmentResponse, error) {
        ms.lg.Info("starting defragment")
        ms.healthNotifier.defragStarted()
-       defer ms.healthNotifier.defragFinished()
+       defer func() {
+               ms.healthNotifier.defragFinished()
+               if rerr := recover(); rerr != nil {
+                       time.Sleep(30 * time.Second)
+                       panic(rerr)
+               }
+       }()
        err := ms.bg.Backend().Defrag()
        if err != nil {
                ms.lg.Warn("failed to defragment", zap.Error(err))
```
```bash
$ make gofail-enable
$ make
$ GOFAIL_HTTP="127.0.0.1:1234" bin/etcd

# New terminal
$ curl http://127.0.0.1:1234/defragBeforeRename -XPUT -d'panic()'
$ bin/etcdctl defrag

# Old terminal (etcd output)
{"level":"info","ts":"2024-04-09T23:28:54.084434+0800","caller":"v3rpc/maintenance.go:90","msg":"starting defragment"}
{"level":"info","ts":"2024-04-09T23:28:54.087363+0800","caller":"backend/backend.go:509","msg":"defragmenting","path":"default.etcd/member/snap/db","current-db-size-bytes":20480,"current-db-size":"20 kB","current-db-size-in-use-bytes":16384,"current-db-size-in-use":"16 kB"}
{"level":"info","ts":"2024-04-09T23:28:54.091229+0800","caller":"v3rpc/health.go:62","msg":"grpc service status changed","service":"","status":"SERVING"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2553]

goroutine 156 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00061ac80, {0x12b7780, 0x1a73420}, {0x1a113e8, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7f6280613618?, 0xc00061ac80?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000334b80, {0x7f6280613618, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000334b80, {0x12b77f0, 0xc00061ac80})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xf2
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2216 +0x105
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0x0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00002df70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000393c08)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2259 +0xa8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2440 +0x59
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2438 +0xf9
```

REF: etcd-io/bbolt#715

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.


fuweid commented Apr 12, 2024

Ping @ahrtr @serathius @jmhbnz @siyuanfoundation, requesting a review on this. I hope it can help with failpoint tests.

Comment on lines 480 to 488
```go
	defer func() {
		// NOTE: We should exit as soon as possible because that tx
		// might be closed. The inflight request might use invalid
		// tx and then panic as well. The real panic reason might be
		// shadowed by new panic. So, we should fatal here with lock.
		if rerr := recover(); rerr != nil {
			b.lg.Fatal("unexpected panic during defrag", zap.Any("panic", rerr))
		}
	}()
```
Member


During defragmentation, we acquire all locks, so any TXs (including both batchTx and readTx) will never be closed by other operations, because all other operations will be blocked. The only exception is that there may be some inflight concurrent read operations, but those will never close any TX either.

Also, etcdserver waits for all read TXs to finish when closing the bbolt db, so defrag will not impact any read TXs either.
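To make that invariant concrete, here is a toy model (illustrative only; the toyBackend type and its RWMutex are stand-ins, not etcd's actual backend code):

```go
package main

import (
	"fmt"
	"sync"
)

// toyBackend models the claim: readers hold a read lock while they use
// the tx, and defrag takes the exclusive lock, so it can only swap the
// tx once no reader is still observing it.
type toyBackend struct {
	mu sync.RWMutex
	tx *struct{ id int }
}

func (b *toyBackend) read() {
	b.mu.RLock()
	defer b.mu.RUnlock()
	fmt.Println("reading with tx", b.tx.id) // tx cannot be swapped mid-read
}

func (b *toyBackend) defrag() {
	b.mu.Lock() // waits for inflight readers; blocks new operations
	defer b.mu.Unlock()
	b.tx = &struct{ id int }{id: 2} // safe: exclusive access
}

func main() {
	b := &toyBackend{tx: &struct{ id int }{id: 1}}
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.read()
		}()
	}
	b.defrag()
	wg.Wait()
}
```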

I agree that printing the real panic is a little better, so that it won't be shadowed by any other new panic. But I would suggest that we have an overall solution as mentioned in etcd-io/bbolt#715 (comment).

Member Author


> But I would suggest that we have an overall solution as mentioned

I do have a plan. I'm still working on a POC and will file a pull request as an RFC when it's ready.

> I agree that printing the real panic is a little better, so that it won't be shadowed by any other new panic.

Side note: the original issue came from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.

Member


> Side note: the original issue came from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.

In a real production environment, the only possible reason for a panic during defragmentation is a system error, e.g. power-off, OOM, I/O issues, etc.


@fuweid fuweid Apr 14, 2024


Yes. So in this patch, calling fatal while holding the lock is a kind of fail-fast. :) Even if we had a better drain solution, it would still be possible to hit a nil pointer while handling other inflight requests.

Member


I suggest not to use b.lg.Fatal:

  • It calls os.Exit(1) to terminate the process immediately, so deferred functions and any cleanup tasks won't get executed.
  • As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.
  • Also, I don't see any relationship between this PR and the issue "UnsafeRange panicking during shutdown" #17223.

Member Author


Fatal calls os.Exit(1) after flushing pending logs. I think that's better than a bare os.Exit(1) here.
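A minimal sketch of that difference, assuming zap's public API (the error value here is invented purely for illustration):

```go
package main

import (
	"errors"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	err := errors.New("simulated defrag failure") // hypothetical error for the sketch
	// zap's Fatal writes the entry, syncs the underlying output, and
	// then calls os.Exit(1), so the final message is flushed before
	// the process exits. A bare os.Exit(1) would skip the write and
	// could drop buffered log output entirely.
	logger.Fatal("unexpected panic during defrag", zap.Error(err))
}
```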

> Also, I don't see any relationship between this PR and the issue #17223

Yes, it's about bbolt#715. I was thinking it related to the termination phase, so I linked it. Sorry for the confusion.

> As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.

Yes. However, if there is a panic, all the deferred functions will be called and the lock will be released. Any pending lock acquisition can then succeed and run a query. It depends on goroutine scheduling: before the exit caused by the panic, other goroutines still have a chance to execute, depending on the kernel and the Go runtime.

Based on the bbolt#715 comment, the failpoint leaves the original db reference closed. The unsafe query then panics as well and shadows the original panic.
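A self-contained sketch of that window (illustrative only; mu, current, and the goroutines stand in for etcd's backend lock, tx, and inflight requests):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type tx struct{ data string }

var (
	mu      sync.Mutex
	current = &tx{data: "live"}
)

func defrag() {
	// Mimics the failpoint repro: recover, linger, then re-panic.
	defer func() {
		if r := recover(); r != nil {
			time.Sleep(100 * time.Millisecond) // window for other goroutines to run
			panic(r)
		}
	}()
	mu.Lock()
	defer mu.Unlock() // runs during unwinding, releasing the lock early
	current = nil
	panic("defrag failed") // the "real" panic
}

func main() {
	go func() {
		time.Sleep(10 * time.Millisecond) // let defrag take the lock first
		mu.Lock()
		defer mu.Unlock()
		fmt.Println(current.data) // nil dereference: this panic shadows the original
	}()
	defrag()
}
```

Running this typically crashes with the nil-pointer panic from the reader goroutine rather than with "defrag failed", which is exactly the shadowing described above.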

@serathius serathius self-requested a review April 19, 2024 11:23

codecov-commenter commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.76%. Comparing base (75274b2) to head (7cf70c0).
Report is 78 commits behind head on main.

Current head 7cf70c0 differs from pull request most recent head c438fcb

Please upload reports for the commit c438fcb to get more accurate results.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/storage/backend/backend.go | 33.33% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files:

| Files with missing lines | Coverage Δ |
| --- | --- |
| server/storage/backend/backend.go | 82.37% <33.33%> (-0.46%) ⬇️ |

... and 21 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #17757      +/-   ##
==========================================
- Coverage   68.80%   68.76%   -0.04%     
==========================================
  Files         420      420              
  Lines       35524    35527       +3     
==========================================
- Hits        24441    24430      -11     
- Misses       9656     9665       +9     
- Partials     1427     1432       +5     
```


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data



fuweid commented Nov 7, 2024

Kindly ping @ahrtr @serathius. Updated, PTAL. Thanks!


fuweid commented Nov 19, 2024

Ping @serathius, would you please take a look at this? I think it can reduce confusion about the panic information.

@serathius
Member

Is this still needed after #18822? What do we get from recovering from the panic?


fuweid commented Nov 21, 2024

> Is this still needed after #18822?

Yes. L532 (in the following code) can panic, either from bbolt or from a failpoint. If it panics, the tx will be nil. Any goroutines still using tx will then panic as well, but that panic information differs from the defrag one; we might only see the nil-pointer message. So this change makes etcd exit as soon as possible with the correct panic information.

```go
	b.batchTx.unsafeCommit(true)
	b.batchTx.tx = nil
	// gofail: var defragBeforeCopy struct{}
	err = defragdb(b.db, tmpdb, defragLimit)
	if err != nil {
		tmpdb.Close()
```

> What do we get from recovering from the panic?

There are several cases here:

  • The bbolt file is corrupted and the bbolt package panics.
  • A failpoint injects panic information.
  • ...

This change doesn't alter the original panic information; it just fails fast and exits.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius
Member

Thanks @fuweid for the detailed explanation offline. A panic executes all defers while not immediately stopping other goroutines from running. This allows a window where we unlock the mutex while the txn is nil, and other goroutines then panic.

The proposed mitigation is simple and direct enough.

@serathius serathius merged commit f77fa90 into etcd-io:main Nov 22, 2024
34 checks passed