storage/backend: fatal if there is panic during defrag #17757
Conversation
ping @ahrtr @serathius @jmhbnz @siyuanfoundation, requesting a review on this. Hope it can help with the failpoint tests.
server/storage/backend/backend.go
```go
defer func() {
	// NOTE: We should exit as soon as possible because that tx
	// might be closed. The inflight request might use invalid
	// tx and then panic as well. The real panic reason might be
	// shadowed by new panic. So, we should fatal here with lock.
	if rerr := recover(); rerr != nil {
		b.lg.Fatal("unexpected panic during defrag", zap.Any("panic", rerr))
	}
}()
```
During defragmentation, we acquire all locks, so no TX (neither batchTx nor readTx) will ever be closed by other operations, because all other operations are blocked. The only exception is that there may be some inflight concurrent read operations, but those never close any TX either.
Also, etcdserver waits for all read TXs to finish when closing the bbolt db, so the defrag will not impact any read TXs either.
I agree printing the real panic is a little better so that it won't be shadowed by any other new panic. But I would suggest that we have an overall solution as mentioned in etcd-io/bbolt#715 (comment).
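For context, a rough, self-contained sketch of the locking described above; the field names are hypothetical and the real ones in server/storage/backend/backend.go may differ:

```go
package main

import "sync"

// Hypothetical, simplified view of the backend. The idea is that
// defragmentation holds every lock, so no other operation can open,
// use, or close a transaction while it runs.
type backend struct {
	batchTxMu sync.Mutex   // serializes writers / the batch transaction
	dbMu      sync.RWMutex // guards the underlying db handle
	readTxMu  sync.Mutex   // serializes access to the shared read transaction
}

func (b *backend) defrag() error {
	b.batchTxMu.Lock() // block writers; the ongoing batch tx is committed under this lock
	defer b.batchTxMu.Unlock()

	b.dbMu.Lock() // nobody can swap or close the db handle concurrently
	defer b.dbMu.Unlock()

	b.readTxMu.Lock() // block new concurrent reads while the read tx is reset
	defer b.readTxMu.Unlock()

	// ... copy all keys into a temporary db, rename it over the old file,
	// and re-open the batch/read transactions before releasing the locks ...
	return nil
}

func main() {
	_ = (&backend{}).defrag()
}
```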
> But I would suggest that we have an overall solution as mentioned
I do have a plan. Still working on a POC. Will file a pull request as an RFC when it's ready.
> I agree printing the real panic is a little better so that it won't be shadowed by any other new panic.
Side note: the original issue is from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.
> Side note: the original issue is from the robustness test suite. The failpoint was injected while serving requests, not in the shutdown phase.
In a real production environment, the only possible reason for a panic during defragmentation is a system error, e.g. power-off, OOM, I/O issues, etc.
Yes. So, in this patch, calling fatal while still holding the lock is a kind of fail-fast. :) Even with a better drain solution, it's still possible to hit the nil pointer while handling other inflight requests.
I suggest not to use b.lg.Fatal:
- It calls os.Exit(1) to terminate the process immediately. Deferred functions won't be executed, so no cleanup tasks will run.
- As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.
- Also, I don't see any relationship between this PR and the issue "UnsafeRange panicking during shutdown" #17223.
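A minimal, self-contained illustration of the first point above (not etcd code): os.Exit terminates the process immediately and skips deferred functions, which is essentially what zap's Fatal does after writing the log entry.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	defer fmt.Println("cleanup") // never printed: os.Exit skips deferred calls
	fmt.Println("exiting")
	os.Exit(1)
}
```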
Fatal will call os.Exit(1) after flushing the pending log. I think that's better than a bare os.Exit(1) here.
> Also I don't see any relationship between this PR and the issue #17223
Yes. It's about bbolt#715. I thought it was related to the termination phase, so I linked it. Sorry for the confusion.
> As mentioned above, during defragmentation we acquire all locks, so defragmentation won't conflict with any other operations. The TXN won't be closed by any other operation/goroutine during defragmentation.
Yes. However, if there is a panic, all the deferred functions are still called and the lock is released. If any goroutine is waiting to acquire the lock, it can then get the lock and run its query; it depends on goroutine scheduling. Before the exit caused by the panic, other goroutines still have a chance to execute something; that depends on the kernel and the Go runtime.
Based on the bbolt#715 comment, the failpoint causes the original db reference to be closed. The unsafe query then panics as well and shadows the original panic.
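To make that scheduling window concrete, here is a small self-contained sketch (hypothetical types, not etcd code) that mirrors the failpoint repro below: the deferred unlock runs while the first panic unwinds, a blocked reader then grabs the lock, dereferences the now-nil tx, and its nil-pointer panic terminates the process before the original panic resurfaces.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-in for the backend: a mutex guarding a tx pointer that a
// failing "defrag" nils out before panicking.
type backend struct {
	mu sync.Mutex
	tx *struct{ name string }
}

func (b *backend) defrag() {
	b.mu.Lock()
	defer b.mu.Unlock() // the deferred unlock still runs while the panic unwinds
	b.tx = nil          // simulate the tx being invalidated mid-defrag
	panic("defrag failed")
}

func (b *backend) read() {
	b.mu.Lock()
	defer b.mu.Unlock()
	fmt.Println(b.tx.name) // nil dereference: a second panic that shadows the first
}

func main() {
	b := &backend{tx: &struct{ name string }{name: "live tx"}}

	go func() {
		defer func() {
			if r := recover(); r != nil {
				// Mirror the failpoint from the repro below: delay before
				// re-panicking, giving the blocked reader a chance to run first.
				time.Sleep(100 * time.Millisecond)
				panic(r)
			}
		}()
		b.defrag()
	}()

	time.Sleep(10 * time.Millisecond) // let defrag grab the lock and fail first
	b.read()                          // usually panics with a nil-pointer error, not "defrag failed"
}
```

With this PR's change, the deferred recover in the defrag path logs the original panic reason via lg.Fatal and exits before a reader can observe the invalid tx.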
We should exit as soon as possible if there is a panic during defrag, because that tx might be closed. The inflight request might use the invalid tx and then panic as well. However, the real panic might be shadowed by the new panic; it depends on goroutine scheduling. So, we should fatal here.

How to reproduce bbolt#715:

```diff
diff --git a/server/etcdserver/api/v3rpc/maintenance.go b/server/etcdserver/api/v3rpc/maintenance.go
index 4f55c7c74..0a91b4225 100644
--- a/server/etcdserver/api/v3rpc/maintenance.go
+++ b/server/etcdserver/api/v3rpc/maintenance.go
@@ -89,7 +89,13 @@ func NewMaintenanceServer(s *etcdserver.EtcdServer, healthNotifier notifier) pb.
 func (ms *maintenanceServer) Defragment(ctx context.Context, sr *pb.DefragmentRequest) (*pb.DefragmentResponse, error) {
 	ms.lg.Info("starting defragment")
 	ms.healthNotifier.defragStarted()
-	defer ms.healthNotifier.defragFinished()
+	defer func() {
+		ms.healthNotifier.defragFinished()
+		if rerr := recover(); rerr != nil {
+			time.Sleep(30 * time.Second)
+			panic(rerr)
+		}
+	}()
 	err := ms.bg.Backend().Defrag()
 	if err != nil {
 		ms.lg.Warn("failed to defragment", zap.Error(err))
```

```bash
$ make gofail-enable
$ make
$ GOFAIL_HTTP="127.0.0.1:1234" bin/etcd

$ # New Terminal
$ curl http://127.0.0.1:1234/defragBeforeRename -XPUT -d'panic()'
$ bin/etcdctl defrag

$ # Old Terminal
{"level":"info","ts":"2024-04-09T23:28:54.084434+0800","caller":"v3rpc/maintenance.go:90","msg":"starting defragment"}
{"level":"info","ts":"2024-04-09T23:28:54.087363+0800","caller":"backend/backend.go:509","msg":"defragmenting","path":"default.etcd/member/snap/db","current-db-size-bytes":20480,"current-db-size":"20 kB","current-db-size-in-use-bytes":16384,"current-db-size-in-use":"16 kB"}
{"level":"info","ts":"2024-04-09T23:28:54.091229+0800","caller":"v3rpc/health.go:62","msg":"grpc service status changed","service":"","status":"SERVING"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbb2553]

goroutine 156 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	go.etcd.io/bbolt@v1.4.0-alpha.0/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc00061ac80, {0x12b7780, 0x1a73420}, {0x1a113e8, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
	go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7f6280613618?, 0xc00061ac80?})
	go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000334b80, {0x7f6280613618, 0xc00061ac80})
	go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000334b80, {0x12b77f0, 0xc00061ac80})
	go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xf2
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000393c08)
	go.etcd.io/etcd/server/v3/etcdserver/server.go:2216 +0x105
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0x0?)
	go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00002df70)
	go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000393c08)
	go.etcd.io/etcd/server/v3/etcdserver/server.go:2259 +0xa8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
	go.etcd.io/etcd/server/v3/etcdserver/server.go:2440 +0x59
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
	go.etcd.io/etcd/server/v3/etcdserver/server.go:2438 +0xf9
```

REF: etcd-io/bbolt#715

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Force-pushed from c8e050b to c438fcb.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files: ... and 21 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #17757      +/-   ##
==========================================
- Coverage   68.80%   68.76%   -0.04%
==========================================
  Files         420      420
  Lines       35524    35527       +3
==========================================
- Hits        24441    24430      -11
- Misses       9656     9665       +9
- Partials     1427     1432       +5
```
Kindly ping @ahrtr @serathius, updated, PTAL, thanks.
ping @serathius, would you please take a look at this? I think it can reduce confusion about the panic information.
Is this still needed after #18822? What do we get from recovering from the panic?
Yes. L532 (server/storage/backend/backend.go, lines 528 to 534 at 14f4df4) can panic, either from bbolt or from a failpoint.
There are several cases here. This change doesn't alter the original panic information; it just returns fast and exits.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahrtr, fuweid, serathius.
Thanks @fuweid for the detailed explanation offline. A panic will execute all defers while not immediately stopping other goroutines from running. This allows a window where we unlock the mutex while the txn is nil, and other goroutines will panic. The proposed mitigation is simple and direct enough.