Skip to content
This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

BR won't clean up the environment when exit by SIGTERM #557

Closed
YuJuncen opened this issue Oct 16, 2020 · 3 comments · Fixed by #559
Closed

BR won't clean up the environment when exit by SIGTERM #557

YuJuncen opened this issue Oct 16, 2020 · 3 comments · Fixed by #559
Assignees
Labels
difficulty/1-easy Easy issue Priority/P0 Top priority issue. Must have an associated milestone type/bug Something isn't working

Comments

@YuJuncen
Copy link
Collaborator

YuJuncen commented Oct 16, 2020

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.
  • start BR (restore or backcup with --remove-schedulers)
  • waiting for the progress bar present, then press ctrl + c
  1. What did you expect to see?
    The cluster config changed by BR should be undone, since SIGTERM allows us to gracefully stop.

  2. What did you see instead?
    The cluster has stuck in the config that BR has set. (For current master, PD schedulers could be reset due to scheduler: use pause instead of remove schedulers #551 )

image

  1. What version of BR and TiDB/TiKV/PD are you using?

v4.0.7

Note:

We listen to signals here:

br/main.go

Lines 34 to 39 in d2d5bba

case syscall.SIGTERM:
cancel()
os.Exit(0)
default:
cancel()
os.Exit(1)

Canceling the context could make other goroutines eventually exit and clean up, but we leave no time for them.

Add a time.Sleep(30 * time.Second) remove those os.Exits could help. But there are still some problems:

br/pkg/task/backup.go

Lines 222 to 227 in d2d5bba

restore, e := mgr.RemoveSchedulers(ctx)
defer func() {
if restoreE := restore(ctx); restoreE != nil {
log.Warn("failed to restore removed schedulers, you may need to restore them manually", zap.Error(restoreE))
}
}()

We use the global context to do the cleanup tasks, which will always fail if the outer context is canceled. We should change it to a new context with a timeout, the timeout could be the same as the sleep time before stopping.

@YuJuncen YuJuncen added the type/bug Something isn't working label Oct 16, 2020
@kennytm kennytm added difficulty/1-easy Easy issue Priority/P0 Top priority issue. Must have an associated milestone labels Oct 16, 2020
@kennytm kennytm added this to the v5.0.0-alpha milestone Oct 16, 2020
@kennytm
Copy link
Collaborator

kennytm commented Oct 16, 2020

maybe just remove these two os.Exit?

@YuJuncen
Copy link
Collaborator Author

YuJuncen commented Oct 16, 2020

maybe just remove these two os.Exit?

Yep! Sleep isn't needed. But we will always exit with 0 (and we would get a debug log a little hard to read, maybe).

@kennytm
Copy link
Collaborator

kennytm commented Oct 17, 2020

But we will always exit with 0 (and we would get a debug log a little hard to read, maybe).

You exit with non-zero when err != nil.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
difficulty/1-easy Easy issue Priority/P0 Top priority issue. Must have an associated milestone type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants