Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POS-86: Graceful shutdown of Heimdall deamon, REST and Bridge #753

Merged
merged 19 commits into from
Dec 8, 2021

Conversation

momoshell
Copy link
Contributor

  • Graceful shutdown of Heimdall, REST, and Bridge are all initiated by the same context signaled by the OS on interrupt and program termination
  • Necessary code duplications are due to the private functions and types from Cosmos SDK that needed changes to accommodate shutdowns

@codecov-commenter
Copy link

codecov-commenter commented Dec 3, 2021

Codecov Report

Merging #753 (11e8dbd) into v2 (6a93c08) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##               v2     #753   +/-   ##
=======================================
  Coverage   72.88%   72.88%           
=======================================
  Files          49       49           
  Lines        3507     3507           
=======================================
  Hits         2556     2556           
  Misses        723      723           
  Partials      228      228           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a93c08...11e8dbd. Read the comment docs.

Copy link
Member

@vcastellm vcastellm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git history seems to have garbage, can you rebase with master?

@sh3ll3x3c
Copy link

We'll most definitely have to squash the commits...

@momoshell
Copy link
Contributor Author

momoshell commented Dec 3, 2021

@vcastellm not sure about master though, since this was done through V2 fork, hence all the noise.
in the future, forks wont be used any more

@sh3ll3x3c agree, definitely

Copy link
Contributor

@ferranbt ferranbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, it turns out that this "small" change include a lot of code changes with respect other PRs. I would like to propose another way to handle the graceful shutdown.

Now, what you do is to have some "start" functions that setup the service (i.e. bridge, rest, daemon) and wait for it to finish. Then, you need to pass around the ctx so that they know when to stop.

Instead, something I usually do is for these "start" functions to only start (no wait to finish) and return some structs that export a Close function. Then, it is up to the functions creating and running the services (in this case, the cli commands) to start everything, wait for interrupt signal and Close every service. This Close function can be blocking and wait for it to finish everything.

@@ -28,6 +30,105 @@ const (
waitDuration = 1 * time.Minute
)

// StartBridgeWithCtx starts bridge service and is able to shutdow gracefully
// returns service errors, if any
func StartBridgeWithCtx(shutdownCtx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this doing the same thing as StartBridge does? I see a lot of repeated code. If so, please merge both functions together.

Copy link
Contributor Author

@momoshell momoshell Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature is changed, StartBridgeWithCtx takes context instead of bool flag, and returns an error so it could be used with error groups properly. And the logic was changed so we would be provided with better overall control.

StartBridge was left there since it's being used with the package init function, wrapped inside of a GetStartCmd. I think this is a support for older commands.

Also, could you please explain what do we gain from returned structs? Are we really sure we need structs here? Could easily return Close function itself, but that's not the point here.
Passing context enables us to initiate shutdown across multiple running requests or commands, simultaneously.

Anyway, as docs say: the same Context may be passed to functions running in different goroutines; Contexts are safe for simultaneous use by multiple goroutines.
This is the exact thing we need here for a graceful shutdown.

Keep in mind that our services, that were affected by these changes, also block and wait for interrupt signals.

@@ -98,12 +102,14 @@ func GetSignerInfo(pub crypto.PubKey, priv []byte, cdc *codec.Codec) ValidatorAc

func main() {
cdc := app.MakeCodec()
ctx := server.NewDefaultContext()
serverCtx := server.NewDefaultContext()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable change is creating a lot of new code.

Copy link
Contributor Author

@momoshell momoshell Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is passed to only two functions from the main itself, not sure I get this comment.

Also, we could merge server context and standard context with context.WithValue, where server context would be nested.

}

// ServeCommands will generate a long-running rest server
// (aka Light Client Daemon) that exposes functionality similar
// to the cli, but over rest
func ServeCommands(cdc *codec.Codec, registerRoutesFn func(*lcd.RestServer)) *cobra.Command {
func ServeCommands(shutdownCtx ctx.Context, cdc *codec.Codec, registerRoutesFn func(ctx client.CLIContext, mux *mux.Router)) *cobra.Command {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should pass this shutdownCtx to each command because the shutdown signal should be something specific to each execution of a command, this is, something to be created inside the Run command functions.

Could we do:

RunE: func(cmd *cobra.Command, args []string) error {
	shutdownCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
 	err := StartRestServer(shutdownCtx, cdc, registerRoutesFn, restCh)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, each command would have created a context in its scope, where needed of course?

I could be wrong, but when we run our binary, each cmd should execute with the same context, since our main function is responsible for each execution of our defined commands?

Threading context down is a standard mechanism, not sure what I'm getting wrong here.

registerRoutesFn(rs)
const shutdownTimeout = 10 * time.Second

func StartRestServer(mainCtx ctx.Context, cdc *codec.Codec, registerRoutesFn func(ctx client.CLIContext, mux *mux.Router), restCh chan struct{}) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to call it mainCtx and not just ctx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done, fairly simple to change

@@ -66,11 +219,11 @@ func ServeCommands(cdc *codec.Codec, registerRoutesFn func(*lcd.RestServer)) *co
}

// RegisterRoutes register routes of all modules
func RegisterRoutes(rs *lcd.RestServer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this?

Copy link
Contributor Author

@momoshell momoshell Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That RestServer struct is imported from Cosmos SDK. For us, to be able to shut down the REST server properly, we needed access to the listener field, which is private in this case.
Not copying the RestServer struct from the package gave us control, and as an added bonus we got that logger that was also private.

Now, since the RestServer was not used in our logic for starting the server itself, the signature of RegisterRoutes was changed accordingly. And there is no need for a whole struct to be passed, since RegisterRoutes needs only two params.

@sh3ll3x3c
Copy link

sh3ll3x3c commented Dec 7, 2021

@ferranbt we've cleaned up what was discussed, should be good to go now...

@0xSasaPrsic
Copy link
Contributor

Hey @ferranbt we spotted a bug during heimdall initialization so will make that fix in this PR.

@0xSasaPrsic
Copy link
Contributor

@ferranbt we made fix to heimdall initialization so it should be ready for review again.

@momoshell momoshell merged commit 9d918a9 into maticnetwork:v2 Dec 8, 2021
@momoshell momoshell deleted the pos-86 branch December 8, 2021 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants