Restart captive core when a new version of core is detected on disk #3602

Closed
ire-and-curses opened this issue May 17, 2021 · 0 comments · Fixed by #3687

What problem does your feature solve?

Horizon running in captive-core mode spawns stellar-core as a subprocess. If the stellar-core package is upgraded after Horizon has started, the core binary changes on disk, but the captive-core child process is NOT restarted.

As a result, the running core version can differ from the installed version. This mismatch is confusing and hard to spot, and it can have an unexpected impact on services. An unscheduled reboot can also cause surprising behaviour some time after the update, since the new binary only starts running at that point.

In the case of a network-wide protocol upgrade, the operator may have updated the captive-core package, yet the running captive-core process may still be a version of the core binary that does not support the new protocol, leading to surprise downtime.

Forgetting to stop and start the binary has caught us out in our own deployments more than once, so it is likely a problem for others too.

Restarting a service when its package is upgraded is standard behaviour in the Linux package world (see e.g. Postgres). To control upgrade timing, the best-practice approach is to hold packages. We believe this should be familiar to operators, but we can consider providing an informative message in the Horizon logs.

What would you like to see?

Horizon monitors captive-core on disk, and restarts captive-core if it detects a change (e.g. a more recent file timestamp for the captive-core binary). This was discussed internally at SDF (doc).
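
As a rough illustration of the detection step only, here is a minimal Python sketch. It is not Horizon's implementation: `BinaryWatcher`, `fingerprint` and the `on_change` callback are hypothetical names, and comparing modification time plus file size is just one possible way to notice that the binary was replaced.

```python
import os
from typing import Callable, Tuple

# Hypothetical fingerprint: (modification time in ns, size in bytes).
Fingerprint = Tuple[int, int]


def fingerprint(path: str) -> Fingerprint:
    """Return a cheap fingerprint of the file at `path`."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


class BinaryWatcher:
    """Polls a binary on disk and fires a callback when it changes."""

    def __init__(self, path: str, on_change: Callable[[], None]) -> None:
        self.path = path
        self.on_change = on_change
        # Remember what the binary looked like when we started.
        self.last = fingerprint(path)

    def poll(self) -> bool:
        """Check once; return True if a change was detected."""
        try:
            current = fingerprint(self.path)
        except FileNotFoundError:
            # Assumption: the file may be briefly absent mid-upgrade,
            # so we wait for the next poll instead of restarting.
            return False
        if current == self.last:
            return False
        self.last = current
        self.on_change()  # e.g. stop and respawn the subprocess
        return True
```

In a long-running service, `poll()` would be called on a timer, and `on_change` would be the place to stop the child process, start it again from the new binary, and log an informative message.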

This is a break with previous behaviour. We should document this change.

There is a risk that some users will experience unscheduled downtime as a result. However, the latest core release candidate (17.1.0rc1) shrinks core sync time from over 5 minutes to around 30 seconds (see stellar/stellar-core#2960), so in this case the downtime will be small. Also, auto-updating Horizon is a bad idea, and forgetting to hold the package is a mistake an operator only makes once.

What alternatives are there?

  • Keep things the way they are
  • Push responsibility for detecting changes on disk to operators (who would need additional tooling, e.g. Puppet)
  • It might be possible to build a meta package that forces the dependency to update in a post-installation step