Restart captive core when a new version of core is detected on disk #3602

ire-and-curses · 2021-05-17T18:09:21Z

What problem does your feature solve?

Horizon running in captive-core mode spawns stellar-core as a subprocess. If the stellar-core package is upgraded after Horizon has started, the core binary changes on disk but the captive-core child process is NOT restarted.

As a result, the running core version can differ from the installed version. This behaviour is confusing, non-obvious and hard to spot, and it can have an unexpected impact on services. An unscheduled reboot can also cause surprising behaviour some time after the update, since the new binary is only picked up once the process restarts.

In the case of a network-wide protocol upgrade, the operator may have updated the captive-core version on disk, yet the running captive-core may still be a version of the core binary that does not support the new protocol, leading to surprise downtime.

Forgetting to stop/start the binary has caught us out in our own deployments more than once, so it is likely a problem for others too.

Restarting services on upgrade is standard behaviour in the Linux package world (see e.g. Postgres). To control upgrade timing, the best-practice approach is to hold packages. We believe this should be familiar to operators, but we can consider providing an informative message in the Horizon logs.

What would you like to see?

Horizon monitors the captive-core binary on disk and restarts captive-core if it detects a change (e.g. a more recent file timestamp for the captive-core binary). This was discussed internally at SDF (doc).

This is a break with previous behaviour. We should document this change.

There is a risk that some users will experience unscheduled downtime as a result. However, the latest core release candidate (17.1.0rc1) shrinks core sync time from over 5 minutes to around 30s (see stellar/stellar-core#2960), so in this case downtime will be small. Also, auto-updating Horizon is a bad idea, and forgetting to hold the package is a mistake an operator only makes once.

What alternatives are there?

Keep things the way they are
Push responsibility for detecting changes on disk to operators (who would need additional tooling, e.g. Puppet)
It might be possible to build a meta package that forces the dependency to update in a post-installation step

ire-and-curses added horizon feature request fast-txmeta labels May 17, 2021

bartekn added this to the Horizon 2.5.0 milestone May 27, 2021

tamirms self-assigned this Jun 10, 2021

tamirms mentioned this issue Jun 11, 2021

ingest/ledgerbackend: Restart captive core when a new version of core is detected on disk #3687

Merged

tamirms closed this as completed in #3687 Jun 14, 2021

tamirms mentioned this issue Sep 1, 2021

Auto reloading new versions of captive core relies on last modified timestamp which may be faulty #3882

Closed

tamirms mentioned this issue Apr 16, 2024

Add get_version_info endpoint stellar/stellar-rpc#132

Merged
