What problem does your feature solve?
Horizon running in captive-core mode spawns stellar-core as a subprocess. If the stellar-core package is upgraded after Horizon has started, the core binary changes on disk but the captive-core child process is NOT restarted.
As a result, the running core version can differ from the installed version. This behaviour is confusing, non-obvious and hard to spot, and it can have an unexpected impact on services. An unscheduled reboot can cause surprising behaviour some time after the update.
In the case of a network-wide protocol upgrade, the operator may have updated the captive-core package, yet the running captive-core might still be a version of the core binary that does not support the new protocol, leading to surprise downtime.
Forgetting to stop/start the binary has caught us out in our own deployments more than once, so it is likely a problem for others too.
Restarting services on upgrade is standard behaviour in the Linux package world (see e.g. Postgres). To control upgrade timing, the best-practice approach is to hold packages. We believe this should be familiar to operators, but we can consider providing an informative message in the Horizon logs.
What would you like to see?
Horizon monitors the captive-core binary on disk and restarts captive-core if it detects a change (e.g. a more recent file timestamp for the binary). This was discussed internally at SDF (doc).
This is a break with previous behaviour. We should document this change.
There is a risk that some users will experience unscheduled downtime as a result. However, the latest core release candidate (17.1.0rc1) shrinks core sync time from over 5 minutes to around 30s (see stellar/stellar-core#2960), so in this case the downtime will be small. Also, auto-updating Horizon is a bad idea, and forgetting to hold the package is a mistake an operator only makes once.
What alternatives are there?
Keep things the way they are
Push responsibility for detecting changes on disk to operators (who would need additional tooling, e.g. Puppet)
It might be possible to build a meta package that forces the dependency to update in a post-installation step