Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC: auto-restart services failed within okaydelay only when startup enabled #513

Conversation

IronCore864
Copy link
Contributor

@IronCore864 IronCore864 commented Nov 7, 2024

This is a PoC where we auto-restart failed services only when startup is enabled.

Manual test:

Layer:

summary: a simple layer
services:
  test:
    override: replace
    command: bash -c "sleep 0.1; exit 0"
    startup: enabled

pebble run / pebble replan / pebble start test:

2024-11-07T10:05:52.613Z [pebble] Started daemon.
2024-11-07T10:05:52.624Z [pebble] POST /v1/services 2.4065ms 202
2024-11-07T10:05:52.626Z [pebble] Service "test" starting: bash -c "sleep 0.1; exit 0"
2024-11-07T10:05:52.736Z [pebble] Service "test" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2024-11-07T10:05:52.739Z [pebble] Change 1 task (Start service "test") failed: cannot start service: exited quickly with code 0
2024-11-07T10:05:52.744Z [pebble] GET /v1/changes/1/wait 119.344416ms 200
2024-11-07T10:05:52.744Z [pebble] Started default services with change 1.
2024-11-07T10:05:53.273Z [pebble] Service "test" starting: bash -c "sleep 0.1; exit 0"
2024-11-07T10:05:53.376Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-07T10:05:53.376Z [pebble] Service "test" on-success action is "restart", waiting ~1s before restart (backoff 2)
2024-11-07T10:05:54.414Z [pebble] Service "test" starting: bash -c "sleep 0.1; exit 0"
2024-11-07T10:05:54.521Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-07T10:05:54.521Z [pebble] Service "test" on-success action is "restart", waiting ~2s before restart (backoff 3)
2024-11-07T10:05:56.701Z [pebble] Service "test" starting: bash -c "sleep 0.1; exit 0"
2024-11-07T10:05:56.806Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-07T10:05:56.806Z [pebble] Service "test" on-success action is "restart", waiting ~4s before restart (backoff 4)

Behaviour for services not configured with startup: enabled isn't affected.

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things about this PoC:

  • This modifies the "autorestart" API command, but I don't think that's what Harry was suggesting. Rather, the autoRestart=true mode should be enabled if startup: enabled was specified in the Pebble plan, regardless of which start API command is used.
  • You shouldn't need to modify the low-level Task struct. You can Set arbitrary data on the task already. But even that I don't think we need -- the servstate manager already has a copy of the plan, so only the service name is needed in the task. Then in doStart you'd check if config.Startup == plan.StartupEnabled.

@IronCore864 IronCore864 force-pushed the restart-services-failed-within-okaydelay-with-startup-enabled-2 branch from 8c89e62 to 8710543 Compare November 7, 2024 10:15
@IronCore864
Copy link
Contributor Author

Force-pushed a new commit according to Ben's comment.

@hpidcock
Copy link
Member

hpidcock commented Nov 8, 2024

This isn't what I had in mind. I was expecting we would transition directly to running (instead of starting+okDelay timer) and have that be the flow. This change ignores all the other on-failure/on-success actions if it finishes in 1s.

@hpidcock
Copy link
Member

hpidcock commented Nov 8, 2024

diff --git a/internals/overlord/servstate/handlers.go b/internals/overlord/servstate/handlers.go
index 24172e0..09286d7 100644
--- a/internals/overlord/servstate/handlers.go
+++ b/internals/overlord/servstate/handlers.go
@@ -331,9 +331,13 @@ func (s *serviceData) start() error {
                if err != nil {
                        return err
                }
-               s.transition(stateStarting)
-               time.AfterFunc(okayDelay, func() { logError(s.okayWaitElapsed()) })
-
+               if s.config.Startup == plan.StartupEnabled {
+                       s.started <- nil
+                       s.transition(stateRunning)
+               } else {
+                       s.transition(stateStarting)
+                       time.AfterFunc(okayDelay, func() { logError(s.okayWaitElapsed()) })
+               }
        default:
                return fmt.Errorf("cannot start service while %s", s.state)
        }

@benhoyt
Copy link
Contributor

benhoyt commented Nov 8, 2024

Yeah, what @hpidcock said sounds better (and simpler). Probably also easier to reason about and document, as we can document it in terms of states / state transitions too.

@IronCore864
Copy link
Contributor Author

Closing this PoC PR after an internal discussion with Ben and Haryr, I will create a new one to according to the discussion.

@IronCore864 IronCore864 deleted the restart-services-failed-within-okaydelay-with-startup-enabled-2 branch November 11, 2024 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants