feat(plan): use lanes to start services with dependencies #437

IronCore864 · 2024-06-26T08:32:44Z

When starting services, put services that depend on each other (before/after/requires) in the same lane. If there is no dependency, put it in a stand-alone lane. Tasks only wait for tasks defined in the same lane.

Fixes #231.

Logic

The logic is as follows:

get services from plan so that before/after/requires can be fetched
if a service doesn't depend on other services or isn't dependent by other services (it doesn't have before/after/requires, and it doesn't show up in other services' before/after/requires), put it in a single lane
if a service has dependencies (before/after/requires), all these dependencies will be put in the same lane as the service itself
tasks only wait for tasks of the same lane. Since the input is already sorted, the latter tasks of the same lane only need to wait for the previous task in the same lane.

Tests

I mainly ran three types of test. The first is the one mentioned in #231:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
  b:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
  c:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

With this feature, service C won't be in "hold" status, but rather, started. Logs:

2024-07-23T17:24:56.172Z [pebble] Started daemon.
2024-07-23T17:24:56.182Z [pebble] POST /v1/services 2.781792ms 202
2024-07-23T17:24:56.185Z [pebble] Service "a" starting: sh -c "sleep 999"
2024-07-23T17:24:56.185Z [pebble] Service "b" starting: sh -c "exit 123"
2024-07-23T17:24:56.186Z [pebble] Service "c" starting: sh -c "sleep 999"
2024-07-23T17:24:56.189Z [pebble] Change 1 task (Start service "b") failed: cannot start service: exited quickly with code 123
2024-07-23T17:24:57.190Z [pebble] GET /v1/changes/1/wait 1.005492792s 200
2024-07-23T17:24:57.190Z [pebble] Started default services with change 1.
2024-07-23T17:25:01.404Z [pebble] GET /v1/services?names= 120.625µs 200
^C2024-07-23T17:25:21.617Z [pebble] Exiting on interrupt signal.
2024-07-23T17:25:21.621Z [pebble] Stopping all running services.
2024-07-23T17:25:21.628Z [pebble] Service "c" stopped
2024-07-23T17:25:21.628Z [pebble] Service "a" stopped

As a second test, if C depends on B, C is still in "hold" because they are in the same lane:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
  b:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
  c:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
    requires:
      - b

Logs:

2024-07-23T17:25:37.799Z [pebble] Started daemon.
2024-07-23T17:25:37.810Z [pebble] POST /v1/services 2.859792ms 202
2024-07-23T17:25:37.813Z [pebble] Service "a" starting: sh -c "sleep 999"
2024-07-23T17:25:37.813Z [pebble] Service "b" starting: sh -c "exit 123"
2024-07-23T17:25:37.816Z [pebble] Change 1 task (Start service "b") failed: cannot start service: exited quickly with code 123
2024-07-23T17:25:38.819Z [pebble] GET /v1/changes/1/wait 1.005558292s 200
2024-07-23T17:25:38.820Z [pebble] Started default services with change 1.
2024-07-23T17:25:47.218Z [pebble] GET /v1/services?names= 103.208µs 200
^C2024-07-23T17:26:19.741Z [pebble] Exiting on interrupt signal.
2024-07-23T17:26:19.745Z [pebble] Stopping all running services.
2024-07-23T17:26:19.752Z [pebble] Service "a" stopped

As a third test, I tested the "requires + before" combination:

services:
  a:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
    requires:
      - b
    before:
      - b
  b:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

Logs:

2024-07-23T17:26:23.145Z [pebble] Started daemon.
2024-07-23T17:26:23.154Z [pebble] POST /v1/services 2.873792ms 202
2024-07-23T17:26:23.157Z [pebble] Service "a" starting: sh -c "exit 123"
2024-07-23T17:26:23.161Z [pebble] Change 1 task (Start service "a") failed: cannot start service: exited quickly with code 123
2024-07-23T17:26:23.167Z [pebble] GET /v1/changes/1/wait 12.322667ms 200
2024-07-23T17:26:23.167Z [pebble] Started default services with change 1.
2024-07-23T17:26:26.642Z [pebble] GET /v1/services?names= 583.833µs 200
^C2024-07-23T17:26:52.384Z [pebble] Exiting on interrupt signal.

IronCore864 · 2024-06-26T09:23:32Z

Not in the scope of this PR (or the issue #231 we are trying to fix), but I'd like to point out the painful experience of using requires/before/after when implementing and testing this feature.

This PR is my second one on this feature because, in the first draft, I made the mistake of assuming that "when handling task B, if B depends on task A, A should have already been processed, since the input is sorted already, so, simply put B in the lane where A already is." I ignored the possibility that you can configure A to require B but A before B, for example, the test case 3 mentioned above:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
    requires:
      - b
    before:
      - b
  b:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

The logic in my first draft won't work, because the sorted order is (A, B), so when handling A, B's lane isn't known yet; and when handling B, since B doesn't require anything, you don't know which lane to put it in. To solve this problem, I created the second solution in this PR, where if A's dependency isn't known yet, create a lane, and put all dependencies in the same lane even before the task is created.

However, after giving this test case more thought, I concluded that the test case makes no sense, because "A requires B, but A must be started before B" is illogical:

If "A requires B", it means "if B doesn't start, A won't". Or to put it another way, "B must start before A".
If "A must be started before B", it means "B requires A", not the other way around.

So, saying that "A requires B" and "A must start before B" is illogical, the two parts counter each other.

I tried but failed to think of any use case where "A requires B", but "A must be started before B". Some analogies:

Let's say we want to create a CDN (say, with AWS CloudFront, as service A), and the origin server is hosted in an AWS S3 bucket (service B). This whole "CDN" product can only work when both CloudFront and S3 are present, but you must create S3 before creating CloudFront because CloudFront needs to know the S3 URL to load content from. So, CloudFront depends on S3 (A requires B), and you must create S3 (B) before creating CloudFront (A).
A web app comprises a backend (service A) and a database (B), and A requires B because the backend reads/writes to the DB. But you must start A before B? Starting A when B isn't started doesn't make sense because the whole product isn't ready for service anyway.
Yet another analogy: I want to create an EC2 instance (A) but it requires an IAM role (B). The IAM role must be created first because when creating the EC2 instance, I need to assign the role to the instance. So, the EC2 requires the IAM role (A requires B) and the IAM role must be created before the EC2 (A must start AFTER, not BEFORE, B).

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

I prefer renaming "requires" to "depends-on" because "requires" in many cases isn't right:

In the CDN analogy, the CloudFront doesn't "require" the S3 origin server. CloudFront can be created with an empty or invalid origin server. It's the CDN (the whole product that requires S3).
In the web app analogy, the backend also doesn't really "require" the DB; without the DB, when accessing the backend service, it would throw some error saying data can't be loaded/persisted. It's the whole web app product that requires the DB.

If we drop "before/after" and leave only "depends-on", the layer in test case 3 can be fixed to:

services:
    a:
        override: replace
        command: /bin/sh -c "echo test1; sleep 10"
        startup: enabled

    b:
        override: replace
        command: /bin/sh -c "echo test2; sleep 10"
        depends-on:
          - a

And this is more than enough to solve any dependency requirements. Proof: Terraform does the same thing, even the name "depends-on" is borrowed from Terraform.

If we can do this, after sorting, the order is (A, B) and when handling B, A's lane is already decided. In this case, the feature in this PR can be much simpler:

# pseudo
for task in tasks:
  if task.depends_on:
    task.lane = task.depends_on.lane
    task.wait_for(task.depends_on)

benhoyt

Let's discuss this when you're back. But per my comment, I think this would fit better in plan.StartOrder (and have that return a list of independent lanes).

internals/overlord/servstate/request.go

benhoyt · 2024-07-02T05:35:47Z

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think you're right -- it's overkill. And I and others have been confused by requires/before/after several times. There's more in the README here.

"Requires" basically means depends-on, but doesn't specify any order of starting. "Before" and "after" deal with the ordering.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

That said, I think that ship has sailed -- in other words, it's too late to change it now. We can probably refine before/after/requires, but not remove or change them significantly.

Let's chat about all this further when you're back, and then you can do a second iteration.

IronCore864 · 2024-07-24T04:41:14Z

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think you're right -- it's overkill. And I and others have been confused by requires/before/after several times. There's more in the README here.

"Requires" basically means depends-on, but doesn't specify any order of starting. "Before" and "after" deal with the ordering.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

That said, I think that ship has sailed -- in other words, it's too late to change it now. We can probably refine before/after/requires, but not remove or change them significantly.

Let's chat about all this further when you're back, and then you can do a second iteration.

Refactored according to the above suggestion. Key changes:

Move the logic with lanes from servstate to plan.
Add UT.

Some explanation:

In the plan package, when assembling lanes [][]string, it makes sure services with dependencies (requires) will be put in the same lane, but the lanes are not created here. Because to use state.State.NewLane(), a State must be passed to the Plan and using State.NewLane() requires lock. Different indices here mean services of the same index will be put in the same lane, but the indices are not lane IDs since lanes are not created here yet.
The lanes are created in servestate.Start (and Stop) when creating tasks, where the lock is already aqquired.
plan.StartOrder (and StopOrder) used to return the order helper function directly, but since the helper func is also used elsewhere, I did not put the lane assembling part into order, but created a wrapper createLanes.

benhoyt

This work looks good overall! Just requesting some minor changes.

internals/daemon/api_services.go

internals/overlord/servstate/manager.go

internals/overlord/servstate/manager_test.go

internals/overlord/servstate/request_test.go

internals/plan/plan.go

benhoyt

Thanks for this! Looking good now, and I've done some basic manual testing too.

@hpidcock

#17991 Per discussion with @hpidcock, update Juju 3.6 to use the just-released Pebble v1.16.0 ([changelog](https://github.com/canonical/pebble/releases/tag/v1.16.0)). Summary of main changes: * Improvements to how services with dependencies are started and stopped (using "lanes"), so that independent services are started in parallel, and dependent services start up serially. Error handling is also improved. Fix in canonical/pebble#437. * A fix for a tricky bug in pebble exec, so it doesn't lose output in interactive mode. Fix in canonical/pebble#466! prdesc Reduce the size of exec tasks by not storing the execSetup (which includes all environment variables) in state. This speeds up (and shrinks the data) when serialising state to disk during state.Unlock. Fix is in canonical/pebble#478. prdesc Ensure Pebble doesn't hang (with no error message) when the state file directory is read-only or otherwise inaccessible. Fix in canonical/pebble#481. * Re-implement warnings using notices. This was always the intention since the notices feature was added (it was designed as a superset of warnings), but we'd never gotten to it. Lots of nice code deletion in canonical/pebble#486. ## QA steps Deploy a K8s charm, like `snappass-test`, and ensure `pebble version --client` on the workload reports v1.16.0.

IronCore864 added 4 commits June 26, 2024 16:29

feat: use lanes when starting services

d0188b2

test: add test for newly added lane feature for starting services

9182407

test: fix failed manager test

29d17db

chore: fmt

d38d261

chore: refactor

ca1a9d4

IronCore864 marked this pull request as ready for review June 27, 2024 03:23

IronCore864 requested a review from benhoyt June 27, 2024 03:24

benhoyt requested changes Jul 2, 2024

View reviewed changes

internals/overlord/servstate/request.go Outdated Show resolved Hide resolved

internals/overlord/servstate/request.go Outdated Show resolved Hide resolved

IronCore864 changed the title ~~feat: use lanes to start services with dependencies~~ feat(plan): use lanes to start services with dependencies Jul 23, 2024

IronCore864 added 3 commits July 24, 2024 11:47

feat: move lanes logic from servstate to plan

5d6037f

chore: refactor and add tests

1ef278b

chore: fmt

b4ca4b7

IronCore864 requested a review from benhoyt July 24, 2024 04:47

benhoyt requested changes Jul 25, 2024

View reviewed changes

IronCore864 added 2 commits July 25, 2024 19:06

chore: refactor after code review

430b708

chore: refactor

37bc8ba

IronCore864 requested a review from benhoyt July 26, 2024 02:08

benhoyt approved these changes Jul 26, 2024

View reviewed changes

IronCore864 merged commit 48dc445 into canonical:master Jul 26, 2024
16 checks passed

IronCore864 deleted the using-lanes-to-start-services-with-dependencies branch July 26, 2024 06:16

benhoyt mentioned this pull request Aug 29, 2024

chore: update Pebble version to v1.16.0 juju/juju#17991

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(plan): use lanes to start services with dependencies #437

feat(plan): use lanes to start services with dependencies #437

IronCore864 commented Jun 26, 2024 •

edited

Loading

IronCore864 commented Jun 26, 2024 •

edited

Loading

benhoyt left a comment

benhoyt commented Jul 2, 2024

IronCore864 commented Jul 24, 2024 •

edited

Loading

benhoyt left a comment

benhoyt left a comment

feat(plan): use lanes to start services with dependencies #437

feat(plan): use lanes to start services with dependencies #437

Conversation

IronCore864 commented Jun 26, 2024 • edited Loading

Logic

Tests

IronCore864 commented Jun 26, 2024 • edited Loading

benhoyt left a comment

Choose a reason for hiding this comment

benhoyt commented Jul 2, 2024

IronCore864 commented Jul 24, 2024 • edited Loading

benhoyt left a comment

Choose a reason for hiding this comment

benhoyt left a comment

Choose a reason for hiding this comment

IronCore864 commented Jun 26, 2024 •

edited

Loading

IronCore864 commented Jun 26, 2024 •

edited

Loading

IronCore864 commented Jul 24, 2024 •

edited

Loading