Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(plan): use lanes to start services with dependencies #437

Conversation

IronCore864
Copy link
Contributor

@IronCore864 IronCore864 commented Jun 26, 2024

When starting services, put services that depend on each other (before/after/requires) in the same lane. If there is no dependency, put it in a stand-alone lane. Tasks only wait for tasks defined in the same lane.

Fixes #231.

Logic

The logic is as follows:

  • get services from plan so that before/after/requires can be fetched
  • if a service doesn't depend on other services or isn't dependent by other services (it doesn't have before/after/requires, and it doesn't show up in other services' before/after/requires), put it in a single lane
  • if a service has dependencies (before/after/requires), all these dependencies will be put in the same lane as the service itself
  • tasks only wait for tasks of the same lane. Since the input is already sorted, the latter tasks of the same lane only need to wait for the previous task in the same lane.

Tests

I mainly ran three types of test. The first is the one mentioned in #231:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
  b:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
  c:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

With this feature, service C won't be in "hold" status, but rather, started. Logs:

2024-07-23T17:24:56.172Z [pebble] Started daemon.
2024-07-23T17:24:56.182Z [pebble] POST /v1/services 2.781792ms 202
2024-07-23T17:24:56.185Z [pebble] Service "a" starting: sh -c "sleep 999"
2024-07-23T17:24:56.185Z [pebble] Service "b" starting: sh -c "exit 123"
2024-07-23T17:24:56.186Z [pebble] Service "c" starting: sh -c "sleep 999"
2024-07-23T17:24:56.189Z [pebble] Change 1 task (Start service "b") failed: cannot start service: exited quickly with code 123
2024-07-23T17:24:57.190Z [pebble] GET /v1/changes/1/wait 1.005492792s 200
2024-07-23T17:24:57.190Z [pebble] Started default services with change 1.
2024-07-23T17:25:01.404Z [pebble] GET /v1/services?names= 120.625µs 200
^C2024-07-23T17:25:21.617Z [pebble] Exiting on interrupt signal.
2024-07-23T17:25:21.621Z [pebble] Stopping all running services.
2024-07-23T17:25:21.628Z [pebble] Service "c" stopped
2024-07-23T17:25:21.628Z [pebble] Service "a" stopped

As a second test, if C depends on B, C is still in "hold" because they are in the same lane:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
  b:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
  c:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
    requires:
      - b

Logs:

2024-07-23T17:25:37.799Z [pebble] Started daemon.
2024-07-23T17:25:37.810Z [pebble] POST /v1/services 2.859792ms 202
2024-07-23T17:25:37.813Z [pebble] Service "a" starting: sh -c "sleep 999"
2024-07-23T17:25:37.813Z [pebble] Service "b" starting: sh -c "exit 123"
2024-07-23T17:25:37.816Z [pebble] Change 1 task (Start service "b") failed: cannot start service: exited quickly with code 123
2024-07-23T17:25:38.819Z [pebble] GET /v1/changes/1/wait 1.005558292s 200
2024-07-23T17:25:38.820Z [pebble] Started default services with change 1.
2024-07-23T17:25:47.218Z [pebble] GET /v1/services?names= 103.208µs 200
^C2024-07-23T17:26:19.741Z [pebble] Exiting on interrupt signal.
2024-07-23T17:26:19.745Z [pebble] Stopping all running services.
2024-07-23T17:26:19.752Z [pebble] Service "a" stopped

As a third test, I tested the "requires + before" combination:

services:
  a:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
    requires:
      - b
    before:
      - b
  b:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

Logs:

2024-07-23T17:26:23.145Z [pebble] Started daemon.
2024-07-23T17:26:23.154Z [pebble] POST /v1/services 2.873792ms 202
2024-07-23T17:26:23.157Z [pebble] Service "a" starting: sh -c "exit 123"
2024-07-23T17:26:23.161Z [pebble] Change 1 task (Start service "a") failed: cannot start service: exited quickly with code 123
2024-07-23T17:26:23.167Z [pebble] GET /v1/changes/1/wait 12.322667ms 200
2024-07-23T17:26:23.167Z [pebble] Started default services with change 1.
2024-07-23T17:26:26.642Z [pebble] GET /v1/services?names= 583.833µs 200
^C2024-07-23T17:26:52.384Z [pebble] Exiting on interrupt signal.

@IronCore864
Copy link
Contributor Author

IronCore864 commented Jun 26, 2024

Not in the scope of this PR (or the issue #231 we are trying to fix), but I'd like to point out the painful experience of using requires/before/after when implementing and testing this feature.

This PR is my second one on this feature because, in the first draft, I made the mistake of assuming that "when handling task B, if B depends on task A, A should have already been processed, since the input is sorted already, so, simply put B in the lane where A already is." I ignored the possibility that you can configure A to require B but A before B, for example, the test case 3 mentioned above:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
    requires:
      - b
    before:
      - b
  b:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

The logic in my first draft won't work, because the sorted order is (A, B), so when handling A, B's lane isn't known yet; and when handling B, since B doesn't require anything, you don't know which lane to put it in. To solve this problem, I created the second solution in this PR, where if A's dependency isn't known yet, create a lane, and put all dependencies in the same lane even before the task is created.

However, after giving this test case more thought, I concluded that the test case makes no sense, because "A requires B, but A must be started before B" is illogical:

  • If "A requires B", it means "if B doesn't start, A won't". Or to put it another way, "B must start before A".
  • If "A must be started before B", it means "B requires A", not the other way around.

So, saying that "A requires B" and "A must start before B" is illogical, the two parts counter each other.

I tried but failed to think of any use case where "A requires B", but "A must be started before B". Some analogies:

  • Let's say we want to create a CDN (say, with AWS CloudFront, as service A), and the origin server is hosted in an AWS S3 bucket (service B). This whole "CDN" product can only work when both CloudFront and S3 are present, but you must create S3 before creating CloudFront because CloudFront needs to know the S3 URL to load content from. So, CloudFront depends on S3 (A requires B), and you must create S3 (B) before creating CloudFront (A).
  • A web app comprises a backend (service A) and a database (B), and A requires B because the backend reads/writes to the DB. But you must start A before B? Starting A when B isn't started doesn't make sense because the whole product isn't ready for service anyway.
  • Yet another analogy: I want to create an EC2 instance (A) but it requires an IAM role (B). The IAM role must be created first because when creating the EC2 instance, I need to assign the role to the instance. So, the EC2 requires the IAM role (A requires B) and the IAM role must be created before the EC2 (A must start AFTER, not BEFORE, B).

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

I prefer renaming "requires" to "depends-on" because "requires" in many cases isn't right:

  • In the CDN analogy, the CloudFront doesn't "require" the S3 origin server. CloudFront can be created with an empty or invalid origin server. It's the CDN (the whole product that requires S3).
  • In the web app analogy, the backend also doesn't really "require" the DB; without the DB, when accessing the backend service, it would throw some error saying data can't be loaded/persisted. It's the whole web app product that requires the DB.

If we drop "before/after" and leave only "depends-on", the layer in test case 3 can be fixed to:

services:
    a:
        override: replace
        command: /bin/sh -c "echo test1; sleep 10"
        startup: enabled

    b:
        override: replace
        command: /bin/sh -c "echo test2; sleep 10"
        depends-on:
          - a

And this is more than enough to solve any dependency requirements. Proof: Terraform does the same thing, even the name "depends-on" is borrowed from Terraform.

If we can do this, after sorting, the order is (A, B) and when handling B, A's lane is already decided. In this case, the feature in this PR can be much simpler:

# pseudo
for task in tasks:
  if task.depends_on:
    task.lane = task.depends_on.lane
    task.wait_for(task.depends_on)

@IronCore864 IronCore864 marked this pull request as ready for review June 27, 2024 03:23
@IronCore864 IronCore864 requested a review from benhoyt June 27, 2024 03:24
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this when you're back. But per my comment, I think this would fit better in plan.StartOrder (and have that return a list of independent lanes).

internals/overlord/servstate/request.go Outdated Show resolved Hide resolved
internals/overlord/servstate/request.go Outdated Show resolved Hide resolved
@benhoyt
Copy link
Contributor

benhoyt commented Jul 2, 2024

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think you're right -- it's overkill. And I and others have been confused by requires/before/after several times. There's more in the README here.

"Requires" basically means depends-on, but doesn't specify any order of starting. "Before" and "after" deal with the ordering.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

That said, I think that ship has sailed -- in other words, it's too late to change it now. We can probably refine before/after/requires, but not remove or change them significantly.

Let's chat about all this further when you're back, and then you can do a second iteration.

@IronCore864 IronCore864 changed the title feat: use lanes to start services with dependencies feat(plan): use lanes to start services with dependencies Jul 23, 2024
@IronCore864
Copy link
Contributor Author

IronCore864 commented Jul 24, 2024

Using three keywords requires/before/after to handle dependency is an overkill, which brings more problems than solves.

I think you're right -- it's overkill. And I and others have been confused by requires/before/after several times. There's more in the README here.

"Requires" basically means depends-on, but doesn't specify any order of starting. "Before" and "after" deal with the ordering.

I think we can drop "before/after", and rename "requires" to something like "depend-on" (depend-on means requires + after, in my book).

That said, I think that ship has sailed -- in other words, it's too late to change it now. We can probably refine before/after/requires, but not remove or change them significantly.

Let's chat about all this further when you're back, and then you can do a second iteration.

Refactored according to the above suggestion. Key changes:

  • Move the logic with lanes from servstate to plan.
  • Add UT.

Some explanation:

  • In the plan package, when assembling lanes [][]string, it makes sure services with dependencies (requires) will be put in the same lane, but the lanes are not created here. Because to use state.State.NewLane(), a State must be passed to the Plan and using State.NewLane() requires lock. Different indices here mean services of the same index will be put in the same lane, but the indices are not lane IDs since lanes are not created here yet.
  • The lanes are created in servestate.Start (and Stop) when creating tasks, where the lock is already aqquired.
  • plan.StartOrder (and StopOrder) used to return the order helper function directly, but since the helper func is also used elsewhere, I did not put the lane assembling part into order, but created a wrapper createLanes.

@IronCore864 IronCore864 requested a review from benhoyt July 24, 2024 04:47
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This work looks good overall! Just requesting some minor changes.

internals/daemon/api_services.go Outdated Show resolved Hide resolved
internals/daemon/api_services.go Outdated Show resolved Hide resolved
internals/daemon/api_services.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/request_test.go Outdated Show resolved Hide resolved
internals/plan/plan.go Outdated Show resolved Hide resolved
@IronCore864 IronCore864 requested a review from benhoyt July 26, 2024 02:08
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Looking good now, and I've done some basic manual testing too.

@IronCore864 IronCore864 merged commit 48dc445 into canonical:master Jul 26, 2024
16 checks passed
@IronCore864 IronCore864 deleted the using-lanes-to-start-services-with-dependencies branch July 26, 2024 06:16
jujubot added a commit to juju/juju that referenced this pull request Aug 29, 2024
#17991

Per discussion with @hpidcock, update Juju 3.6 to use the just-released Pebble v1.16.0 ([changelog](https://github.com/canonical/pebble/releases/tag/v1.16.0)). Summary of main changes:

* Improvements to how services with dependencies are started and stopped (using "lanes"), so that independent services are started in parallel, and dependent services start up serially. Error handling is also improved. Fix in canonical/pebble#437.
* A fix for a tricky bug in pebble exec, so it doesn't lose output in interactive mode. Fix in canonical/pebble#466!
 prdesc Reduce the size of exec tasks by not storing the execSetup (which includes all environment variables) in state. This speeds up (and shrinks the data) when serialising state to disk during state.Unlock. Fix is in canonical/pebble#478.
 prdesc Ensure Pebble doesn't hang (with no error message) when the state file directory is read-only or otherwise inaccessible. Fix in canonical/pebble#481.
* Re-implement warnings using notices. This was always the intention since the notices feature was added (it was designed as a superset of warnings), but we'd never gotten to it. Lots of nice code deletion in canonical/pebble#486.

## QA steps

Deploy a K8s charm, like `snappass-test`, and ensure `pebble version --client` on the workload reports v1.16.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error starting one service causes others in task set to not start
2 participants