Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(overlord): reversed order of manager stop #447

Merged
merged 2 commits into from
Jul 30, 2024

Conversation

Ardelean-Calin
Copy link
Contributor

Currently managers are started and stopped in a FIFO (first in, first out) order, but I would expect managers to behave more like a stack, in that they are started and stopped in a LIFO (last in first out) order. So the stopping of the managers should be done in reverse.

See this image:
image

Why? Because what if manager M2 started after manager M1 depends on some functionality from M1? Stopping M1 before stopping M2 might lead to unexpected and undefined behavior.

I chose to implement this behavior with the defer keyword as it behaves in a LIFO manner, but this could also be done by reverse iterating over the slice. I found the defer approach to be more concise, but feel free to correct me.

@benhoyt could you please take a look at this and advise me if I'm missing something?

@flotter
Copy link
Contributor

flotter commented Jul 11, 2024

@Ardelean-Calin Just a note here that the state engine is cross team maintained between pebble, snapd and some other teams (workspace engineering @dmitry-lyfar), so we should make sure the changes we make is double checked with them as well, otherwise a future cross-repo sync of the state code could slip in some changes that break existing critical functioning systems.

Note that the TaskRunner state-engine entry is explicitly done last for startup, and stop today. The task runner is special in that stopping it will kill and wait for all task tombs in the system to exit (block). Inverting this order will place task kill & wait first, which means state engine managers interacting with tasks will possibly observe a slightly different pattern, where a task can die not by the manager's own intent, but by the Task runner shutting down. This could be no issue, but I am pointing this out just to say that this very trivial change could expose a different execution path for snapd / pebble on shutdown, for interaction between managers and the task runner. Probably just needs some thorough testing on existing code bases.

@dmitry-lyfar
Copy link
Contributor

@Ardelean-Calin could you give an example of such a M1->M2 dependency that justifies the order change? My understanding is that managers interact solely via the state but do not depend on each other. Except for the mentioned case of the taskrunner that must be added last.

@benhoyt
Copy link
Contributor

benhoyt commented Jul 12, 2024

This does seem like a reasonable idea to me. FWIW, I've tried making this change on the snapd codebase, and the go test tests still seem to pass (though there were a lot of failing tests due to me not having installed a bunch of deps, but I don't think that's related). I've asked the snapd team what they think and I'll report back.

@thp-canonical
Copy link
Contributor

My understanding is that managers interact solely via the state but do not depend on each other.

The linked "solely via the state" quote seems to relate to .Ensure() calls at runtime, not necessarily for shutdown time (this quote is from Pebble to get nice inline source snippets in this comment, but snapd has the same text, as the two codebases are related):

// Most of the actual work performed by the state engine is in fact done
// by the individual managers registered. These managers must be able to
// cope with Ensure calls in any order, coordinating among themselves
// solely via the state.

Of course, there could be some hidden assumptions and code relying on the current Stop() call order, so we have to be careful here, but in general, teardown is usually in reverse order of setup calls (compare also: C++ inheritance and member destruction; defer in Go; RAII and destruction order of local (stack) variables in C++; getting onto a bus and sitting down, followed by standing up first before getting off the bus).

@dmitry-lyfar
Copy link
Contributor

@thp-canonical I'm not opposing the order's correctness but rather changing it and introducing such coupling that you have to depend on order in addition to already existing dependency on the taskrunner's order.

@pedronis
Copy link

I'm not against this. snapd is quite deliberate in the order in which it initializes managers, I don't think we do anything that requires precise ordering of Stops, but as I said it shouldn't be a problem either. You'll have to write precise tests though to make sure your managers are initialized in the order you need, as otherwise it seems this can get fragile. As people have mentioned the original idea was that managers would be orthogonal to each other, in practice some managers depend on others and that is setup by passing one to the other at initialization but then as you have seen the StateEngine is not aware of these except indirectly by seeing the AddManager order.

@benhoyt benhoyt changed the title feat(state): reversed order of manager stop feat(overlord): reversed order of manager stop Jul 23, 2024
@benhoyt
Copy link
Contributor

benhoyt commented Jul 23, 2024

Okay, with Samuele being "not against this", I'd like to merge this. Managers shouldn't generally be dependent on each other, but stopping them in reverse order definitely seems least surprising (I've added a comment to that effect).

I've also modified the code to directly iterate through the slice in reverse, rather than using defer in a loop to do the reversal (that is cute but seems less obvious to me).

As a drive-by I tweaked the (existing) debug log message to use our usual formatting: start with a capital letter, full phrases/sentences.

Ideally after we merge we can put up a similar PR against snapd.

Any final thoughts before I merge, @Ardelean-Calin or others?

@Ardelean-Calin
Copy link
Contributor Author

Ardelean-Calin commented Jul 30, 2024

Sorry for missing this thread! I don't have any more comments, will take care of the snapd tests. 😀 Thanks @benhoyt for the additional changes and feedback!

@benhoyt benhoyt merged commit 2aca1ff into canonical:master Jul 30, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants