Evict actors after 10 sec inactivity in Workerd #1138

MellowYarker · 2023-09-07T22:09:55Z

This behavior has been present in production for a while. Durable Objects with Hibernatable WebSockets should now be able to hibernate when using Wrangler or Miniflare.

Credit to Sam for adding the websocket test functions!

MellowYarker · 2023-09-07T22:10:48Z

src/workerd/server/server.c++

+        auto& actorContainer = actors.findOrCreate(id, [&]() mutable {
+          auto container = kj::heap<ActorContainer>(kj::str(id), *this, timer);
+
+          // TODO(sqlite): Now that actors are backed by real disk, we should shut them down after


@kentonv curious why a minute of inactivity was picked here. Does it matter if it's a minute vs. 10 sec?

There are two relevant timeouts:

Hibernation occurs after 10 seconds of internal inactivity, that is, 10 seconds of not having any non-hibernatable work scheduled inside the isolate. Clients may still be connected.

Eviction occurs after 60 seconds (or is it 70 seconds? I can never remember) of not having any clients connected.

We should make sure to cover both of these in this PR. For example, if someone uses setInterval() to schedule a periodic callback, this should prevent hibernation, and the actor should only be evicted after the 60-second interval. Is that covered currently? (Haven't had a chance to look at the code yet.)
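To make the two timeouts concrete, here is a minimal sketch in plain C++ (not workerd code). `ActorState`, `ActorFate`, and `decide` are hypothetical names, and the 70-second constant is an assumption, since the figure above is quoted as 60 or 70 seconds. It also assumes hibernation requires that every remaining client is hibernatable.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Assumed values; the eviction figure is quoted in the thread as 60 or 70 seconds.
constexpr Seconds kHibernationTimeout{10};
constexpr Seconds kEvictionTimeout{70};

enum class ActorFate { StayAwake, Hibernate, Evict };

// Hypothetical snapshot of an actor's state; none of these names come from workerd.
struct ActorState {
  int connectedClients;        // all clients, including hibernatable WebSockets
  int nonHibernatableClients;  // clients that cannot be hibernated
  bool workScheduled;          // e.g. a pending setInterval() callback
  Seconds idleFor;             // time since work was last scheduled in the isolate
  Seconds sinceLastClient;     // time since the last client disconnected
};

inline ActorFate decide(const ActorState& s) {
  // Eviction: only once there are no clients at all and the long timeout has elapsed.
  if (s.connectedClients == 0 && s.sinceLastClient >= kEvictionTimeout) {
    return ActorFate::Evict;
  }
  // Hibernation: no scheduled work, only hibernatable clients, short timeout elapsed.
  if (!s.workScheduled && s.nonHibernatableClients == 0 &&
      s.idleFor >= kHibernationTimeout) {
    return ActorFate::Hibernate;
  }
  return ActorFate::StayAwake;
}
```

Under this model, an actor with a pending interval callback and no clients stays awake until the long timeout elapses, then is evicted, which is the setInterval() case asked about above.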

Longer timeout isn't covered here, I can add it if we think it's worth doing now but the intention of this PR was to address #904 (DOs never evict in wrangler/miniflare when using ws hibernation, so state management bugs won't show up in testing but can in prod).

Hmm, I would argue the real goal is to simulate the behavior that would be seen in production, otherwise people can't properly test their code.

As this PR is written at present, what is the behavior in the case someone does setInterval()? Will the actor stay running forever to continue calling the interval callback, or will it be terminated after 10 seconds? (The production behavior is to keep running for 60/70 seconds after all clients have disconnected.)

I added a test (see second commit) that confirms if you call setInterval() then the 10 second inactivity timer will not start until the work is finished (i.e. clearInterval() is called); it'll run forever.

I also included a cleanup loop that erases from actors if getActorImpl() hasn't been called in 70 seconds, though this needs a bit more work. This is complicated a bit by the fact that we want to get rid of the actor instance, but keep the HibernationManager associated with it, and it feels like the top-level place to put that is in the entry of the actor map actors (since it's a member of ActorNamespace, which goes a bit too far). In other words, we don't want to delete the entry if it has hibernatable websockets.

I'll see if I can get around this by including a new HashMap of {ActorId -> HibernationManager} in ActorNamespace, that should let me fully remove the actor map entry.
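A rough sketch of that idea in standard C++, using `std::unordered_map` rather than KJ's HashMap. Every type here (`ActorNamespaceSketch`, `Actor`, `HibernationManager` as a plain struct) is a hypothetical stand-in, not the real workerd class; the point is only that the manager lives in its own map and so outlives the actor entry.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; the real types live in workerd and look different.
struct HibernationManager {
  int hibernatedSockets = 0;
};

struct Actor {
  std::shared_ptr<HibernationManager> manager;
};

class ActorNamespaceSketch {
public:
  // Creates the actor if needed, re-attaching any manager that survived an eviction.
  Actor& getActor(const std::string& id) {
    auto it = actors.find(id);
    if (it == actors.end()) {
      auto& manager = managers[id];
      if (!manager) manager = std::make_shared<HibernationManager>();
      it = actors.emplace(id, std::make_unique<Actor>(Actor{manager})).first;
    }
    return *it->second;
  }

  // Removes the actor entry entirely; the manager stays behind in its own map.
  void evict(const std::string& id) { actors.erase(id); }

  bool isResident(const std::string& id) const { return actors.count(id) != 0; }

private:
  std::unordered_map<std::string, std::unique_ptr<Actor>> actors;
  std::unordered_map<std::string, std::shared_ptr<HibernationManager>> managers;
};
```

With this layout, `evict("a")` followed by `getActor("a")` yields a fresh `Actor` that points at the same manager as before.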

The production behavior is to keep running for 60/70 seconds after all clients have disconnected.

How am I able to detect client disconnect here? I assume this is in reference to a client no longer holding a DO stub while the DO still has work scheduled, yes? In that case, it wouldn't be our RequestTracker that informs us that the client has disconnected, but something else (internally we use capabilities, but I don't think that happens in the Workerd implementation).

Do we need to detect if a WorkerInterface gets dropped?

I think there's some confusion here. The 70-second timeout only applies after there are no clients at all. If there is an open WebSocket, even if it's using hibernation, that counts as a client, and so the 70-second timeout doesn't apply. So I think the HibernationManager can in fact be destroyed when the 70-second timeout kicks in, because there are necessarily no hibernated WebSockets at that point.

More concretely: ActorNamespace::getActor() returns an Own<WorkerInterface>, which the caller will hold until it no longer needs it. I think the timeout you want here is: 70 seconds after all of these WorkerInterface instances have been dropped (and no new ones created), you want to destroy the actor.

The 10-second hibernation timeout is a bit different. This applies when no work is scheduled and all client connections are hibernatable.

Thanks for clarifying!

As of latest fixup, the cleanupLoop will not evict if we have any connected clients, and once the last client drops its Own<WorkerInterface> we'll set the lastAccess timestamp, and the cleanupLoop will eventually remove the entry from actors (unless another client connects).
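For readers following along, the mechanism described here can be sketched in standard C++ roughly as below. This is not the code in the PR: `ActorContainerSketch`, `ClientHandle`, and `cleanupOnce` are hypothetical names, the RAII handle stands in for the Own<WorkerInterface> a client holds, and the 70-second constant is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

using Seconds = std::chrono::seconds;
constexpr Seconds kEvictionTimeout{70};  // assumed; the thread quotes 60 or 70 seconds

// Hypothetical container; the names only loosely mirror the PR.
struct ActorContainerSketch {
  int clients = 0;
  Seconds lastAccess{0};
};

// RAII handle standing in for the Own<WorkerInterface> held by a client.
class ClientHandle {
public:
  ClientHandle(ActorContainerSketch& c, const Seconds& timeSource)
      : container(c), currentTime(timeSource) {
    ++container.clients;
  }
  ~ClientHandle() {
    assert(container.clients > 0);
    // The last client to leave stamps the time the cleanup pass compares against.
    if (--container.clients == 0) container.lastAccess = currentTime;
  }
  ClientHandle(const ClientHandle&) = delete;
  ClientHandle& operator=(const ClientHandle&) = delete;

private:
  ActorContainerSketch& container;
  const Seconds& currentTime;  // reference to a caller-owned fake clock
};

using ActorMap = std::map<std::string, ActorContainerSketch>;

// One pass of a cleanup loop: erase client-free entries that have been idle long enough.
inline int cleanupOnce(ActorMap& actors, Seconds now) {
  int evicted = 0;
  for (auto it = actors.begin(); it != actors.end();) {
    const auto& c = it->second;
    if (c.clients == 0 && now - c.lastAccess >= kEvictionTimeout) {
      it = actors.erase(it);
      ++evicted;
    } else {
      ++it;
    }
  }
  return evicted;
}
```

An entry is never erased while a handle exists, so a new client connecting before the timeout simply bumps the count and keeps the actor resident.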

Last obvious issue is there's a weird discrepancy between our internal eviction process and this one. Specifically this constructorFailedPaf promise is being rejected when I do the eviction by destroying the final Worker::Actor (which sets Worker::Actor::Impl to kj::none) in Workerd, but it doesn't get rejected in our internal hibernation tests (I think it's just cancelled as it doesn't resolve either?).

I assume that because we are destroying the constructorFailedPaf.fulfiller before we fulfill()/reject(), the promise is being rejected implicitly. It feels like onBroken() should be triggered when the actor actually breaks, not when we're evicting it for hibernation -- if the actor's onBroken() actually rejects then we should probably be removing the entry from actors even if we have hibernatable websockets (thereby disconnecting them).

This implies that when you destroy the Worker::Actor, someone is still waiting on a promise from actor.onBroken(). This goes against general KJ rules, which say that an object must not be destroyed while a Promise returned by one of its methods still exists. What you need to do here is cancel whatever it is that is waiting on the onBroken() promise before you destroy the Actor object.
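As an illustration of the ordering only, here is a sketch in standard C++ with no KJ promises involved. `ActorSketch`, `BrokenWaiter`, and `ContainerSketch` are hypothetical; the waiter stands in for whatever task is awaiting onBroken(), and destroying it stands in for cancelling that task.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Records the order of teardown so the effect is observable.
inline std::vector<std::string>& teardownLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical actor that lets one party listen for breakage.
struct ActorSketch {
  std::function<void()> onBrokenListener;
  ~ActorSketch() {
    // A listener still registered here means someone is waiting on a dying object.
    teardownLog().push_back(onBrokenListener ? "actor destroyed with waiter"
                                             : "actor destroyed cleanly");
  }
};

// Stands in for the task awaiting onBroken(); destroying it cancels the wait.
struct BrokenWaiter {
  explicit BrokenWaiter(ActorSketch& a) : actor(a) { actor.onBrokenListener = [] {}; }
  ~BrokenWaiter() {
    actor.onBrokenListener = nullptr;
    teardownLog().push_back("waiter cancelled");
  }
  ActorSketch& actor;
};

// Members are destroyed in reverse declaration order, so the waiter goes first.
struct ContainerSketch {
  std::unique_ptr<ActorSketch> actor = std::make_unique<ActorSketch>();
  std::unique_ptr<BrokenWaiter> waiter = std::make_unique<BrokenWaiter>(*actor);

  // Explicit eviction: cancel the waiter before dropping the actor.
  void evict() {
    waiter.reset();
    actor.reset();
  }
};
```

Declaring the waiter after the actor gives the right order on implicit destruction as well, since C++ destroys members in reverse declaration order.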


MellowYarker · 2023-09-11T17:38:24Z

Tests were failing because the actor ID was getting corrupted, plus a segfault in sql-test caused by the RequestTracker's hooks running after it was destroyed.

MellowYarker · 2023-09-12T00:24:56Z

Rebased to resolve conflicts.

bcaimano · 2023-09-12T16:29:59Z

src/workerd/server/server.c++

+        auto now = timer.now();
+        actors.eraseAll([&](auto&, kj::Own<ActorContainer>& entry) {
+
+          // TODO(now): What do we do if the actor is still in memory, but the last access time


Hmmmm, I'm not fully up to date on this code but can you attach some sort of guard to promises we return for working with the actor? I notice onActorBroken() below at least.

Sorry, mind expanding a bit? You mean like a try...catch?

Ugh, I've lost context because of the outdated commit. I suspect I was trying to say that you could track whether the actor was evicted by checking whether anybody was waiting on onActorBroken().


kentonv · 2023-09-18T20:18:52Z

src/workerd/server/server.c++

+          KJ_IF_SOME(m, a->getHibernationManager()) {
+            // The hibernation manager needs to survive actor eviction and be passed to the actor
+            // constructor next time we create it.
+            manager = kj::addRef(*static_cast<HibernationManagerImpl*>(&m));


This downcast looks like it's making a lot of assumptions, but I see similar downcasts exist elsewhere in the code too. I guess HibernationManagerImpl is the only allowed implementation of HibernationManager? If that's the case, could we possibly remove the inheritance and just have a single concrete type called HibernationManager?

This cleanup could be in a separate PR since it's not really a new issue introduced here.


MellowYarker · 2023-09-22T13:50:39Z

api:sql-test is failing in CI with a segfault but not locally, will dig into it later.


MellowYarker · 2023-09-26T14:08:07Z

Adding a reminder to myself to also add a flag for Miniflare that prevents eviction.

Edit: Done in the 3rd commit


This behavior has been present in production for a while. Durable Objects with Hibernatable WebSockets should now be able to hibernate when using Wrangler or Miniflare. Credit to Sam for adding the websocket test functions! Co-authored-by: Sam Merritt <smerritt@cloudflare.com>

This is separate from the 10 second inactivity eviction.

Miniflare depends on Durable Objects staying in memory forever. This commit provides a way to ensure a DO namespace cannot be evicted (unless it is broken), thereby retaining the old behavior.
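The rule in this commit message can be written down as a small predicate. This is a sketch under assumptions, not the code in the commit: `EvictionInputs` and `shouldRemove` are hypothetical names, and the only behavior taken from the message is that a namespace with eviction prevented keeps its objects unless they are broken.

```cpp
// Hypothetical per-actor facts a cleanup pass might consult.
struct EvictionInputs {
  bool preventEviction;     // namespace-level opt-out, as Miniflare would set
  bool broken;              // the actor has failed and cannot be reused
  bool hasClients;          // at least one client is still connected
  bool idleTimeoutElapsed;  // the relevant inactivity timeout has passed
};

inline bool shouldRemove(const EvictionInputs& in) {
  if (in.broken) return true;            // broken actors are removed regardless of the flag
  if (in.preventEviction) return false;  // otherwise the flag pins the actor in memory
  return !in.hasClients && in.idleTimeoutElapsed;
}
```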

MellowYarker · 2023-10-11T19:58:35Z

https://github.com/cloudflare/workerd/compare/f7ccfe149441d007ee077be370040316f52d1381..d60108186837c2e074893c51e678bfab8e18db9a is the relevant diff to get the Mac and Windows builds to pass.

ohodson · 2023-10-11T20:07:16Z

f7ccfe149441d007ee077be370040316f52d1381..d60108186837c2e074893c51e678bfab8e18db9a (compare) is the relevant diff to get the Mac and Windows builds to pass.

Testing with linux asan (bazel build --config=asan) should hopefully cover the failures that surface on Mac and Windows. We're trying to add more test configs in #1283, including linux asan.

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour to `workerd`. We really don't want the `ProxyServer`'s singleton object to be evicted, as this would invalidate proxy stubs' heap addresses. This change makes sure the `preventEviction` flag is set.

…717)
* Bump `workerd` and versions to `3.20231016.0`
* Update `workerd` configuration schema and type definitions. Also add an override for `capnpc-ts`'s TypeScript version to prevent issues re-generating types in the future.
* Prevent `ProxyServer` Durable Object eviction

MellowYarker requested review from smerritt, jqmmes, kentonv and bcaimano September 7, 2023 22:09

MellowYarker commented Sep 7, 2023


jasnell reviewed Sep 7, 2023


jasnell reviewed Sep 7, 2023


MellowYarker mentioned this pull request Sep 7, 2023

feature request: trigger hibernation in durable objects #904

Closed

MellowYarker force-pushed the milan/evict-inactive-actors branch 3 times, most recently from 18ee96f to 65d94b5 Compare September 11, 2023 17:34

MellowYarker force-pushed the milan/evict-inactive-actors branch 2 times, most recently from 0896fda to b6dedd9 Compare September 12, 2023 00:24

bcaimano reviewed Sep 12, 2023


MellowYarker force-pushed the milan/evict-inactive-actors branch 3 times, most recently from eebcc66 to 07a5279 Compare September 15, 2023 22:32