Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evict actors after 10 sec inactivity in Workerd #1138

Merged
merged 3 commits into from
Oct 11, 2023

Conversation

MellowYarker
Copy link
Contributor

This behavior has been present in production for a while. Durable Objects with Hibernatable WebSockets should now be able to hibernate when using Wrangler or Miniflare.

Credit to Sam for adding the websocket test functions!

auto& actorContainer = actors.findOrCreate(id, [&]() mutable {
auto container = kj::heap<ActorContainer>(kj::str(id), *this, timer);

// TODO(sqlite): Now that actors are backed by real disk, we should shut them down after
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kentonv curious on why a minute of inactivity was picked here, does it matter if it's a minute vs. 10 sec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two relevant timeouts:

  • Hibernation occurs after 10 seconds of internal inactivity, that is, 10 seconds of not having any non-hibernatable work scheduled inside the isolate. Clients may still be connected.
  • Eviction occurs after 60 seconds (or is it 70 seconds? I can never remember) of not having any clients connected.

We should make sure to cover both of these in this PR. For example, if someone uses setInterval() to schedule a periodic callback, this should prevent hibernation, and the actor should only be evicted after the 60-second interval. Is that covered currently? (Haven't had a chance to look at the code yet.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer timeout isn't covered here, I can add it if we think it's worth doing now but the intention of this PR was to address #904 (DOs never evict in wrangler/miniflare when using ws hibernation, so state management bugs won't show up in testing but can in prod).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would argue the real goal is to simulate the behavior that would be seen in production, otherwise people can't properly test their code.

As this PR is written at present, what is the behavior in the case someone does setInterval()? Will the actor stay running forever to continue calling the interval callback, or will it be terminated after 10 seconds? (The production behavior is to keep running for 60/70 seconds after all clients have disconnected.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test (see second commit) that confirms if you call setInterval() then the 10 second inactivity timer will not start until the work is finished (i.e. clearInterval() is called); it'll run forever.

I also included a cleanup loop that erases from actors if getActorImpl() hasn't been called in 70 seconds, though this needs a bit more work. This is complicated a bit by the fact that we want to get rid of the actor instance, but keep the HibernationManager associated with it, and it feels like the top-level place to put that is in the entry of the actor map actors (since it's a member of ActorNamespace, which goes a bit too far). In other words, we don't want to delete the entry if it has hibernatable websockets.

I'll see if I can get around this by including a new HashMap of {ActorId -> HibernationManager} in ActorNamespace, that should let me fully remove the actor map entry.


The production behavior is to keep running for 60/70 seconds after all clients have disconnected.

How am I able to detect client disconnect here? I assume this in reference to a client no longer holding a DO stub, but the DO having work scheduled still, yes? In this case, it wouldn't be our RequestTracker that informs us if the client has disconnected, but something else instead (internally we capabilities but I don't think that happens in the Workerd implementation).

Do we need to detect if a WorkerInterface gets dropped?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some confusion here. The 70-second timeout only applies after there are no clients at all. If there is an open WebSocket, even if it's using hibernation, that counts as a client, and so the 70-second timeout doesn't apply. So I think the HibernationManager can in fact be destroyed when the 70-second timeout kicks in, because there are necessarily no hibernated WebSockets at that point.

More concretely: ActorNamespace::getActor() returns an Own<WorkerInterface>, which the caller will hold until it no longer needs it. I think the timeout you want here is: 70 seconds after all of these WorkerInterface instances have been dropped (and no new ones created), you want to destroy the actor.

The 10-second hibernation timeout is a bit different. This applies when no work is scheduled and all client connections are hibernatable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying!

As of latest fixup, the cleanupLoop will not evict if we have any connected clients, and once the last client drops its Own<WorkerInterface> we'll set the lastAccess timestamp, and the cleanupLoop will eventually remove the entry from actors (unless another client connects).


Last obvious issue is there's a weird discrepancy between our internal eviction process and this one. Specifically this constructorFailedPaf promise is being rejected when I do the eviction by destroying the final Worker::Actor (which sets Worker::Actor::Impl to kj::none) in Workerd, but it doesn't get rejected in our internal hibernation tests (I think it's just cancelled as it doesn't resolve either?).

I assume that because we are destroying the constructorFailedPaf.fulfiller before we fulfill()/reject(), the promise is being rejected implicitly. It feels like onBroken() should be triggered when the actor actually breaks, not when we're evicting it for hibernation -- if the actor's onBroken() actually rejects then we should probably be removing the entry from actors even if we have hibernatable websockets (thereby disconnecting them).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that when you destroy the Worker::Actor, someone is still waiting on a promise from actor.onBroken(). This goes against general KJ rules, which say that an object must not be destroyed while a Promise returned by one of its methods still exists. What you need to do here is cancel whatever it is that is waiting on the onBroken() promise before you destroy the Actor object.

@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 3 times, most recently from 18ee96f to 65d94b5 Compare September 11, 2023 17:34
@MellowYarker
Copy link
Contributor Author

Tests were failing because the actor ID was getting corrupted + a segfault in sql-test caused by the RequestTrackers hooks running after it was destroyed.

@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 2 times, most recently from 0896fda to b6dedd9 Compare September 12, 2023 00:24
@MellowYarker
Copy link
Contributor Author

Rebased to resolve conflicts.

auto now = timer.now();
actors.eraseAll([&](auto&, kj::Own<ActorContainer>& entry) {

// TODO(now): What do we do if the actor is still in memory, but the last access time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, I'm not fully up to date on this code but can you attach some sort of guard to promises we return for working with the actor? I notice onActorBroken() below at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, mind expanding a bit? You mean like a try...catch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I've lost context because of the outdated commit. I suspect that I was trying to say that you could track if the actor was evicted by if anybody was waiting on onActorBroken().

@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 3 times, most recently from eebcc66 to 07a5279 Compare September 15, 2023 22:32
KJ_IF_SOME(m, a->getHibernationManager()) {
// The hibernation manager needs to survive actor eviction and be passed to the actor
// constructor next time we create it.
manager = kj::addRef(*static_cast<HibernationManagerImpl*>(&m));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This downcast looks like it's making a lot of assumptions, but I see similar downcasts exist elsewhere in the code too. I guess HibernationManagerImpl is the only allowed implementation of HibernationManager? If that's the case, could we possibly remove the inheritance and just have a single concrete type called HibernationManager?

This cleanup could be in a separate PR since it's not really a new issue introduced here.

@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 2 times, most recently from 897e1d8 to 1ae1cf3 Compare September 22, 2023 13:23
@MellowYarker
Copy link
Contributor Author

api:sql-test is failing in CI with a segfault but not locally, will dig into it later.

src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
@MellowYarker
Copy link
Contributor Author

MellowYarker commented Sep 26, 2023

Adding a reminder to myself to also add a flag for Miniflare that prevents eviction.

Edit: Done in the 3rd commit

@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 4 times, most recently from 0cbd55f to 8c924b3 Compare September 27, 2023 15:36
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
src/workerd/server/server.c++ Outdated Show resolved Hide resolved
@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch from 5da5a6d to 6c96c8e Compare September 27, 2023 22:04
@MellowYarker MellowYarker linked an issue Oct 2, 2023 that may be closed by this pull request
@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch from 6c96c8e to 1acf0e5 Compare October 3, 2023 13:56
@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch 3 times, most recently from 01811a2 to 53df814 Compare October 11, 2023 19:16
MellowYarker and others added 3 commits October 11, 2023 15:53
This behavior has been present in production for a while. Durable
Objects with Hibernatable WebSockets should now be able to hibernate
when using Wrangler or Miniflare.

Credit to Sam for adding the websocket test functions!

Co-authored-by: Sam Merritt <smerritt@cloudflare.com>
This is separate from the 10 second inactivity eviction.
Miniflare depends on Durable Objects staying in memory forever.
This commit provides a way to ensure a DO namespace cannot be evicted
(unless it is broken), thereby retaining the old behavior.
@MellowYarker MellowYarker force-pushed the milan/evict-inactive-actors branch from 40eef8c to d601081 Compare October 11, 2023 19:56
@MellowYarker
Copy link
Contributor Author

@ohodson
Copy link
Contributor

ohodson commented Oct 11, 2023

f7ccfe149441d007ee077be370040316f52d1381..d60108186837c2e074893c51e678bfab8e18db9a (compare)

Is the relevant diff to get the mac and windows builds to pass.

Testing with linux asan (bazel build --config=asan) should hopefully cover the failures that surface on Mac and Windows. We're trying to add more test configs in #1283, including linux asan.

@MellowYarker MellowYarker merged commit 3f16108 into main Oct 11, 2023
7 checks passed
@MellowYarker MellowYarker deleted the milan/evict-inactive-actors branch October 11, 2023 21:38
mrbbot added a commit to cloudflare/miniflare that referenced this pull request Oct 17, 2023
cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
mrbbot added a commit to cloudflare/miniflare that referenced this pull request Oct 17, 2023
…717)

* Bump `workerd` and versions to `3.20231016.0`

* Update `workerd` configuration schema and type definitions

Also add an override for `capnpc-ts`'s TypeScript version to prevent
issues re-generating types in the future.

* Prevent `ProxyServer` Durable Object eviction

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
mrbbot added a commit to cloudflare/workers-sdk that referenced this pull request Oct 31, 2023
…717)

* Bump `workerd` and versions to `3.20231016.0`

* Update `workerd` configuration schema and type definitions

Also add an override for `capnpc-ts`'s TypeScript version to prevent
issues re-generating types in the future.

* Prevent `ProxyServer` Durable Object eviction

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
mrbbot added a commit to cloudflare/workers-sdk that referenced this pull request Nov 1, 2023
…717)

* Bump `workerd` and versions to `3.20231016.0`

* Update `workerd` configuration schema and type definitions

Also add an override for `capnpc-ts`'s TypeScript version to prevent
issues re-generating types in the future.

* Prevent `ProxyServer` Durable Object eviction

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
mrbbot added a commit to cloudflare/workers-sdk that referenced this pull request Nov 1, 2023
…717)

* Bump `workerd` and versions to `3.20231016.0`

* Update `workerd` configuration schema and type definitions

Also add an override for `capnpc-ts`'s TypeScript version to prevent
issues re-generating types in the future.

* Prevent `ProxyServer` Durable Object eviction

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
mrbbot added a commit to cloudflare/workers-sdk that referenced this pull request Nov 1, 2023
…717)

* Bump `workerd` and versions to `3.20231016.0`

* Update `workerd` configuration schema and type definitions

Also add an override for `capnpc-ts`'s TypeScript version to prevent
issues re-generating types in the future.

* Prevent `ProxyServer` Durable Object eviction

cloudflare/workerd#1138 introduced Durable Object's eviction behaviour
to `workerd`. We really don't want the `ProxyServer`'s singleton
object to be evicted, as this would invalidate proxy stubs' heap
addresses. This change makes sure the `preventEviction` flag is set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: trigger hibernation in durable objects
5 participants