Evict actors after 10 sec inactivity in Workerd #1138
Conversation
src/workerd/server/server.c++ (Outdated)

    auto& actorContainer = actors.findOrCreate(id, [&]() mutable {
      auto container = kj::heap<ActorContainer>(kj::str(id), *this, timer);
      ...
      // TODO(sqlite): Now that actors are backed by real disk, we should shut them down after
@kentonv Curious why a minute of inactivity was picked here. Does it matter if it's a minute vs. 10 seconds?
There are two relevant timeouts:
- Hibernation occurs after 10 seconds of internal inactivity, that is, 10 seconds of not having any non-hibernatable work scheduled inside the isolate. Clients may still be connected.
- Eviction occurs after 60 seconds (or is it 70 seconds? I can never remember) of not having any clients connected.
We should make sure to cover both of these in this PR. For example, if someone uses `setInterval()` to schedule a periodic callback, that should prevent hibernation, and the actor should only be evicted after the 60-second interval. Is that covered currently? (Haven't had a chance to look at the code yet.)
The longer timeout isn't covered here. I can add it if we think it's worth doing now, but the intention of this PR was to address #904 (DOs never evict in wrangler/miniflare when using WS hibernation, so state management bugs won't show up in testing but can in prod).
Hmm, I would argue the real goal is to simulate the behavior that would be seen in production, otherwise people can't properly test their code.
As this PR is written at present, what is the behavior in the case someone does setInterval()? Will the actor stay running forever to continue calling the interval callback, or will it be terminated after 10 seconds? (The production behavior is to keep running for 60/70 seconds after all clients have disconnected.)
I added a test (see second commit) that confirms that if you call `setInterval()`, the 10-second inactivity timer will not start until the work is finished (i.e. `clearInterval()` is called); until then, the actor will run forever.
I also included a cleanup loop that erases from `actors` if `getActorImpl()` hasn't been called in 70 seconds, though this needs a bit more work. This is complicated by the fact that we want to get rid of the actor instance but keep the `HibernationManager` associated with it, and the natural top-level place to put that is the entry of the actor map `actors` (keeping it directly on `ActorNamespace` goes a bit too far). In other words, we don't want to delete the entry if it has hibernatable websockets.

I'll see if I can get around this by adding a new HashMap of `{ActorId -> HibernationManager}` to `ActorNamespace`; that should let me fully remove the actor map entry.
> The production behavior is to keep running for 60/70 seconds after all clients have disconnected.
How am I able to detect client disconnect here? I assume this is in reference to a client no longer holding a DO stub but the DO still having work scheduled, yes? In that case it wouldn't be our `RequestTracker` that informs us the client has disconnected, but something else instead (internally we use capabilities, but I don't think that happens in the workerd implementation). Do we need to detect if a `WorkerInterface` gets dropped?
I think there's some confusion here. The 70-second timeout only applies after there are no clients at all. If there is an open WebSocket, even if it's using hibernation, that counts as a client, and so the 70-second timeout doesn't apply. So I think the HibernationManager can in fact be destroyed when the 70-second timeout kicks in, because there are necessarily no hibernated WebSockets at that point.
More concretely: `ActorNamespace::getActor()` returns an `Own<WorkerInterface>`, which the caller will hold until it no longer needs it. I think the timeout you want here is: 70 seconds after all of these `WorkerInterface` instances have been dropped (and no new ones created), you want to destroy the actor.

The 10-second hibernation timeout is a bit different. It applies when no work is scheduled and all client connections are hibernatable.
Thanks for clarifying!
As of the latest fixup, the `cleanupLoop` will not evict if we have any connected clients. Once the last client drops its `Own<WorkerInterface>` we set the lastAccess timestamp, and the `cleanupLoop` will eventually remove the entry from `actors` (unless another client connects).
The last obvious issue is a weird discrepancy between our internal eviction process and this one. Specifically, this `constructorFailedPaf` promise is being rejected when I do the eviction by destroying the final `Worker::Actor` (which sets `Worker::Actor::Impl` to `kj::none`) in workerd, but it doesn't get rejected in our internal hibernation tests (I think it's just cancelled, since it doesn't resolve either).

I assume that because we destroy `constructorFailedPaf.fulfiller` before we `fulfill()/reject()`, the promise is being rejected implicitly. It feels like `onBroken()` should be triggered when the actor actually breaks, not when we're evicting it for hibernation. If the actor's `onBroken()` actually rejects, then we should probably remove the entry from `actors` even if we have hibernatable websockets (thereby disconnecting them).
This implies that when you destroy the `Worker::Actor`, someone is still waiting on a promise from `actor.onBroken()`. This goes against general KJ rules, which say that an object must not be destroyed while a Promise returned by one of its methods still exists. What you need to do here is cancel whatever is waiting on the `onBroken()` promise before you destroy the `Actor` object.
(force-pushed 18ee96f to 65d94b5)
Tests were failing because the actor ID was getting corrupted, plus a segfault in …
(force-pushed 0896fda to b6dedd9)
Rebased to resolve conflicts.
src/workerd/server/server.c++ (Outdated)

    auto now = timer.now();
    actors.eraseAll([&](auto&, kj::Own<ActorContainer>& entry) {
      ...
      // TODO(now): What do we do if the actor is still in memory, but the last access time
Hmmmm, I'm not fully up to date on this code, but can you attach some sort of guard to the promises we return for working with the actor? I notice `onActorBroken()` below, at least.
Sorry, mind expanding a bit? You mean like a `try...catch`?
Ugh, I've lost context because of the outdated commit. I suspect I was trying to say that you could track whether the actor was evicted by whether anybody was waiting on `onActorBroken()`.
(force-pushed eebcc66 to 07a5279)
src/workerd/server/server.c++ (Outdated)

    KJ_IF_SOME(m, a->getHibernationManager()) {
      // The hibernation manager needs to survive actor eviction and be passed to the actor
      // constructor next time we create it.
      manager = kj::addRef(*static_cast<HibernationManagerImpl*>(&m));
This downcast looks like it's making a lot of assumptions, but I see similar downcasts exist elsewhere in the code too. I guess `HibernationManagerImpl` is the only allowed implementation of `HibernationManager`? If that's the case, could we possibly remove the inheritance and just have a single concrete type called `HibernationManager`?

This cleanup could be in a separate PR since it's not really a new issue introduced here.
(force-pushed 897e1d8 to 1ae1cf3)
Adding a reminder to myself to also add a flag for Miniflare that prevents eviction. Edit: Done in the 3rd commit.
(force-pushed 0cbd55f to 8c924b3, 5da5a6d to 6c96c8e, 6c96c8e to 1acf0e5, 01811a2 to 53df814)
This behavior has been present in production for a while. Durable Objects with Hibernatable WebSockets should now be able to hibernate when using Wrangler or Miniflare. Credit to Sam for adding the websocket test functions! Co-authored-by: Sam Merritt <smerritt@cloudflare.com>
This is separate from the 10 second inactivity eviction.
Miniflare depends on Durable Objects staying in memory forever. This commit provides a way to ensure a DO namespace cannot be evicted (unless it is broken), thereby retaining the old behavior.
(force-pushed 40eef8c to d601081)
This is the relevant diff to get the Mac and Windows builds to pass.
Testing with Linux ASan (`bazel build --config=asan`) should hopefully cover the failures that surface on Mac and Windows. We're trying to add more test configs in #1283, including Linux ASan.
cloudflare/workerd#1138 introduced Durable Object's eviction behaviour to `workerd`. We really don't want the `ProxyServer`'s singleton object to be evicted, as this would invalidate proxy stubs' heap addresses. This change makes sure the `preventEviction` flag is set.
…717)
* Bump `workerd` and versions to `3.20231016.0`
* Update `workerd` configuration schema and type definitions. Also add an override for `capnpc-ts`'s TypeScript version to prevent issues re-generating types in the future.
* Prevent `ProxyServer` Durable Object eviction. cloudflare/workerd#1138 introduced Durable Object's eviction behaviour to `workerd`. We really don't want the `ProxyServer`'s singleton object to be evicted, as this would invalidate proxy stubs' heap addresses. This change makes sure the `preventEviction` flag is set.