Add acceptance queue #4638
Conversation
Force-pushed from b9d3bf4 to 02b566c.
I can't make heads or tails of this with something called "syscall queue". I tried just treating it as an arbitrary, meaningless label like "cromfelter" but the cognitive interference with the actual words overwhelmed me and I just couldn't pull it off. Please either rename the abstraction or provide a deeper analysis to help me understand why that name makes any kind of sense. In particular, I don't understand the demarcation that the second queue is trying to capture.
```diff
@@ -89,7 +89,7 @@ export function initializeKernel(config, hostStorage, verbose = false) {
   vatKeeper.setSourceAndOptions({ bundleID }, creationOptions);
   vatKeeper.initializeReapCountdown(creationOptions.reapInterval);
   if (!creationOptions.enableSetup) {
-    kernelKeeper.addToRunQueue(harden({ type: 'startVat', vatID }));
+    kernelKeeper.addToSyscallQueue(harden({ type: 'startVat', vatID }));
```
I totally don't understand the concept of "syscall queue", but maybe it's just a really bad name for whatever it actually is. The run queue is a queue of things that go into vats, i.e., things that will result in deliveries, or things that are done to vats. Syscalls are things that come out of a vat or are done by a vat. Syscalls are also synchronous, so the idea of a queue involving them makes no sense to me.
Yes, I struggle with the naming as well, but couldn't come up with anything better. Suggestions welcome. In the discussion with @dtribble and @warner, summarized in #3465 (comment), we were talking about inbound and outbound queues. However, if you don't know the reference point, that terminology is even more confusing (the vat was the reference point, so inbound meant pending message deliveries into the vat, and outbound meant pending messages generated by the vat). When I looked at syscalls, I realized there are conceptually 2 kinds: those that need a synchronous result and must be executed immediately, and those that are asynchronous, fire-and-forget operations (`send`, `resolve`, `subscribe`).

The idea is to put all the asynchronous syscall messages as-is into a queue (to be named). The kernel then pops those messages from the queue and routes or processes them appropriately:

- For a `send`, the message may be put on the delivery queue or on a promise queue, depending on the type and state of the target.
- For a `resolve`, the promise state would be updated and notifications queued for delivery to subscribers.
- For a `subscribe`, either the subscribing vat would be added to the list of subscribers, or a notification would be queued back, depending on the state of the promise.

For a single pair of queues, like in this PR, this does not really change the order of message delivery. However, once we switch to different queues per vat (#3465), and the kernel is free to process messages from vats independently from each other, this allows limiting the effect a delivery has on the overall system, since asynchronous activity from a vat is queued and processed one at a time. I actually considered also renaming the current run-queue.
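To make that routing concrete, here is a minimal sketch of the drain step, using plain in-memory structures rather than the real kernelKeeper API (all names here are illustrative, not actual kernel code):

```js
// Hypothetical sketch of routing messages popped from the to-be-named
// queue. Plain Maps/arrays stand in for the real kernel state.
const promises = new Map(); // promiseID -> { state, subscribers: Set, queue: [] }
const runQueue = [];        // the existing delivery ("run") queue

function routeAcceptedMessage(message) {
  switch (message.type) {
    case 'send': {
      const p = promises.get(message.target);
      if (p && p.state === 'unresolved') {
        p.queue.push(message); // park on the promise queue until resolution
      } else {
        runQueue.push(message); // target is an object (or settled promise)
      }
      break;
    }
    case 'resolve': {
      const p = promises.get(message.promiseID);
      p.state = 'resolved';
      for (const vatID of p.subscribers) {
        runQueue.push({ type: 'notify', vatID, promiseID: message.promiseID });
      }
      runQueue.push(...p.queue); // re-route sends parked on this promise
      p.queue = [];
      break;
    }
    case 'subscribe': {
      const p = promises.get(message.promiseID);
      if (p.state === 'unresolved') {
        p.subscribers.add(message.vatID); // just record the subscription
      } else {
        runQueue.push({ type: 'notify', vatID: message.vatID, promiseID: message.promiseID });
      }
      break;
    }
    default:
      throw Error(`unexpected queued syscall type ${message.type}`);
  }
}
```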
"Routing queue" is better, especially when paired with "delivery queue", but possibly we can do even better. That pairing at least hints at a functional demarcation. I remain a little concerned that whatever thought process generated the term "syscall queue" in the first place was at least partially confused about the conceptualization behind this and therefore what this code does might also be. I know that I'm certainly confused about it. You also used the phrase "syscall messages" in your comment, which makes no sense at all to me and suggests that there is something foundational here that is wrong. Probably this is a worthy agenda item for our next kernel meeting. |
Btw, I saw this and wondered what a "cromfelter" might actually be. So I googled, and the first hit was http://habitatchronicles.com/2018/01/pool-abuse/. Now I know ;)
Another issue arising is what to do regarding vat termination. In the case of a clean exit, I'm imagining we could process these previous syscalls first (not what this PR does), so we can afford to delay the termination until a later crank. However, in the case of a failure, I'm not sure if we need the termination to be recorded in the same crank. I assume as long as all pending "syscalls" are processed, it doesn't matter if the vat is terminated or not?
Force-pushed from cb83115 to 46a5cdc, then from 46a5cdc to a1fcbeb.
Renamed the "syscall queue" to "acceptance queue".
With a distinct vat-inbound queue, we can do the same unwinding, but then reject everything in the inbound queue immediately. No need to wait for a redelivery, which (given multiple queues) could take an unknown amount of time to be serviced.

The real question is what to do about messages that are already in the vat-outbound queue. The vat observed them to be sent successfully, and has no control over how the outbound queue is serviced, so if the kernel deletes the queue upon termination, basically some random portion of the sent messages would get deleted. I don't see how programmers could reason about that. But the alternative is to keep the vat-outbound queue around (occasionally servicing the remaining messages) even though the vat itself is terminated, which is pretty awkward. The vat lifecycle would grow from "non-existent -> active -> non-existent" to "non-existent -> active -> lingering-death -> non-existent". Maybe if we track vat-outbound queues in one data structure, and vats themselves in a separate one, it wouldn't be too hard to manage deleting the vat but retaining the queue until it is empty (and then deleting it).
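For what it's worth, a minimal sketch of that two-structure bookkeeping (all names hypothetical, not the actual kernelKeeper layout):

```js
// Hypothetical: vats and their outbound queues tracked separately, so a
// terminated vat can linger only as an outbound queue until it drains.
const vats = new Map();           // vatID -> vat state
const outboundQueues = new Map(); // vatID -> [queued messages]

function terminateVat(vatID) {
  vats.delete(vatID); // the vat itself goes away immediately
  // its outbound queue is retained and serviced until empty
}

function serviceOutbound(vatID) {
  const queue = outboundQueues.get(vatID);
  if (!queue || queue.length === 0) return undefined;
  const message = queue.shift();
  if (queue.length === 0 && !vats.has(vatID)) {
    outboundQueues.delete(vatID); // lingering-death is over
  }
  return message;
}
```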
That is not my reading of agoric-sdk/packages/SwingSet/src/kernel/kernel.js lines 309 to 318 (at a1fcbeb).
I think @mhofman is right on this. A non-aborting exit shuts down the vat after the state changes made during the crank are committed. This means that although the state of the vat itself goes away, any consequences to the kernel data structures (notably the runQueue) are retained, i.e., messages that were sent are actually sent. However, I don't know that this introduces any new complications above and beyond the (non-trivial!) complications that have already been raised.
Ah, ok, you're right. Vats which somehow perform an illegal operation (or simply an infinite loop) and get terminated non-cleanly should not expect that any messages they sent within the same delivery will be committed, and unwinding the state takes care of that. The question of whether previous cranks' messages get deleted or eventually delivered remains, and should have the same answer as for a clean exit. The same is true for vats that are killed externally.
Right, I think the summary is that this PR changes nothing about vat termination, since the acceptance queue is simply the tail of the previous run-queue. Vat exit will only become a question once we hold a queue pair per vat. At that point, to preserve existing delivery guarantees (messages sent in previous cranks get delivered regardless of whether the vat terminates in a later crank), the vat-outbound queue (whatever has been committed to it) needs to get drained after the vat is marked as terminated.
Looks good, small changes suggested (and slightly-expanded test requested).
All methods should return `true` if the kernel should keep running, or `false` if it should stop.

The `computrons` argument may be `undefined` (e.g. if the crank was delivered to a non-`xs worker`-based vat, such as the comms vat). The policy should probably treat this as equivalent to some "typical" number of computrons.

`crankFailed` indicates the vat suffered an error during crank delivery, such as a metering fault, memory allocation fault, or fatal syscall. We do not currently have a way to measure the computron usage of failed cranks (many of the error cases are signaled by the worker process exiting with a distinctive status code, which does not give it an opportunity to report back detailed metering data). The run policy should assume the worst.

`emptyCrank` indicates the kernel processed a queued message which didn't result in a delivery.
Half of me wants to not tell the runPolicy about these. I don't know how the host application author should react: how many are too many? I guess a basic approach would be to say "no more than 200 `emptyCrank`s in a block". Would that protect against some sort of infinite cyclic loop or DoS attack?

Let's expand on the comment a bit, add another sentence or two about what situation can cause this (`syscall.send` to a rejected promise?). It seems like anything pulling from a vat-outbound-queue will never result in a delivery, so maybe this happens like 50% of the time, and if so the host app author should set their threshold a lot higher.
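For illustration, a run policy implementing that basic approach might look like the sketch below, assuming the interface described in the excerpt above (each method returning `true` to keep running); the thresholds are illustrative guesses, not recommendations:

```js
// Hypothetical run policy: stop the block after a computron budget is
// exhausted or after too many emptyCranks. Numbers are made up.
function makeRunPolicy({ computronBudget = 65_000_000n, maxEmptyCranks = 200 } = {}) {
  let computronsUsed = 0n;
  let emptyCranks = 0;
  const keepGoing = () =>
    computronsUsed < computronBudget && emptyCranks < maxEmptyCranks;
  return harden({
    crankComplete({ computrons } = {}) {
      // computrons may be undefined (non-xs-worker vats); charge some
      // "typical" amount rather than zero.
      computronsUsed += computrons === undefined ? 100_000n : computrons;
      return keepGoing();
    },
    crankFailed() {
      computronsUsed += 1_000_000n; // assume the worst for failed cranks
      return keepGoing();
    },
    emptyCrank() {
      emptyCranks += 1;
      return keepGoing();
    },
  });
}
```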
Hm, maybe we need to rename `policy.crankComplete` to `deliveryComplete`, change `crankFailed` to `deliveryFailed`, and then add a `crankHappened` which means the kernel did work but not necessarily a vat.
But, I'm ok with this PR landing without that change, if we make a ticket to improve it later.
I think being more explicit about which kind of crank happened would indeed be good. It probably makes sense to include this as part of the next step, when changing the processing of the acceptance queue.
```diff
@@ -89,7 +89,7 @@ export function initializeKernel(config, hostStorage, verbose = false) {
   vatKeeper.setSourceAndOptions({ bundleID }, creationOptions);
   vatKeeper.initializeReapCountdown(creationOptions.reapInterval);
   if (!creationOptions.enableSetup) {
-    kernelKeeper.addToRunQueue(harden({ type: 'startVat', vatID }));
+    kernelKeeper.addToAcceptanceQueue(harden({ type: 'startVat', vatID }));
```
If I understand correctly that "acceptance queue" is the vat-outbound-queue (populated by vat syscalls), then I'd think `startVat` goes on the other one (the vat-inbound queue).

Also, it's critical that `startVat` gets delivered to a vat before anything else is delivered to that vat, regardless of whatever fairness sampling we might implement later, otherwise messages could arrive at an incomplete vat. Putting `startVat` on the vat-inbound queue should guarantee this.
Yes, probably `startVat` could go straight to the `runQueue`. However, I didn't want to change any message ordering in this PR, so I always enqueue everything at the end of the `acceptanceQueue` for now. Happy to change the potential order already in this case.
> so I always enqueue everything at the end of the acceptanceQueue for now
Ah! That totally makes sense and explains a large amount of my confusion. Now I understand why I didn't understand what the demarcation criteria were.
```js
/** @type {Promise<PolicyInput> | undefined} */
let resultPromise;
let message = getNextAcceptanceMessage();
if (message) {
```
nit: I'd prefer a short-circuiting style:

```js
let message = getNextAcceptanceMessage();
if (message) {
  return processAcceptanceMessage(message);
}
message = getNextDeliveryMessage();
if (message) {
  return processDeliveryMessage(message);
}
```
But the fact that you're returning a record, not just a promise, tells me that you've got something in mind for this, so ignore me if your plans don't align with my aesthetics.
BTW, the return value of `getNextMessage` has an extra significance that needs to be kept in mind (it looks like your code does the right thing, but I wanted to be explicit about the requirement). If there's no work to be done, we don't commit any state changes. If e.g. `maybeProcessNextMessage` pulled something off a queue and then decided that we didn't need to do anything, we'd have a problem, because now there are outstanding state changes that need to be committed. If the kernel doesn't `commitCrank` before returning control to the host app, we're going to have consistency problems (sometimes the pop-from-queue changes get committed, sometimes they don't, depending upon what happens next time).

I think you're headed this way, but we should aim for a pattern in which crank commits happen at the end of every crank, regardless of whether that crank includes a delivery or not. When deliveries are unwound, we unwind the entire crank buffer, and we need to make sure that we still pop off whatever queue entry started the process. The `getNextMessage` call in the `discardFailedDelivery` branch of `processQueueMessage` does this, and we need to make sure it still does the right thing no matter what queue the message came from.
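A sketch of the invariant being requested here, with names borrowed from the comment above; the `abortCrank`/re-pop shape is an assumption about how the pieces fit together, not a quote of the actual code (where the discard happens inside `processQueueMessage`):

```js
// Hypothetical shape: every crank that pops a queue entry must end in a
// commit, even when the delivery itself gets unwound.
async function runOneCrank() {
  const message = getNextMessage(); // mutates the crank buffer (queue pop)
  if (!message) {
    return false; // nothing popped, so nothing to commit
  }
  try {
    await processQueueMessage(message);
    commitCrank(); // commits both the delivery results and the pop
  } catch (err) {
    abortCrank();     // unwind the entire crank buffer...
    getNextMessage(); // ...then re-pop, so the entry is still consumed
    commitCrank();    // and commit just the pop (discardFailedDelivery)
  }
  return true;
}
```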
> returning a record, not just a promise
In general I don't like having a return value of `Promise | undefined`, as it would mean something is maybe async. I also prefer short-circuiting, but in this case it didn't feel right.

I am planning on ripping this ugly function apart in the next PR, and didn't want to spend time making it right, but happy to revisit.
> I think you're headed this way, but we should aim for a pattern in which crank commits happen at the end of every crank, regardless of whether that crank includes a delivery or not
I believe that's already the case. The only time we don't create a crank is if there is no message at all. I will keep making sure that a crank is always created and committed or aborted if a message is popped from a queue.
```diff
@@ -1013,7 +1046,7 @@ export default function buildKernel(
   const source = { bundle };
   const vatID = kernelKeeper.allocateUnusedVatID();
   const event = { type: 'create-vat', vatID, source, dynamicOptions };
-  kernelKeeper.addToRunQueue(harden(event));
+  kernelKeeper.addToAcceptanceQueue(harden(event));
```
Hm.. FYI, this call only happens while the `vatAdmin` vat is running, because it's (normally) the only one with access to `devices.vatAdmin`, and this is an endowment provided only to the device. Putting it on the acceptance queue kinda claims to rate-limit the creation of new vats, but that's already limited by the ability to deliver `vatAdminService~.createVat()` deliveries into vatAdmin.

I think it might be better to drop this on the other queue. Or to remember (maybe add a note) that this wants to go onto its own queue (not associated with any particular vat). The `create-vat` events are not vat syscalls (even though they're triggered by a `syscall.callNow(vatAdminDevice)` from vatAdminVat), and `devices.vatAdmin` does not know which vat is asking it to use its kernel endowment, and we're not considering having device-outbound queues to match the vat-outbound queues.
> Or to remember (maybe add a note) that this wants to go onto its own queue (not associated with any particular vat).
Planning on revisiting all enqueuing in the next PR. As stated, I'm trying not to change any order in this PR.
```js
kernelKeeper.addToRunQueue(message);

kernelKeeper.processRefcounts();
```
My feeling is that we could skip this `processRefcounts` because nothing could change them during a simple queue-to-queue transfer, but it's safer to leave it in; in the future the transfer function might cause promises to be rejected or messages to go splat, and those certainly would change refcounts.
I am planning on revisiting refcounts in the next PR: currently popping a `send` from the runQueue decrements refcounts, and re-increments them when re-queueing on the promise queue. Obviously that's extra work we won't usually need when popping a `send` from the outbound / acceptance queue, since it will likely move to another queue (except, as you say, if sending to an invalid promise).
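Roughly the contrast being described, as a sketch (`adjustRefCounts` is an illustrative helper, not the real kernelKeeper API):

```js
// Hypothetical: moving a message between queues. The current pattern
// decrements refcounts on pop and re-increments on re-queue, which is a
// net no-op for a pure queue-to-queue transfer.
function transferMessage(fromQueue, toQueue, adjustRefCounts) {
  const message = fromQueue.shift();
  adjustRefCounts(message, -1); // today: pop decrements each kref
  adjustRefCounts(message, +1); // today: re-queue increments again
  // A plain transfer could skip both calls, since the message keeps
  // holding its references while sitting on the destination queue.
  toQueue.push(message);
  return message;
}
```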
Force-pushed from a1fcbeb to ede05b2.
```diff
@@ -1016,7 +1050,7 @@ export default function buildKernel(
   const source = { bundle };
   const vatID = kernelKeeper.allocateUnusedVatID();
   const event = { type: 'create-vat', vatID, source, dynamicOptions };
-  kernelKeeper.addToRunQueue(harden(event));
+  kernelKeeper.addToAcceptanceQueue(harden(event));
```
I think I'm unclear on the demarcation criteria that determine which deliveries get initially put on the acceptanceQueue and which get put on the runQueue directly. My intuition says that `'startVat'` should be one of the latter, but I'm not sure what standard I should be judging by.
In the future, I believe the standard should be based on where the message comes from, and whether we know the destination. For kernel-initiated events, we could imagine having an "outbound queue" for the kernel with high priority. Or we could bypass that and enqueue directly onto the target.

Regardless, it's a problem I'm punting on until the next PR, as here I'm taking the stance that everything goes through the acceptance queue first, so that this PR doesn't change any delivery order.
```js
  return undefined;
}

function maybeProcessNextMessage() {
```
Not sure about the name. I take the "maybe" here to mean "maybe there will be a message to process, maybe there won't, but if there is I will" rather than "maybe I'll process the next message, maybe I won't process it", but the name hints at the latter.
`processNextMessageIfAny`?
```js
await c.step(); // vat starts
await c.step(); // vat start acceptance
// eslint-disable-next-line no-await-in-loop
await c.step(); // vat start deliver
```
@gibson042 has a PR (#4678) that's about to land that gets rid of some of this fussy `c.step()` stuff, which should subsume some of the changes you've had to make here.
Now that I understand that passage through the acceptance queue is currently unconditional, this makes a lot more sense. I'm looking forward to the next step in this process.
Force-pushed from ede05b2 to d63bce5.
refs: #3465
Best reviewed commit by commit
Description
This is a preliminary refactor splitting the run-queue in 2 before #3465. All `send`, `resolve`, and `subscribe` syscalls, as well as kernel-initiated deliveries, are first enqueued onto an "acceptance" queue, which is drained into the usual "run" queue in its own cranks.

The next step (follow-up PR) is to change the handling of the `resolve` and `subscribe` syscalls so that they are enqueued onto the acceptance queue first instead of mutating promise state immediately (except for reference counting). Processing those and `send` events from the acceptance queue would then be as follows:

- `send` onto promises is enqueued onto the promise queue or run-queue, depending on the state of the promise, instead of going through the run-queue as-is (see "syscall.send to resolved promise should run-queue to object, not to promise" #4542).
- `send` onto objects is simply moved to the run-queue.
- `resolve` would queue notifications on the run-queue for subscribers, and move `send`s from the promise queue to the run-queue.
- `subscribe` would either add the vat to the promise subscriber list, or enqueue a notify back to the vat, depending on the promise state.

The goal is to only have messages with a known destination vat pending on the run-queue, so that the run (aka delivery) and acceptance queues can be split per vat according to #3465 (comment).
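As implemented in this PR (per the `addToRunQueue`/`processRefcounts` excerpt reviewed above), the acceptance crank is essentially just a transfer; a rough sketch, with `kernelKeeper` passed in for illustration:

```js
// Rough sketch of this PR's acceptance crank: move one message from
// the acceptance queue to the run queue, leaving real routing to the
// follow-up PR.
function processAcceptanceMessage(kernelKeeper, message) {
  kernelKeeper.addToRunQueue(message); // unconditional re-queue for now
  kernelKeeper.processRefcounts();     // currently a net no-op, kept for safety
}
```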
The `maybeProcessNextMessage` function is awkward and a bit of a temporary hack. I'll rethink handling of messages once processing the acceptance queue does more than simply re-queueing the message.

This change also introduces an `emptyCrank` call to the run-policy. It was already possible to have empty cranks (that did no delivery) before, e.g. in the case of a send queued on an unresolved promise. With acceptance queue processing this is now much more common.

This PR does not change anything about vat termination, which will still be processed at the end of the current crank. Currently the delivery to vat-admin informing of the vat termination is enqueued, like every other delivery generated by the kernel, onto the acceptance queue, thus after any pending syscall events generated by the exited vat. In the future with queues per vat, we can imagine that kernel-initiated deliveries use their own acceptance queue, or are queued directly onto the target's queue.
This PR also contains a fix for a test inconsistency described in https://github.com/Agoric/agoric-sdk/pull/4575/files#r809781876
Security Considerations
None
Documentation Considerations
This change should only impact the number of cranks necessary to process messages, but nothing else.
Testing Considerations
Tests updated to account for the different number of cranks and the new state of the queues, but no specific tests were added since the existing tests seem to cover the updated code already.