overhaul solo-to-chain txn sending loop #2855

warner opened this issue Apr 11, 2021 · 6 comments
warner commented Apr 11, 2021

What is the Problem Being Solved?

@michaelfig and I were looking at the code in cosmic-swingset that delivers messages from a solo node to the chain (mostly in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/chain-cosmos-sdk.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/outbound.js). This code extracts outbound message payloads from the swingset "mailbox" device and incorporates them into special cosmos-sdk transaction messages. It then submits these txns to an external helper program (ag-cosmos-helper) for signing and broadcast to the chain, via one of the configured RPC ports.

The current approach has several limitations:

  • all outbound messages wait for their txn to be included in a block: this is not guaranteed to happen quickly, especially if the chain is busy, and the mempool does not completely drain between blocks
    • this appears to include the time to process the kernel run-queue, which includes all vat deliveries triggered by the message
    • if those deliveries include the creation of a new dynamic vat, or the evaluation of significant amounts of code (like contract source), the crank may take 20-30s to finish
      • the helper process times out after 10s and assumes the txn has failed, even though the txn will in fact eventually succeed
      • when the helper retries, it reuses the old sequence number, which is consumed when the original txn succeeds, so the retry is rejected
  • only one outbound helper process is allowed to run at a time (for efficiency, and to avoid reusing seqnums), which limits us to one set of messages in flight per block
  • the code that prepares message sets for transmission does not catch up fast enough: each message set is frozen once a message is generated and the 1s Nagle timer expires, and by the time that set makes it through the queue (one set per block), it does not include the messages that accumulated behind it
    • the consequence is that a busy ag-solo (e.g. being driven by a load-generator that produces a new cycle before the previous one has been completely retired, 20-30s) never catches up, and the queue depth increases without bound
    • each txn should include all messages that are waiting to be delivered to the chain, or at least all that can fit in the maximum txn size
  • the code doesn't limit the txn size at all; if a lot of messages were waiting to go out, they could exceed the maximum txn size and fail, when it really ought to send only the oldest messages that fit each time

Description of the Design

The swingset mailbox device is designed to manage hangover-inconsistency prevention, and bridge the gap between the swingset+host's atomicity domain, and the message-delivery IO channel needed to deliver those messages. This mailbox holds a set of numbered messages that want to be delivered to each known remote system, as well as an ack number (the highest inbound sequence number we've seen from that system: we're telling them it's safe to stop publishing all lower-numbered messages because we've safely processed them already). At any given time, the kernel wants these messages to be made available to the remote system.
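As a rough illustration of that per-remote state (the real structure lives in the swingset mailbox device; the field and function names here are hypothetical):

```js
// Hypothetical sketch of the per-remote mailbox state described above.
// The real structure is owned by the swingset mailbox device; the names
// here are illustrative only.
const mailboxStateForRemote = {
  // Outbound messages not yet acked by the remote, keyed by our outbound
  // sequence number: a growing "pool", not a drain-in-order queue.
  outbox: new Map([
    [17, 'serialized message body 17'],
    [18, 'serialized message body 18'],
  ]),
  // Highest inbound sequence number we have fully processed; publishing this
  // tells the remote it can stop re-sending messages numbered <= 42.
  inboundAck: 42,
};

// When the remote acks up to N, every outbox entry numbered <= N can be dropped.
function applyAck(state, ackedUpTo) {
  for (const seqnum of [...state.outbox.keys()]) {
    if (seqnum <= ackedUpTo) {
      state.outbox.delete(seqnum);
    }
  }
}
```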

The host is responsible for executing a specific sequence of events (a loop sketch follows this list):

  • when the kernel is idle, invoke device calls to submit inbound messages into the kernel
  • then use c.run() to turn the kernel crank until all work is done (or c.step() to limit the amount we do)
  • then commit all DB state: any crash before this point will come back up to the previous commitment point, any crash afterwards will come back up to this commitment point
  • then examine the mailbox and deliver the messages to the remote
    • the host must not allow messages to be released until the DB has committed, to prevent hangover inconsistency
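A minimal sketch of that loop, assuming hypothetical host-provided helpers (deliverInbound, commitState, readMailbox, transmit); only the ordering matters:

```js
// Sketch of the host-side ordering described above. The helper names are
// hypothetical; the essential property is commit-before-release.
async function hostCycle(controller, inboundMessages) {
  // 1. While the kernel is idle, push inbound messages in via the device.
  for (const msg of inboundMessages) {
    deliverInbound(msg); // e.g. a mailbox device invocation
  }
  // 2. Turn the kernel crank until all work is done.
  await controller.run();
  // 3. Commit all DB state; a crash after this point resumes from here.
  await commitState();
  // 4. Only now is it safe to read the outbox and release messages.
  const { outbox, inboundAck } = readMailbox();
  await transmit(outbox, inboundAck);
}
```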

The set of outbound messages grows when kernel activity causes messages to be sent to a remote system, and shrinks when an ACK is received from the remote system. In this sense, the outbound messages form a "pool", more than a queue. There is no "connection" to each remote system to go up and down: swingset works like Waterken.

Traffic to each remote system will be carried over various types of links. When the link is connection-oriented (coming and going as the processes on each end reboot, and as TCP connections are made and broken), the host must manage the mismatch between swingset's style and the TCP style.

When we get around to building a solo-to-solo protocol (#2484), it will need to try to maintain a TLS/TCP connection between the two sides. Each time this connection comes up, the sender should attempt to send all pending messages (because we don't know which previous messages made it through). But as long as the connection remains up, we don't need to send any message twice. We'll need a state machine which tracks the ephemeral connection state and the message pool, sending messages when new ones become available and/or each time the connection comes up.
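A sketch of that sender-side state machine (names are hypothetical; the pool is the set of unacked outbound messages):

```js
// Hypothetical solo-to-solo sender: resend everything on (re)connect, send
// only new messages while the connection stays up, shrink the pool on acks.
function makeSoloSender(send /* called with { seqnum, msg } while connected */) {
  const pool = new Map(); // seqnum -> unacked message
  let connected = false;
  return {
    connectionUp() {
      connected = true;
      // We don't know what got through before the reconnect, so resend the pool.
      for (const [seqnum, msg] of pool) send({ seqnum, msg });
    },
    connectionDown() {
      connected = false;
    },
    newOutbound(seqnum, msg) {
      pool.set(seqnum, msg);
      if (connected) send({ seqnum, msg }); // no need to send twice while up
    },
    ackReceived(ackedUpTo) {
      for (const seqnum of [...pool.keys()]) {
        if (seqnum <= ackedUpTo) pool.delete(seqnum);
      }
    },
  };
}
```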

For the solo-to-chain protocol, we don't use a long-running TCP connection. Instead, messages are delivered in signed transactions to the chain (submitted to a fullnode's RPC port). Each txn can contain some maximum size of payload, and has a sequence number (the cosmos-sdk "nonce" field). These txns are delivered to the validators, who perform some preliminary checks (CheckTx), then gossip each txn to the other nodes and add it to their own mempools. Later, when one of these validators becomes the "block proposer" and chooses to include the txn in a block, it is delivered to swingset and processed (which may create outbound messages, including an ACK of the highest-numbered inbound message).

We use a helper process (ag-cosmos-helper) to sign and transmit these transactions. We write all the messages to a temporary file, then invoke the helper. The helper can be run in a mode that waits for the txn to be accepted into the mempool, or a different mode that waits even longer, until the txn is committed in a block. The helper might signal an error if the RPC port is unreachable, or if the txn is rejected for some reason (too big, some gas limitation, mempool is full, sequence number doesn't look right).
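In outline, the invocation looks something like the following (a hedged sketch: the real subcommand, flags, and tempfile handling live in chain-cosmos-sdk.js and are not reproduced here):

```js
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileP = promisify(execFile);

// Hypothetical wrapper around the ag-cosmos-helper invocation. Only the
// outcomes that the sending state machine must distinguish are modeled.
async function invokeHelper(helperArgs) {
  try {
    const { stdout } = await execFileP('ag-cosmos-helper', helperArgs);
    return { ok: true, output: stdout };
  } catch (err) {
    // RPC port unreachable, txn rejected (too big, gas, full mempool, bad
    // seqnum), or a timeout while waiting for inclusion: the caller decides
    // whether to retry, and with which sequence number.
    return { ok: false, error: err };
  }
}
```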

We'll need a state machine that looks something like this:

[diagram: solo-to-chain-send state machine]

I'm not sure what the kernel/mailbox API for this ought to be. I want something that comfortably satisfies this state machine, and also makes it easy to write TCP-ish connections like the solo-to-solo protocol. I strongly suspect we should implement #720 (the kernel input queue) first, to simplify the input side of the connectors. And we'll need some sort of interface to let host code know when it is safe to look at the outbound mailbox contents. A simple Promise (which purports to resolve when the kernel is idle) won't cut it: the kernel might be activated after it resolves that promise but before the host's callback gets to run. It will be better to use a callback interface instead, whose contract is to run while the kernel is idle and the mailbox is safe to read until the end of that turn. The host's outbox callback can poll and read the outbox immediately, but schedule the actual data transmission for a future turn (the kernel might start running again by that point, but the data it captured will still be correct/coherent).
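One possible shape for that interface, purely to make the timing contract concrete (registerOutboxCallback is not an existing kernel API, and readMailbox/transmit are hypothetical host helpers):

```js
// Hypothetical callback contract: the kernel invokes onQuiescent() only while
// it is idle, and the mailbox is guaranteed stable until the end of that turn.
registerOutboxCallback(controller, function onQuiescent() {
  // Read synchronously, in this turn, while the snapshot is coherent.
  const snapshot = readMailbox();
  // Transmit in a later turn; the kernel may be running again by then, but
  // the captured snapshot remains internally consistent.
  setImmediate(() => transmit(snapshot));
});
```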

Security Considerations

Messages must be delivered without corruption (otherwise forgeries could happen), but the main dangers of bugs in this implementation are messages being dropped or messages being delivered too many times. Since the mailbox protocol uses sequence numbers and deduplication, dropped messages are likely to cause communications to halt, while duplicate deliveries should be silently handled correctly (at a slight performance cost for the extra, unused traffic).
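For example, inbound deduplication falls out of the sequence numbers (a sketch, not the actual mailbox code):

```js
// Sketch of why duplicate deliveries are harmless: each remote's inbound
// messages carry sequence numbers, and anything at or below the
// highest-processed number is silently ignored.
function makeInboundProcessor(deliverToKernel) {
  let highestProcessed = 0;
  return function processInbound(messages /* [[seqnum, body], ...] */) {
    for (const [seqnum, body] of messages) {
      if (seqnum <= highestProcessed) continue; // duplicate: ignore
      deliverToKernel(seqnum, body);
      highestProcessed = seqnum;
    }
  };
}
```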

Test Plan

The state machine should be unit tested in isolation. I'd like a larger integration test that uses a mock ag-cosmos-helper invocation function (no actual subprocesses) to exercise the various failure cases.
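Something like the following (an AVA-style sketch; makeChainSender is a hypothetical factory that accepts an injected helper-invocation function, so no subprocess is spawned):

```js
import test from 'ava';

// makeChainSender(invokeHelper) is hypothetical: it would wrap the
// solo-to-chain state machine around an injected helper-invocation function.
test('retries after a helper failure without losing the message', async t => {
  const calls = [];
  let failFirst = true;
  const mockInvokeHelper = async args => {
    calls.push(args);
    if (failFirst) {
      failFirst = false;
      return { ok: false, error: new Error('timed out') };
    }
    return { ok: true, output: '' };
  };
  const sender = makeChainSender(mockInvokeHelper);
  await sender.deliver([[1, 'msg1']], 0);
  t.is(calls.length, 2); // one failed attempt, one successful retry
});
```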

warner added the enhancement and cosmic-swingset labels on Apr 11, 2021
warner commented Apr 12, 2021

One additional limitation: if the set of outgoing messages is larger than will fit in a maximum-size transaction, we must only include the prefix of messages that do fit. We need to track which messages have been submitted and which have not. We can send more than one txn per block if we manage the seqnums more directly, which requires speculating about which messages will get in and which will not. There will be a tradeoff between complexity, into-chain bandwidth utilization, and latency.
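A sketch of that prefix selection (the byte limit and size estimate are illustrative; the real limit would come from the chain's txn parameters):

```js
// Take the oldest messages that fit within a maximum transaction payload.
function selectPrefix(outbox /* Map: seqnum -> string */, maxBytes) {
  const selected = [];
  let total = 0;
  const seqnums = [...outbox.keys()].sort((a, b) => a - b);
  for (const seqnum of seqnums) {
    const body = outbox.get(seqnum);
    const size = Buffer.byteLength(body, 'utf8');
    if (total + size > maxBytes) break; // stop at the first message that won't fit
    selected.push([seqnum, body]);
    total += size;
  }
  return selected; // the remainder stays behind for a later txn
}
```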

When we get to gas limits and some kind of queueing fees, that will become a limitation too. @michaelfig likes the idea of a maximum number of pending txns from any given sender (a sort of token-passing scheme where you get the token back when the message is accepted into a block). I'm less optimistic, as the limit on traffic then depends upon the independence of (and limits on the number of) clients. A purely fee-driven approach seems the most sound to me, but of course we need a sensible way to denominate, collect, and distribute those fees.

warner commented Apr 12, 2021

@dtribble reminds us to be on the lookout for accidentally-quadratic behavior. In particular, if the mailbox device produces a list of not-yet-acked messages, which grows over time if the target is not able to keep up, and a periodic process needs to copy some portion of this list somewhere (a txn for the solo-to-chain direction, or a published/provable part of the chain state vector for the chain-to-solo direction), that has the potential to consume O(n^2) time.
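To make the hazard concrete: if the backlog grows by one message per block and each block copies the entire backlog, the total copying over n blocks is 1 + 2 + ... + n, i.e. O(n^2). A toy illustration:

```js
// Toy illustration of the accidentally-quadratic pattern: copying the whole
// unacked list each block while the list grows faster than it drains.
let copies = 0;
const unacked = [];
const blocks = 1000;
for (let block = 0; block < blocks; block += 1) {
  unacked.push(`msg-${block}`); // backlog grows by one per block
  const txnPayload = [...unacked]; // copy the entire list into this block's txn
  copies += txnPayload.length;
}
// copies === blocks * (blocks + 1) / 2, so the work grows quadratically.
```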

warner added a commit that referenced this issue Apr 14, 2021
This is a horrible hack that should not land on trunk.

The solo-to-chain delivery pathway has a bug in which heavy-ish traffic (a
load generator which emits messages faster than once per block or two) causes
a queue to build up without bound, causing message delivery to fall further
and further behind. Each helper invocation sends a few more messages, but
does not send *all* the remaining messages. Given enough traffic, this can
lead to a queue that takes days to drain, when it could really be processed
in under a minute.

We need to overhaul this delivery pathway (#2855). This patch is a quick and
dirty hack to get a load-generator running faster than one cycle per 20s. It
doesn't drain the queue any faster, but each time it invokes the helper to
send a few messages, it calls back into the mailbox to copy *all* the current
messages out. The helper arguments (including the tempfile name) are already
fixed and recorded in the queue, but this hack just replaces the *contents*
of the tempfile just before invoking the helper.

We still have a queue that takes days to drain, but once the load generation
frontend is turned off, the actual new messages are delivered within a block
or two, and the remaining day of activity consists entirely of empty or
duplicate messages, which the chain then correctly ignores.
dckc commented Jul 7, 2021

see also VatTP over IBC #1670

p.s. this ticket should show up in searches for "deliverator"

JimLarson self-assigned this Jul 12, 2021
warner commented Jul 14, 2021

@michaelfig 's #3452 design made me recognize a different way of thinking about the swingset-to-outside-world delivery path that I wanted to capture. Previously, I treated swingset (and specifically the mailbox state object) as the sole source of truth for pending/unacked outbound messages. That's still true in the long run, but we can consider a "delivery agent" who is guaranteed to remain alive for the full duration of the kernel process. That shared lifetime makes it safe to delegate responsibility for delivery of any given message to the agent.

After a controller.run(), the mailbox-state scanner notices that there are new outbound messages to send. It hands them to the delivery agent with the instructions "make sure this gets to the target, don't give up until I tell you they are acked, but if you die before that point, that's ok, because I die too, and when I wake up again, I'll tell you to deliver them again". The agent now has responsibility until either (1) the state scanner observes that these messages are missing from the kernel's outbound set, or (2) the process terminates. The kernel+state-scanner can use a simple stream of outbound messages, like #3452 does, and the delivery agent does not need to re-poll the kernel after each delivery attempt completes.
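A sketch of that division of responsibility (makeDeliveryAgent and attemptDelivery are hypothetical names):

```js
// Hypothetical delivery agent: it owns retries for messages it has been
// handed, and forgets them when the mailbox scanner reports them acked.
// It shares the kernel process's lifetime, so losing its in-memory state on
// a crash is fine: the scanner re-hands the messages when the process restarts.
function makeDeliveryAgent(attemptDelivery /* async ([[seqnum, msg], ...]) => boolean */) {
  const pending = new Map(); // seqnum -> message
  let running = false;
  async function drain() {
    if (running) return;
    running = true;
    while (pending.size > 0) {
      const ok = await attemptDelivery([...pending.entries()]);
      // Duplicates are harmless, so re-sending after a pause is safe either
      // way; entries leave `pending` only when acked() removes them.
      await new Promise(resolve => setTimeout(resolve, ok ? 5000 : 1000));
    }
    running = false;
  }
  return {
    accept(messages) {
      for (const [seqnum, msg] of messages) pending.set(seqnum, msg);
      void drain();
    },
    acked(upTo) {
      for (const seqnum of [...pending.keys()]) {
        if (seqnum <= upTo) pending.delete(seqnum);
      }
    },
  };
}
```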

@michaelfig oh, this makes me realize a problem with #3452: if the message is accepted by the RPC server, does that mean it's really accepted into swingset? I think the answer is no, if the mempool is full, or if the message gets kicked out of the mempool somehow. And if that happens to messages 1+2, what happens to messages 3+4 that follow them? Our code is probably good enough for now (and certainly it wasn't any better before), but once our backpressure story is mature enough for messages to get dropped outside the visibility of the deliverator, we need a way to recover from there.

Since we're planning to leave this ticket open until we have a txn size limit in place, let's also use it to track this dropped-after-delivery situation.

@michaelfig
if the message is accepted by the RPC server, does that mean it's really accepted into swingset?

Yes, eventually. If it wasn't, then that would be a bug in Tendermint.

I think the answer is no, if the mempool is full,

A full mempool makes the RPC node return an error, even with the default --broadcast-mode=sync that #3452 is using.

or if the message gets kicked out of the mempool somehow.

The ability to kick a tx out of the mempool is not implemented by our chain right now, and I maintain it's a really bad idea to start (even if it were possible without deep Tendermint changes). It breaks the simple and robust end-to-end "store-and-forward" architecture of all our existing software, and leads to communication failure modes that are undetectable by the ag-solo.

I'm quite sure that it is possible to implement correct backpressure without relying on the "silently kick out" misfeature.

dckc commented Oct 27, 2021

@michaelfig tells me the remaining work in this area is best tracked in other issues.

dckc closed this as completed Oct 27, 2021