overhaul solo-to-chain txn sending loop #2855
Comments
One additional limitation: if the set of outgoing messages is larger than will fit in a maximum-size transaction, we must only include the prefix of messages that do fit. We need to track which messages have been submitted and which have not. We could send more than one txn per block if we managed the seqnums more directly, but that requires speculating about which messages will get in and which will not. There will be a tradeoff between complexity, into-chain bandwidth utilization, and latency. When we get to gas limits and some kind of queueing fees, those will become a limitation too. @michaelfig likes the idea of a maximum number of pending txns from any given sender (a sort of token-passing scheme where you get the token back when the message is accepted into a block). I'm less optimistic, as the limit on traffic is then conditional on the independence of (and limits on the quantity of) the clients. A purely fee-driven approach seems the most sound to me, but of course we need a sensible way to denominate, collect, and distribute those fees.
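For the prefix-that-fits point, a minimal sketch of the selection logic (the byte budget and message encoding here are placeholders, not the real cosmos-sdk limits):

```js
// Select the longest prefix of pending messages that fits in one txn.
// The budget and the JSON encoding are illustrative assumptions.
const MAX_TX_BYTES = 250 * 1024;

function selectPrefix(messages) {
  const included = [];
  let used = 0;
  for (const [seqnum, body] of messages) {
    const size = JSON.stringify([seqnum, body]).length;
    if (used + size > MAX_TX_BYTES) {
      break; // everything from here on waits for a later txn
    }
    included.push([seqnum, body]);
    used += size;
  }
  return included; // the caller must remember which seqnums were submitted
}
```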
@dtribble reminds us to be on the lookout for accidentally-quadratic behavior. In particular, if the mailbox device produces a list of not-yet-acked messages, which grows over time if the target is not able to keep up, and a periodic process needs to copy some portion of this list somewhere (a txn for the solo-to-chain direction, or a published/provable part of the chain state vector for the chain-to-solo direction), that has the potential to consume O(n^2) time.
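To make that concrete: if the backlog grows by one message per block and each block's periodic process copies the entire not-yet-acked list, then after B blocks we have copied 1 + 2 + … + B = B(B+1)/2 messages in total, so a backlog that persists for 10,000 blocks costs roughly 50 million message copies rather than 10,000.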
This is a horrible hack that should not land on trunk. The solo-to-chain delivery pathway has a bug in which heavy-ish traffic (a load generator which emits messages faster than once per block or two) causes a queue to build up without bound, causing message delivery to fall further and further behind. Each helper invocation sends a few more messages, but does not send *all* the remaining messages. Given enough traffic, this can lead to a queue that takes days to drain, when it could really be processed in under a minute. We need to overhaul this delivery pathway (#2855). This patch is a quick and dirty hack to get a load-generator running faster than one cycle per 20s. It doesn't drain the queue any faster, but each time it invokes the helper to send a few messages, it calls back into the mailbox to copy *all* the current messages out. The helper arguments (including the tempfile name) are already fixed and recorded in the queue, but this hack just replaces the *contents* of the tempfile just before invoking the helper. We still have a queue that takes days to drain, but once the load generation frontend is turned off, the actual new messages are delivered within a block or two, and the remaining day of activity consists entirely of empty or duplicate messages, which the chain then correctly ignores.
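For reference, the hack amounts to something like this (a sketch; the `exportToData()` name and the `{ outbox, ack }` shape are assumptions about the mailbox API):

```js
// Sketch: just before invoking the already-queued helper, overwrite the
// recorded tempfile with *all* currently-pending messages, so one helper
// run carries everything known at that moment.
import fs from 'fs';

function refreshTempfile(tempfileName, mailbox, remoteID) {
  // assumed shape: { outbox: [[seqnum, body], ...], ack: number }
  const { outbox, ack } = mailbox.exportToData()[remoteID];
  fs.writeFileSync(tempfileName, JSON.stringify([outbox, ack]));
}
```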
See also VatTP over IBC (#1670). P.S. this ticket should show up in searches for "deliverator".
@michaelfig 's #3452 design made me recognize a different way of thinking about the swingset-to-outside-world delivery path that I wanted to capture. Previously, I treated swingset (and specifically the mailbox state object) as the sole source of truth for pending/unacked outbound messages. That's still true in the long run, but we can consider a "delivery agent" who is guaranteed to remain alive for the full duration of the kernel process. That shared lifetime makes it safe to delegate responsibility for delivery of any given message to the agent.

@michaelfig oh, this makes me realize a problem with #3452: if the message is accepted by the RPC server, does that mean it's really accepted into swingset? I think the answer is no: the mempool might be full, or the message might get kicked out of the mempool somehow. And if that happens to messages 1+2, what happens to messages 3+4 that follow them? Our code is probably good enough for now (and certainly it wasn't any better before), but when our backpressure story is mature enough for messages to get dropped outside the visibility of the deliverator, we need a way to recover from there. Since we're planning to leave this ticket open until we have a txn size limit in place, let's also use it to track this dropped-after-delivery situation.
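A hedged sketch of the delivery-agent idea, including the give-back-on-error recovery discussed above (all function names here are invented for illustration):

```js
// Sketch: a delivery agent that shares the kernel process's lifetime,
// so the kernel can delegate sole responsibility for each message to it.
// Assumes getOutbox() resolves each time the outbox changes.
async function deliveryAgent({ getOutbox, submitToChain, onError }) {
  const inFlight = new Set(); // seqnums we have taken responsibility for
  for (;;) {
    const { messages } = await getOutbox(); // [[seqnum, body], ...]
    const fresh = messages.filter(([seq]) => !inFlight.has(seq));
    if (!fresh.length) continue;
    fresh.forEach(([seq]) => inFlight.add(seq));
    try {
      await submitToChain(fresh);
    } catch (err) {
      // the dropped-after-delivery case: hand responsibility back so
      // these messages are retried rather than silently lost
      fresh.forEach(([seq]) => inFlight.delete(seq));
      onError(err);
    }
  }
}
```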
Yes, eventually. If it weren't, that would be a bug in Tendermint.
Full mempool makes the RPC node return an error, even with the default
The ability to kick a tx out of the mempool is not implemented by our chain right now, and I maintain it's a really bad idea to start (even if it were possible without deep Tendermint changes). It breaks the simple and robust end-to-end "store-and-forward" architecture of all our existing software and leads to communication failure modes that are undetectable by the ag-solo. I'm quite sure that it is possible to implement correct backpressure without relying on the "silently kick out" misfeature.
@michaelfig tells me the remaining work in this area is best tracked in other issues.
What is the Problem Being Solved?
@michaelfig and I were looking at the code in cosmic-swingset that delivers messages from a solo node to the chain (mostly in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/chain-cosmos-sdk.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/outbound.js). This code extracts outbound message payloads from the swingset "mailbox" device and incorporates them into special cosmos-sdk transaction messages. It then submits these txns to an external helper program (`ag-cosmos-helper`) for signing and broadcast to the chain, via one of the configured RPC ports.

The current approach has several limitations.
Description of the Design
The swingset mailbox device is designed to manage hangover-inconsistency prevention and to bridge the gap between the swingset+host's atomicity domain and the message-delivery IO channel needed to deliver those messages. This mailbox holds a set of numbered messages that want to be delivered to each known remote system, as well as an ack number (the highest inbound sequence number we've seen from that system: we're telling them it's safe to stop retransmitting that message and all lower-numbered ones, because we've safely processed them already). At any given time, the kernel wants these messages to be made available to the remote system.
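A sketch of the per-remote mailbox shape implied by that description (field names are illustrative, not the actual device API):

```js
// Illustrative shape of the mailbox state for one remote system.
const mailboxStateForRemote = {
  outbox: [
    // numbered messages awaiting delivery; entries are retired once the
    // remote's ack number reaches their seqnum
    [37, 'msg body 37'],
    [38, 'msg body 38'],
  ],
  // highest inbound seqnum we have processed: telling the remote this
  // lets it stop republishing its messages numbered 1..38
  inboundAck: 38,
};
```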
The host is responsible for executing a specific sequence of events, including calling `c.run()` to turn the kernel crank until all work is done (or `c.step()` to limit the amount we do).

The set of outbound messages grows when kernel activity causes messages to be sent to a remote system, and shrinks when an ACK is received from the remote system. In this sense, the outbound messages form a "pool" more than a queue. There is no "connection" to each remote system to go up and down: swingset works like Waterken.
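As a hedged sketch of that host sequence in JavaScript (the `io` hooks and method names here are invented for illustration, not the real cosmic-swingset API):

```js
// Sketch of the host's per-cycle sequence: feed inbound messages into
// the mailbox device, crank the kernel, commit both state sets
// atomically, then hand the outbox to the IO layer.
async function hostCycle(controller, mailbox, io) {
  for (const msg of io.receiveInbound()) {
    mailbox.deliverInbound(msg); // duplicates are ignored inside
  }
  await controller.run(); // turn the kernel crank until all work is done
  await io.commitState(); // swingset + host state commit together
  io.publishOutbox(mailbox.exportToData()); // a pool, not a queue
}
```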
Traffic to each remote system will be carried over various types of links. When a link is connection-oriented (coming and going as the processes on each end reboot, as TCP connections are made and broken), the host must manage the mismatch between swingset's connectionless style and the TCP style.
When we get around to building a solo-to-solo protocol (#2484), it will need to try to maintain a TLS/TCP connection between the two sides. Each time this connection comes up, the sender should attempt to send all pending messages (because we don't know which previous messages might have made it through). But as long as the connection remains up, we don't need to send any message twice. We'll need a state machine which tracks the ephemeral connection state and the message pool, sending messages when new ones become available and/or each time the connection comes up.
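A minimal sketch of such a sender for the connection-oriented case, assuming a `pool.messages` list of `[seqnum, body]` pairs and a Node-style socket (all names invented for illustration):

```js
// Sketch: resend everything on (re)connect, then send only new
// messages while the connection stays up.
function makeTcpSender() {
  let conn = null; // null while disconnected
  let sentUpTo = 0; // highest seqnum written to the current connection

  return {
    onConnect(newConn, pool) {
      conn = newConn;
      sentUpTo = 0; // unknown what arrived before: resend the whole pool
      this.onPoolChanged(pool);
    },
    onDisconnect() {
      conn = null;
    },
    onPoolChanged(pool) {
      if (!conn) return; // wait for the next connection
      for (const [seqnum, body] of pool.messages) {
        if (seqnum > sentUpTo) {
          conn.write(`${JSON.stringify([seqnum, body])}\n`);
          sentUpTo = seqnum;
        }
      }
    },
  };
}
```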
For the solo-to-chain protocol, we don't use a long-running TCP connection. Instead, messages are delivered in signed transactions to the chain (submitted to a fullnode's RPC port). Each txn can contain some maximum size of payload, and has a sequence number (the cosmos-sdk "nonce" field). These txns are delivered to the validators, who perform some preliminary checks (`CheckTx`) and then both gossip the txn to everyone else and add it to their own mempool. Later, when one of these validators becomes the "block proposer" and chooses to include the txn in a block, it is delivered to swingset and processed (which may create outbound messages, including an ACK of the highest-numbered inbound message).

We use a helper process (`ag-cosmos-helper`) to sign and transmit these transactions. We write all the messages to a temporary file, then invoke the helper. The helper can be run in a mode that waits for the txn to be accepted into the mempool, or a different mode that waits even longer, until the txn is committed in a block. The helper might signal an error if the RPC port is unreachable, or if the txn is rejected for some reason (too big, some gas limitation, mempool is full, sequence number doesn't look right).

We'll need a state machine that looks something like this:
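(A hedged sketch of one plausible shape, in JavaScript; the `readOutbox`/`invokeHelper` hooks, state names, and retry policy are assumptions for illustration, not the actual design.)

```js
// Sketch: one plausible solo-to-chain sender state machine.
// States: 'idle' (nothing in flight) and 'sending' (helper in flight).
// On every trigger we snapshot the *entire* current outbox, so a single
// helper run always carries everything pending at that moment.
function makeChainSender({ readOutbox, invokeHelper, retryDelayMs = 5000 }) {
  let state = 'idle';

  async function kick() {
    if (state === 'sending') return; // helper already in flight
    state = 'sending';
    for (;;) {
      const snapshot = readOutbox(); // all pending messages + ack
      try {
        await invokeHelper(snapshot); // sign, broadcast, await commit
      } catch (err) {
        // RPC unreachable, mempool full, bad seqnum...: wait and retry
        await new Promise(resolve => setTimeout(resolve, retryDelayMs));
        continue;
      }
      // simplification: assume a successful commit drains the outbox
      if (!readOutbox().messages.length) break;
    }
    state = 'idle';
  }

  return { onOutboxChanged: kick };
}
```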
I'm not sure what the kernel/mailbox API for this ought to be. I want something that comfortably satisfies this state machine, and also makes it easy to write TCP-ish connections like the solo-to-solo protocol. I strongly suspect we should implement #720 (the kernel input queue) first, to simplify the input side of the connectors. And we'll need some sort of interface to let host code know when it is safe to look at the outbound mailbox contents. A simple Promise (which purports to resolve when the kernel is idle) won't cut it: the kernel might be activated after it resolves that promise but before the host's callback gets to run. It will be better to use a callback interface, whose contract is to run while the kernel is idle, with the mailbox remaining safe to read until the end of that turn. The host's outbox callback can poll and read the outbox immediately, but should schedule the actual data transmission for a future turn (the kernel might start running again by that point, but the data the callback captured will still be correct/coherent).
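A sketch of that callback contract (the `kernel.onIdle` registration is a hypothetical name, not an existing swingset API):

```js
// Sketch: the kernel promises to invoke the callback only while it is
// idle; the mailbox stays coherent until the end of that turn, so the
// callback must copy what it needs synchronously and defer the IO.
function registerOutboxCallback(kernel, mailbox, transmit) {
  kernel.onIdle(() => {
    // safe to read: the kernel will not run again during this turn
    const snapshot = mailbox.exportToData();
    // schedule transmission for a later turn; the copied snapshot
    // stays correct even if the kernel starts running again
    Promise.resolve().then(() => transmit(snapshot));
  });
}
```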
Security Considerations
Messages must be delivered without corruption (otherwise forgeries could happen), but the main dangers of bugs in this implementation are messages being dropped or messages being delivered too many times. Since the mailbox protocol uses sequence numbers and deduplication, dropped messages are likely to cause communications to halt, while duplicate deliveries should be handled correctly and silently (at a slight performance cost for the extra, unused traffic).
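For illustration, the dedup rule being relied on is roughly this (a sketch; the real check lives in the mailbox/vattp layer):

```js
// Sketch: inbound dedup by sequence number. A duplicate delivery is
// dropped; a gap (a dropped message) stalls rather than corrupts.
function makeInboundDeduper() {
  let highestSeen = 0;
  return function accept(seqnum, body, deliver) {
    if (seqnum <= highestSeen) {
      return false; // duplicate: already processed, silently ignore
    }
    if (seqnum !== highestSeen + 1) {
      return false; // gap: wait for the missing message (comms halt)
    }
    highestSeen = seqnum;
    deliver(body);
    return true;
  };
}
```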
Test Plan
The state machine should be unit tested in isolation. I'd like a larger integration test that uses a mock `ag-cosmos-helper` invocation function (no actual subprocesses) to exercise the various failure cases.
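A sketch of such a test in AVA style (which this repo uses elsewhere); `makeChainSender` refers to the hypothetical state machine sketched in the design section above:

```js
import test from 'ava';
// makeChainSender: the hypothetical sender sketched in the design section

test('retries after a rejected txn, then drains', async t => {
  let pending = [[1, 'a'], [2, 'b']];
  let calls = 0;
  const invokeHelper = async () => {
    calls += 1;
    if (calls === 1) throw Error('mempool full'); // first attempt fails
    pending = []; // second attempt commits everything
  };
  const sender = makeChainSender({
    readOutbox: () => ({ messages: pending }),
    invokeHelper,
    retryDelayMs: 0,
  });
  await sender.onOutboxChanged();
  t.is(calls, 2); // one failure, one success
  t.deepEqual(pending, []); // the queue is fully drained
});
```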