overhaul solo-to-chain txn sending loop #2855

warner opened this issue Apr 11, 2021 · 6 comments
warner commented Apr 11, 2021

What is the Problem Being Solved?

@michaelfig and I were looking at the code in cosmic-swingset that delivers messages from a solo node to the chain (mostly in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/chain-cosmos-sdk.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/lib/ag-solo/outbound.js). This code extracts outbound message payloads from the swingset "mailbox" device and incorporates them into special cosmos-sdk transaction messages. It then submits these txns to an external helper program (ag-cosmos-helper) for signing and broadcast to the chain, via one of the configured RPC ports.

The current approach has several limitations:

  • all outbound messages wait for their txn to be included in a block: this is not guaranteed to happen quickly, especially if the chain is busy, and the mempool does not completely drain between blocks
    • this appears to include the time to process the kernel run-queue, which includes all vat deliveries triggered by the message
    • if those deliveries include the creation of a new dynamic vat, or the evaluation of significant amounts of code (like contract source), the crank may take 20-30s to finish
      • the helper process times out after 10s and assumes the txn has failed, even though the txn will in fact eventually succeed
      • when the helper retries, it reuses the old sequence number, which is consumed when the original txn succeeds, so the retry is rejected
  • only one outbound helper process is allowed to run at a time (for efficiency, and to avoid reusing seqnums), which limits us to one set of messages in flight per block
  • the code that prepares message sets for transmission does not catch up fast enough: each message set is frozen once a message is generated and the 1s Nagle timer expires, and by the time that set makes it through the queue (one set per block), it does not include the messages that accumulated behind it
    • the consequence is that a busy ag-solo (e.g. being driven by a load-generator that produces a new cycle before the previous one has been completely retired, 20-30s) never catches up, and the queue depth increases without bound
    • each txn should include all messages that are waiting to be delivered to the chain, or at least all that can fit in the maximum txn size
  • the code doesn't limit the txn size at all; if a lot of messages were waiting to go out, they could exceed the maximum txn size and fail, when it really ought to send only the oldest messages that fit each time

Description of the Design

The swingset mailbox device is designed to manage hangover-inconsistency prevention, and bridge the gap between the swingset+host's atomicity domain, and the message-delivery IO channel needed to deliver those messages. This mailbox holds a set of numbered messages that want to be delivered to each known remote system, as well as an ack number (the highest inbound sequence number we've seen from that system: we're telling them it's safe to stop publishing all lower-numbered messages because we've safely processed them already). At any given time, the kernel wants these messages to be made available to the remote system.
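As a rough illustration of that per-remote state (the real structure lives in the swingset mailbox device; the field and function names here are hypothetical):

```js
// Hypothetical sketch of the per-remote mailbox state described above.
// The real structure is owned by the swingset mailbox device; the names
// here are illustrative only.
const mailboxStateForRemote = {
  // Outbound messages not yet acked by the remote, keyed by our outbound
  // sequence number: a growing "pool", not a drain-in-order queue.
  outbox: new Map([
    [17, 'serialized message body 17'],
    [18, 'serialized message body 18'],
  ]),
  // Highest inbound sequence number we have fully processed; publishing this
  // tells the remote it can stop re-sending messages numbered <= 42.
  inboundAck: 42,
};

// When the remote acks up to N, every outbox entry numbered <= N can be dropped.
function applyAck(state, ackedUpTo) {
  for (const seqnum of [...state.outbox.keys()]) {
    if (seqnum <= ackedUpTo) {
      state.outbox.delete(seqnum);
    }
  }
}
```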

The host is responsible for executing a specific sequence of events (a loop sketch follows this list):

  • when the kernel is idle, invoke device calls to submit inbound messages into the kernel
  • then use c.run() to turn the kernel crank until all work is done (or c.step() to limit the amount we do)
  • then commit all DB state: any crash before this point will come back up to the previous commitment point, any crash afterwards will come back up to this commitment point
  • then examine the mailbox and deliver the messages to the remote
    • the host must not allow messages to be released until the DB has committed, to prevent hangover inconsistency
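A minimal sketch of that loop, assuming hypothetical host-provided helpers (deliverInbound, commitState, readMailbox, transmit); only the ordering matters:

```js
// Sketch of the host-side ordering described above. The helper names are
// hypothetical; the essential property is commit-before-release.
async function hostCycle(controller, inboundMessages) {
  // 1. While the kernel is idle, push inbound messages in via the device.
  for (const msg of inboundMessages) {
    deliverInbound(msg); // e.g. a mailbox device invocation
  }
  // 2. Turn the kernel crank until all work is done.
  await controller.run();
  // 3. Commit all DB state; a crash after this point resumes from here.
  await commitState();
  // 4. Only now is it safe to read the outbox and release messages.
  const { outbox, inboundAck } = readMailbox();
  await transmit(outbox, inboundAck);
}
```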

The set of outbound messages grows when kernel activity causes messages to be sent to a remote system, and shrinks when an ACK is received from the remote system. In this sense, the outbound messages form a "pool", more than a queue. There is no "connection" to each remote system to go up and down: swingset works like Waterken.

Traffic to each remote system will be carried over various types of links. When the link is connection-oriented (coming and going as the processes on each end reboot, and as TCP connections are made and broken), the host must manage the mismatch between swingset's style and the TCP style.

When we get around to building a solo-to-solo protocol (#2484), it will need to try to maintain a TLS/TCP connection between the two sides. Each time this connection comes up, the sender should attempt to send all pending messages (because we don't know which previous messages made it through). But as long as the connection remains up, we don't need to send any message twice. We'll need a state machine which tracks the ephemeral connection state and the message pool, sending messages when new ones become available and/or each time the connection comes up.
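A sketch of that sender-side state machine (names are hypothetical; the pool is the set of unacked outbound messages):

```js
// Hypothetical solo-to-solo sender: resend everything on (re)connect, send
// only new messages while the connection stays up, shrink the pool on acks.
function makeSoloSender(send /* called with { seqnum, msg } while connected */) {
  const pool = new Map(); // seqnum -> unacked message
  let connected = false;
  return {
    connectionUp() {
      connected = true;
      // We don't know what got through before the reconnect, so resend the pool.
      for (const [seqnum, msg] of pool) send({ seqnum, msg });
    },
    connectionDown() {
      connected = false;
    },
    newOutbound(seqnum, msg) {
      pool.set(seqnum, msg);
      if (connected) send({ seqnum, msg }); // no need to send twice while up
    },
    ackReceived(ackedUpTo) {
      for (const seqnum of [...pool.keys()]) {
        if (seqnum <= ackedUpTo) pool.delete(seqnum);
      }
    },
  };
}
```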

For the solo-to-chain protocol, we don't use a long-running TCP connection. Instead, messages are delivered in signed transactions to the chain (submitted to a fullnode's RPC port). Each txn can contain some maximum size of payload, and has a sequence number (the cosmos-sdk "nonce" field). These txns are delivered to the validators, who perform some preliminary checks (CheckTx), then gossip each txn to the other nodes and add it to their own mempools. Later, when one of these validators becomes the "block proposer" and chooses to include the txn in a block, it is delivered to swingset and processed (which may create outbound messages, including an ACK of the highest-numbered inbound message).

We use a helper process (ag-cosmos-helper) to sign and transmit these transactions. We write all the messages to a temporary file, then invoke the helper. The helper can be run in a mode that waits for the txn to be accepted into the mempool, or a different mode that waits even longer, until the txn is committed in a block. The helper might signal an error if the RPC port is unreachable, or if the txn is rejected for some reason (too big, some gas limitation, mempool is full, sequence number doesn't look right).
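In outline, the invocation looks something like the following (a hedged sketch: the real subcommand, flags, and tempfile handling live in chain-cosmos-sdk.js and are not reproduced here):

```js
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileP = promisify(execFile);

// Hypothetical wrapper around the ag-cosmos-helper invocation. Only the
// outcomes that the sending state machine must distinguish are modeled.
async function invokeHelper(helperArgs) {
  try {
    const { stdout } = await execFileP('ag-cosmos-helper', helperArgs);
    return { ok: true, output: stdout };
  } catch (err) {
    // RPC port unreachable, txn rejected (too big, gas, full mempool, bad
    // seqnum), or a timeout while waiting for inclusion: the caller decides
    // whether to retry, and with which sequence number.
    return { ok: false, error: err };
  }
}
```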

We'll need a state machine that looks something like this:

[diagram: solo-to-chain-send state machine]

I'm not sure what the kernel/mailbox API for this ought to be. I want something that comfortably satisfies this state machine, and also makes it easy to write TCP-ish connections like the solo-to-solo protocol. I strongly suspect we should implement #720 (the kernel input queue) first, to simplify the input side of the connectors. And we'll need some sort of interface to let host code know when it is safe to look at the outbound mailbox contents. A simple Promise (which purports to resolve when the kernel is idle) won't cut it: the kernel might be activated after it resolves that promise but before the host's callback gets to run. It will be better to use a callback interface instead, whose contract is to run while the kernel is idle and the mailbox is safe to read until the end of that turn. The host's outbox callback can poll and read the outbox immediately, but schedule the actual data transmission for a future turn (the kernel might start running again by that point, but the data it captured will still be correct/coherent).
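One possible shape for that interface, purely to make the timing contract concrete (registerOutboxCallback is not an existing kernel API, and readMailbox/transmit are hypothetical host helpers):

```js
// Hypothetical callback contract: the kernel invokes onQuiescent() only while
// it is idle, and the mailbox is guaranteed stable until the end of that turn.
registerOutboxCallback(controller, function onQuiescent() {
  // Read synchronously, in this turn, while the snapshot is coherent.
  const snapshot = readMailbox();
  // Transmit in a later turn; the kernel may be running again by then, but
  // the captured snapshot remains internally consistent.
  setImmediate(() => transmit(snapshot));
});
```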

Security Considerations

Messages must be delivered without corruption (otherwise forgeries could happen), but the main dangers of bugs in this implementation are messages being dropped or messages being delivered too many times. Since the mailbox protocol uses sequence numbers and deduplication, dropped messages are likely to cause communications to halt, while duplicate deliveries should be silently handled correctly (at a slight performance cost for the extra, unused traffic).
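For example, inbound deduplication falls out of the sequence numbers (a sketch, not the actual mailbox code):

```js
// Sketch of why duplicate deliveries are harmless: each remote's inbound
// messages carry sequence numbers, and anything at or below the
// highest-processed number is silently ignored.
function makeInboundProcessor(deliverToKernel) {
  let highestProcessed = 0;
  return function processInbound(messages /* [[seqnum, body], ...] */) {
    for (const [seqnum, body] of messages) {
      if (seqnum <= highestProcessed) continue; // duplicate: ignore
      deliverToKernel(seqnum, body);
      highestProcessed = seqnum;
    }
  };
}
```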

Test Plan

The state machine should be unit tested in isolation. I'd like a larger integration test that uses a mock ag-cosmos-helper invocation function (no actual subprocesses) to exercise the various failure cases.
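Something like the following (an AVA-style sketch; makeChainSender is a hypothetical factory that accepts an injected helper-invocation function, so no subprocess is spawned):

```js
import test from 'ava';

// makeChainSender(invokeHelper) is hypothetical: it would wrap the
// solo-to-chain state machine around an injected helper-invocation function.
test('retries after a helper failure without losing the message', async t => {
  const calls = [];
  let failFirst = true;
  const mockInvokeHelper = async args => {
    calls.push(args);
    if (failFirst) {
      failFirst = false;
      return { ok: false, error: new Error('timed out') };
    }
    return { ok: true, output: '' };
  };
  const sender = makeChainSender(mockInvokeHelper);
  await sender.deliver([[1, 'msg1']], 0);
  t.is(calls.length, 2); // one failed attempt, one successful retry
});
```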

warner added the enhancement and cosmic-swingset labels on Apr 11, 2021
warner commented Apr 12, 2021

One additional limitation: if the set of outgoing messages is larger than will fit in a maximum-size transaction, we must only include the prefix of messages that do fit. We need to track which messages have been submitted and which have not. We can send more than one txn per block if we manage the seqnums more directly, which requires speculating about which messages will get in and which will not. There will be a tradeoff between complexity, into-chain bandwidth utilization, and latency.
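A sketch of that prefix selection (the byte limit and size estimate are illustrative; the real limit would come from the chain's txn parameters):

```js
// Take the oldest messages that fit within a maximum transaction payload.
function selectPrefix(outbox /* Map: seqnum -> string */, maxBytes) {
  const selected = [];
  let total = 0;
  const seqnums = [...outbox.keys()].sort((a, b) => a - b);
  for (const seqnum of seqnums) {
    const body = outbox.get(seqnum);
    const size = Buffer.byteLength(body, 'utf8');
    if (total + size > maxBytes) break; // stop at the first message that won't fit
    selected.push([seqnum, body]);
    total += size;
  }
  return selected; // the remainder stays behind for a later txn
}
```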

When we get to gas limits and some kind of queueing fees, that will become a limitation too. @michaelfig likes the idea of a maximum number of pending txns from any given sender (a sort of token-passing scheme where you get the token back when the message is accepted into a block). I'm less optimistic, as the limit on traffic then depends upon the independence of (and limits on the number of) clients. A purely fee-driven approach seems the most sound to me, but of course we need a sensible way to denominate, collect, and distribute those fees.

warner commented Apr 12, 2021

@dtribble reminds us to be on the lookout for accidentally-quadratic behavior. In particular, if the mailbox device produces a list of not-yet-acked messages, which grows over time if the target is not able to keep up, and a periodic process needs to copy some portion of this list somewhere (a txn for the solo-to-chain direction, or a published/provable part of the chain state vector for the chain-to-solo direction), that has the potential to consume O(n^2) time.
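To make the hazard concrete: if the backlog grows by one message per block and each block copies the entire backlog, the total copying over n blocks is 1 + 2 + ... + n, i.e. O(n^2). A toy illustration:

```js
// Toy illustration of the accidentally-quadratic pattern: copying the whole
// unacked list each block while the list grows faster than it drains.
let copies = 0;
const unacked = [];
const blocks = 1000;
for (let block = 0; block < blocks; block += 1) {
  unacked.push(`msg-${block}`); // backlog grows by one per block
  const txnPayload = [...unacked]; // copy the entire list into this block's txn
  copies += txnPayload.length;
}
// copies === blocks * (blocks + 1) / 2, so the work grows quadratically.
```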

warner added a commit that referenced this issue Apr 14, 2021
This is a horrible hack that should not land on trunk.

The solo-to-chain delivery pathway has a bug in which heavy-ish traffic (a
load generator which emits messages faster than once per block or two) causes
a queue to build up without bound, causing message delivery to fall further
and further behind. Each helper invocation sends a few more messages, but
does not send *all* the remaining messages. Given enough traffic, this can
lead to a queue that takes days to drain, when it could really be processed
in under a minute.

We need to overhaul this delivery pathway (#2855). This patch is a quick and
dirty hack to get a load-generator running faster than one cycle per 20s. It
doesn't drain the queue any faster, but each time it invokes the helper to
send a few messages, it calls back into the mailbox to copy *all* the current
messages out. The helper arguments (including the tempfile name) are already
fixed and recorded in the queue, but this hack just replaces the *contents*
of the tempfile just before invoking the helper.

We still have a queue that takes days to drain, but once the load generation
frontend is turned off, the actual new messages are delivered within a block
or two, and the remaining day of activity consists entirely of empty or
duplicate messages, which the chain then correctly ignores.
dckc commented Jul 7, 2021

see also VatTP over IBC #1670

p.s. this ticket should show up in searches for "deliverator"

JimLarson self-assigned this Jul 12, 2021
warner commented Jul 14, 2021

@michaelfig 's #3452 design made me recognize a different way of thinking about the swingset-to-outside-world delivery path that I wanted to capture. Previously, I treated swingset (and specifically the mailbox state object) as the sole source of truth for pending/unacked outbound messages. That's still true in the long run, but we can consider a "delivery agent" who is guaranteed to remain alive for the full duration of the kernel process. That shared lifetime makes it safe to delegate responsibility for delivery of any given message to the agent.

After a controller.run(), the mailbox-state scanner notices that there are new outbound messages to send. It hands them to the delivery agent with the instructions "make sure this gets to the target, don't give up until I tell you they are acked, but if you die before that point, that's ok, because I die too, and when I wake up again, I'll tell you to deliver them again". The agent now has responsibility until either (1) the state scanner observes that these messages are missing from the kernel's outbound set, or (2) the process terminates. The kernel+state-scanner can use a simple stream of outbound messages, like #3452 does, and the delivery agent does not need to re-poll the kernel after each delivery attempt completes.
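A sketch of that division of responsibility (makeDeliveryAgent and attemptDelivery are hypothetical names):

```js
// Hypothetical delivery agent: it owns retries for messages it has been
// handed, and forgets them when the mailbox scanner reports them acked.
// It shares the kernel process's lifetime, so losing its in-memory state on
// a crash is fine: the scanner re-hands the messages when the process restarts.
function makeDeliveryAgent(attemptDelivery /* async ([[seqnum, msg], ...]) => boolean */) {
  const pending = new Map(); // seqnum -> message
  let running = false;
  async function drain() {
    if (running) return;
    running = true;
    while (pending.size > 0) {
      const ok = await attemptDelivery([...pending.entries()]);
      // Duplicates are harmless, so re-sending after a pause is safe either
      // way; entries leave `pending` only when acked() removes them.
      await new Promise(resolve => setTimeout(resolve, ok ? 5000 : 1000));
    }
    running = false;
  }
  return {
    accept(messages) {
      for (const [seqnum, msg] of messages) pending.set(seqnum, msg);
      void drain();
    },
    acked(upTo) {
      for (const seqnum of [...pending.keys()]) {
        if (seqnum <= upTo) pending.delete(seqnum);
      }
    },
  };
}
```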

@michaelfig oh, this makes me realize a problem with #3452: if the message is accepted by the RPC server, does that mean it's really accepted into swingset? I think the answer is no, if the mempool is full, or if the message gets kicked out of the mempool somehow. And if that happens to messages 1+2, what happens to messages 3+4 that follow them? Our code is probably good enough for now (and certainly it wasn't any better before), but once our backpressure story is mature enough for messages to get dropped outside the visibility of the deliverator, we need a way to recover from there.

Since we're planning to leave this ticket open until we have a txn size limit in place, let's also use it to track this dropped-after-delivery situation.

@michaelfig
if the message is accepted by the RPC server, does that mean it's really accepted into swingset?

Yes, eventually. If it wasn't, then that would be a bug in Tendermint.

I think the answer is no, if the mempool is full,

A full mempool makes the RPC node return an error, even with the default --broadcast-mode=sync that #3452 is using.

or if the message gets kicked out of the mempool somehow.

The ability to kick a tx out of the mempool is not implemented by our chain right now, and I maintain it's a really bad idea to start (even if it were possible without deep Tendermint changes). It breaks the simple and robust end-to-end "store-and-forward" architecture of all our existing software, and leads to communication failure modes that are undetectable by the ag-solo.

I'm quite sure that it is possible to implement correct backpressure without relying on the "silently kick out" misfeature.

dckc commented Oct 27, 2021

@michaelfig tells me the remaining work in this area is best tracked in other issues.

dckc closed this as completed Oct 27, 2021