Cross-Shard Congestion Control #539
Conversation
It's time to get first feedback from engineers outside the focus group.
A first draft of "the story behind" is also available: https://github.com/near/nearcore/blob/master/docs/architecture/how/receipt-congestion.md
While the NEP focuses on specifying the proposed changes, the story behind explains our thought process for why these changes lead to the desired consequences.
Generally looks good. Some high-level thoughts.
A summary of my understanding is that each shard is going to advertise how much queue space it has available, and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?
If so, then my question is about fairness and, relatedly, load balancing. The two cases that I am thinking of are:
- Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
- Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?
Yes, that sounds exactly right.
We don't give any guarantees about fairness. We hope that backpressure measures reduce incoming transactions sharply enough that congestion resolves quickly and everyone can send again. But yes, the linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make.
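To make the interpolation concrete, here is a minimal sketch in Rust; the constants and the function name are hypothetical, not parameters from the NEP:

```rust
// Hypothetical floor/ceiling for how much gas a sender may forward per chunk.
const MIN_OUTGOING_GAS: u64 = 1_000_000_000_000; // when receiver is fully congested
const MAX_OUTGOING_GAS: u64 = 300_000_000_000_000; // when receiver is not congested

/// `congestion` is the receiver's advertised congestion level in [0.0, 1.0].
/// The allowed outgoing gas shrinks linearly as congestion rises.
fn outgoing_gas_limit(congestion: f64) -> u64 {
    let congestion = congestion.clamp(0.0, 1.0);
    let range = (MAX_OUTGOING_GAS - MIN_OUTGOING_GAS) as f64;
    MIN_OUTGOING_GAS + (range * (1.0 - congestion)) as u64
}
```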
a few nits, typos and such
Co-authored-by: wacban <wacban@users.noreply.github.com>
Generally happy with your responses here. One other approach I have seen (and implemented in the past) to guarantee fairness is some sort of credit-based queuing. This lets a receiving entity decide at a fine grain how much of its queue it wants to dedicate to each sender. It is natural to use this mechanism to implement fair sharing or arbitrary types of prioritisation as well (e.g. one shard is able to send 2x more than another). The drawback of course is more state tracking and a more complex implementation. So I'm happy with the proposed approach.
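For reference, a minimal sketch of the credit-based queuing idea mentioned above, with hypothetical types and names (not part of the proposal):

```rust
use std::collections::HashMap;

// The receiver grants each sending shard a gas budget for its queue; skewing
// the grants implements fair sharing or arbitrary prioritisation.
struct CreditQueue {
    credits: HashMap<u64, u64>, // sender shard id -> remaining gas credit
}

impl CreditQueue {
    /// Grant `amount` of gas credit to a sender, e.g. an equal share of freed
    /// queue space for fairness, or a 2x share for a prioritised shard.
    fn grant(&mut self, sender: u64, amount: u64) {
        *self.credits.entry(sender).or_insert(0) += amount;
    }

    /// A sender may only enqueue a receipt if it has credit left for its cost.
    fn try_consume(&mut self, sender: u64, gas_cost: u64) -> bool {
        match self.credits.get_mut(&sender) {
            Some(c) if *c >= gas_cost => {
                *c -= gas_cost;
                true
            }
            _ => false,
        }
    }
}
```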
neps/nep-0539.md (outdated)

> We store the outgoing buffered receipts in the trie, similar to delayed receipts
> but in their own separate column. Therefore we add a trie column
> `BUFFERED_RECEIPT_OR_INDICES: u8 = 13;`. Then we read and write analogously to the …
Minor detail: I know we use this pattern for `DELAYED_RECEIPT_OR_INDICES`, but it seems to be that way for historical reasons (see commit message here). For this new queue it would be clearer to have separate `BUFFERED_RECEIPT` and `BUFFERED_RECEIPT_INDICES` columns.
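For concreteness, the suggested two-column layout might look like this; the numeric values are illustrative only:

```rust
// Sketch of the reviewer's suggestion: two dedicated trie columns instead of
// the combined *_OR_INDICES pattern. Values are illustrative, not final.
const BUFFERED_RECEIPT: u8 = 13; // one trie entry per buffered receipt
const BUFFERED_RECEIPT_INDICES: u8 = 14; // first/next index per outgoing queue
```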
Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible, i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, it is the case that without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard). Does the proposed solution handle such scenarios? Is the filter operation, as defined, going to apply to the receipts created above?
The filter operation is only applicable to transactions, not to receipts. Once receipts are created, we commit to execute them. The described situation is indeed problematic. Of course, that's exactly what backpressure is for. If shard 3 becomes congested, shards 1 and 2 can still create receipts for shard 3, but they are forced to keep them in their outgoing buffers before forwarding. This way, shard 3 is protected from additional inflow. Eventually, shards 1 and 2 may also become congested and the backpressure spreads further out to all shards trying to send something to them. Eventually all shards are congested and no new transactions are accepted anywhere. Unfortunately, it is still not handled perfectly. We only apply backpressure based on incoming congestion, to avoid deadlocks. But if we are able to handle incoming receipts quickly, it is possible that shard 1 keeps filling its outgoing buffer for shard 2, growing it faster than it can forward receipts from it. But because the incoming queue is always empty, it does not apply backpressure. (cc @wacban we should probably simulate with the latest changes that decouple incoming and outgoing congestion to see how bad this can become.)
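A minimal sketch of this backpressure rule, with hypothetical types and names:

```rust
// Receipts destined for a congested shard are held back in the sender's
// outgoing buffer instead of being forwarded, protecting the target's inflow.
struct Receipt {
    target_shard: u64,
}

struct Shard {
    outgoing_buffer: Vec<Receipt>, // receipts held back for congested targets
}

impl Shard {
    fn route_receipt(&mut self, receipt: Receipt, is_congested: impl Fn(u64) -> bool) {
        if is_congested(receipt.target_shard) {
            // We still commit to execute it later; we just delay forwarding.
            self.outgoing_buffer.push(receipt);
        } else {
            forward(receipt); // send to the target shard's incoming queue
        }
    }
}

fn forward(_receipt: Receipt) { /* network send, elided */ }
```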
I think I understand the high-level explanation. The drawback is that in the worst case, due to one shard not keeping up, it is possible that the entire network has to stop accepting new transactions. I am still happy with this solution and see it as a very good next step to build. Once built, I can imagine further refinements where we can address such cases as well.
If I understand correctly this could be implemented by splitting the delayed receipts queue into one queue per sending shard and then implementing some fair way to pull receipts from this set of queues. This makes sense but I would rather keep this NEP in the current simpler form and work on top of it in follow ups. The good news is that as far as I can tell the current proposal should be easily extendable to what you're suggesting.
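A sketch of that possible follow-up, with hypothetical types: one delayed-receipt queue per sending shard, drained round-robin so no single sender can starve the others:

```rust
use std::collections::VecDeque;

// One queue per sending shard, with a round-robin cursor for fair draining.
struct DelayedReceipts<R> {
    per_sender: Vec<VecDeque<R>>, // index = sending shard id
    next: usize,                  // round-robin cursor
}

impl<R> DelayedReceipts<R> {
    /// Pop the next receipt, cycling through senders so each gets a turn.
    fn pop_fair(&mut self) -> Option<R> {
        for _ in 0..self.per_sender.len() {
            let i = self.next;
            self.next = (self.next + 1) % self.per_sender.len();
            if let Some(r) = self.per_sender[i].pop_front() {
                return Some(r);
            }
        }
        None // all queues are empty
    }
}
```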
That is correct. Just to add a detail to it: each shard will advertise two numbers, one representing the fullness of the "outgoing queues" and one representing the fullness of the "incoming queue". Those two types of congestion are treated differently, which allows us to better adapt the measures to the specific workload that the network is under.
@wacban: perfect, sounds like a solid plan to me. I am always happy to build incrementally.
This document describes a few fundamental congestion control problems and ideas to solve them. The added page serves as a secondary document to the [NEP](near/NEPs#539), summarising the thought process behind the most important design decisions. But it is generally applicable to congestion in Near Protocol's receipt execution system as it works today. It can even serve as documentation for how congestion can occur today.

The document includes 8 graphs generated using [graphviz](https://graphviz.org/). To regenerate after modifying the `*.dot` files, install the graphviz toolbox (on systems with apt: `sudo apt install graphviz`) and then run `dot -Tsvg img_name.dot > img_name.svg`.

Co-authored-by: wacban <wacban@users.noreply.github.com>
- The formulas in the pseudo code were opposite to the description; fix it by swapping incoming and general congestion.
- "General" congestion is a bad name; change it to "Memory" congestion.
- Add a sentence of motivation to the pseudo code snippets for extra explanation.
- Add a TODO for the unbounded queue problem.
Co-authored-by: wacban <wacban@users.noreply.github.com>
No link to the actual reference implementation, yet. Just some clarifying text and in-place code.
I think it's better to keep it simple. While it could be useful in the future to look at guaranteed-to-be-burnt gas and attached gas separately for congestion, our current strategy does not do so.
I implemented the model of the strategy proposed in the NEP. I am now analysing different workloads to make sure that the strategy can handle them well. I will be sharing results and suggestions here as I progress.

**AllToOne workload.** In this workload all shards send direct transactions to a single shard that becomes congested. The strategy does a rather bad job at dealing with this workload, as the outgoing buffers grow in gas without a reasonable limit. The memory limit is never exceeded because the receipts are small, but the number and gas of receipts grow beyond acceptable values. The reason is that the current proposal does not take the gas accumulated in outgoing buffers into account. My suggestion would be to replace memory congestion with:

```rust
ShardChunkHeaderInnerV3 {
    // as is
    incoming_congestion: u16,
    // memory -> general
    general_congestion: u16,
}
```
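One way such a combined indicator could be derived, sketched with illustrative limits (not values from the NEP or the model):

```rust
// "General" congestion takes the worst of the memory-based signal and the gas
// accumulated in outgoing buffers, encoded as a u16 fraction for the header.
fn general_congestion(memory_used_bytes: u64, buffered_gas: u128) -> u16 {
    const MEMORY_LIMIT: u64 = 500_000_000; // illustrative limit, bytes
    const BUFFERED_GAS_LIMIT: u128 = 100_000_000_000_000_000; // illustrative, gas
    let mem = memory_used_bytes as f64 / MEMORY_LIMIT as f64;
    let gas = buffered_gas as f64 / BUFFERED_GAS_LIMIT as f64;
    (mem.max(gas).min(1.0) * u16::MAX as f64) as u16
}
```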
I implemented the suggestion in the model and the results are quite good: both the incoming queue and outgoing buffers display bounded, periodic behaviour. In the picture (not reproduced here), each period is characterized by four phases.
We can probably smooth it out further by replacing the hard incoming congestion threshold with linear interpolation. It's not a priority right now so I'll leave it as is.
Correct some typos, grammar issues, and clarify some text.
Address various comments by SME reviewers.
- Fix various grammar errors.
- Remove old names and use only the correct names for variables.
- Start the specification section by introducing important concepts.
Thanks a lot to @Akashin and @robin-near for taking the time to read through our proposal and giving valuable feedback! I really appreciate your expertise in ensuring we end up with the best possible solution to move congestion control one step forward. Sorry about the subpar quality of the grammar, and the writing in general. I thought we had the NEP cleaned up much better, otherwise I wouldn't have asked for SME reviews. I think we rushed a bit too much, as we wanted to get the NEP process started as soon as possible. I have tried my best to fix it up now and added a new section about important concepts. Please, @robin-near, can you take another look? Let me know if something is still not well defined or not written clearly.
Oh, and in the time since the last changes, we added "missed chunks congestion" as an additional indicator. I have added it to the concepts section and to the "Changes to chunk execution" section. It's a bit of a last-minute change, not something we initially wanted to address. But for stateless validation, Near Protocol needs a way to limit incoming receipts even when chunks are missed. This NEP introduces all the required tools to solve that problem, so it seemed worth including. But if preferred by the working group, we could also separate it out as its own NEP that builds on top of congestion control. @wacban, since you spear-headed and implemented this, can you please double-check that I got the details around missed chunk congestion right?
As a working group member, I lean towards approving this proposal. While Near aims to scale such that it can handle the load users place on the network, it is still critical that Near remains usable under all loads. This congestion handling protocol accomplishes this goal while leaving room for transaction prioritization in the future.
One note I would like to make is that front-ends may need to update their retry logic to specifically handle the "transaction rejected due to congestion" error. This should be communicated clearly along with the protocol change that includes congestion control.
As a working group member, I lean towards approving this NEP. It is a major step towards addressing congestion related stability issues and improving the user experience of NEAR.
As a working group member I lean toward approving this proposal.
One observation. Say that an account on shard A wants to interact with a contract on shard B. Shard B is congested, so the transaction will be rejected. A (not so simple) alternative for the user is to route their transaction through a collocated contract on shard C (user -> A -> C -> B). The receipt between C and B will be delayed, but the transaction got in anyway (is there any advantage to this?). Eventually, if everyone keeps doing this, C will get congested, and given there is nothing special about C, if these routing contracts are collocated on every shard and users do this, all shards will eventually get congested due to one app on shard B.
The situation described above exists beyond this proposal. I'm highlighting it since it will continue to exist.
Co-authored-by: Marcelo Fornet <mfornet94@gmail.com>
Co-authored-by: Michael Birch <birchmd8@gmail.com>
@robin-near You wrote that you want to take another look. Note that a WG meeting and the voting on the NEP is scheduled for this Friday. If you have any concerns about the proposal, please raise them as early as possible so they can be incorporated in the decision.
As a working group member I lean toward approving this proposal. I have two meta comments:
High-level overview slides from today's WG call: https://docs.google.com/presentation/d/1zm0zZKnJpfGsj8-yo9tePqxd9CRhicKPcr1dDnePyVk/edit?usp=sharing
Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type `InvalidTxError` and called `ShardCongested` with a single field `shard_id` referencing the congested shard.

## Details

With [cross-shard congestion control](near/NEPs#539) being stabilized soon, we must deal with the case when a shard rejects new transactions. On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected.

This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either it is executed within a bounded time (at most 20 blocks after inclusion) or it is rejected immediately. But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error `ShardCongested` differently from existing errors. The key difference is that the same signed transaction can be sent again and will be accepted once congestion has gone down.
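For wallet and front-end authors, a sketch of the suggested retry behaviour against a hypothetical client API; the error type mirrors the `ShardCongested` variant described above:

```rust
use std::{thread, time::Duration};

// Hypothetical client-side error type; only ShardCongested is safely retryable
// by resubmitting the identical signed transaction bytes.
enum SubmitError {
    ShardCongested { shard_id: u64 },
    Other(String),
}

fn submit_with_retry(
    tx: &[u8],
    send: impl Fn(&[u8]) -> Result<(), SubmitError>,
) -> Result<(), SubmitError> {
    loop {
        match send(tx) {
            Ok(()) => return Ok(()),
            Err(SubmitError::ShardCongested { .. }) => {
                // Congestion is transient: back off and resend the same bytes.
                thread::sleep(Duration::from_secs(2));
            }
            Err(e) => return Err(e), // other errors are not retryable as-is
        }
    }
}
```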
# Feature to stabilize

This PR stabilizes the Congestion Control and Stateless Validation protocol features. They are assigned separate protocol features and the protocol upgrades should be scheduled separately.

# Context

* near/NEPs#539
* near/NEPs#509

# Testing and QA

Those features are well covered in unit, integration and end-to-end tests and were extensively tested in forknet and statelessnet.

# Checklist

- [x] Link to nightly nayduck run (`./scripts/nayduck.py`, [docs](https://github.com/near/nearcore/blob/master/nightly/README.md#scheduling-a-run)): https://nayduck.nearone.org/
- [x] Update CHANGELOG.md to include this protocol feature in the `Unreleased` section.
Latest rendered view.
NEP Status (Updated by NEP Moderators)
Status: Approved
Meeting Recording:
https://www.youtube.com/watch?v=O1MOBmxKqhI
Protocol Work Group voting indications:
- Cross-Shard Congestion Control #539 (review)
- Cross-Shard Congestion Control #539 (review)
- Cross-Shard Congestion Control #539 (comment)
- Cross-Shard Congestion Control #539 (comment)