Some load & peer notification queue #6261
Probably a duplicate of #5481?
The issue here is that when we try to send too many transactions per second, the notifications queue quickly becomes full. This is not limited by bandwidth, though, but rather by our architecture.
Alternatively, we could accumulate transactions and delay sending them out. This has the downside of introducing additional latency.
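A minimal sketch of what "accumulate and delay" could look like, using only `std` channels; `MAX_BATCH`, `FLUSH_INTERVAL`, and `propagate` are made-up names for illustration, not Substrate APIs. The worst-case added latency is bounded by the flush interval:

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

// Hypothetical tuning knobs, not real Substrate constants.
const MAX_BATCH: usize = 64;
const FLUSH_INTERVAL: Duration = Duration::from_millis(50);

// Stand-in for sending one notification covering the whole batch.
fn propagate(batch: &[u64]) {
    println!("propagating {} transactions in one notification", batch.len());
}

fn batcher(rx: mpsc::Receiver<u64>) {
    let mut batch = Vec::with_capacity(MAX_BATCH);
    loop {
        match rx.recv_timeout(FLUSH_INTERVAL) {
            Ok(tx_hash) => {
                batch.push(tx_hash);
                // Flush early once the batch is full.
                if batch.len() >= MAX_BATCH {
                    propagate(&batch);
                    batch.clear();
                }
            }
            // Timer fired: flush whatever accumulated so far. This is the
            // added latency mentioned above, bounded by FLUSH_INTERVAL.
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    propagate(&batch);
                    batch.clear();
                }
            }
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    propagate(&batch);
                }
                break;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    std::thread::spawn(move || {
        for i in 0..200u64 {
            tx.send(i).unwrap();
        }
    });
    batcher(rx);
}
```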
Batching sounds good; I was also discussing batch-verification with @NikVolf the other day. Currently we call into the runtime to validate a single transaction (via […]).
Not a fan of batching at the RPC level just to make this work. This is not how people generally use the node.
AFAIU the RPC batching is just going to be an option, not a requirement. I.e. currently you can ONLY submit transactions one-by-one, and we could add one more method that allows sending a batch of transactions.
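For illustration, the optional batch method could sit alongside the existing single-submission call. The names below are hypothetical, not the actual author RPC API:

```rust
// Hypothetical shape of the optional batch method discussed above.
// The existing one-by-one call stays as-is; batching is opt-in.
type Bytes = Vec<u8>;
type Hash = [u8; 32];

trait AuthorRpc {
    /// Existing single-transaction submission.
    fn submit_extrinsic(&self, extrinsic: Bytes) -> Result<Hash, String>;

    /// Proposed addition: submit many extrinsics in one call, so the node
    /// can validate and announce them as a batch.
    fn submit_extrinsics_batch(&self, extrinsics: Vec<Bytes>) -> Result<Vec<Hash>, String>;
}
```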
I don't understand the batching solution. Right now we have a buffer that can accept […]. If I understand correctly, with the batching solution we would still have […].
I mean, if fixing this problem requires the user to batch transactions themselves before invoking the RPC, that is not a proper solution to me 🤷♀️
Batching is preferable to just increasing the size of the buffer.
I guess the RPC handler could read the incoming HTTP or WS stream and batch messages internally as long as there are more calls of the same type in the stream, but I don't know if our RPC crate allows that. Batching could also be done at the transaction pool level. I.e. the transaction pool has a queue of transactions that are pending signature check and validation. When a single transaction is validated, it checks if the queue is empty before sending notifications. If not, it would wait for a certain number of transactions to be validated (see the sketch below).
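A rough sketch of that pool-level idea, assuming hypothetical `Pool`, `validate`, and `notify_peers` names rather than the real transaction-pool API: announce only when the pending queue has drained, or after a bounded number of validated transactions, whichever comes first.

```rust
use std::collections::VecDeque;

// Illustrative threshold, not a real Substrate constant.
const MAX_UNANNOUNCED: usize = 32;

struct Pool {
    pending: VecDeque<Vec<u8>>, // transactions awaiting validation
    validated: Vec<Vec<u8>>,    // validated but not yet announced
}

impl Pool {
    // Stand-in for the runtime validity check.
    fn validate(&self, _tx: &[u8]) -> bool {
        true
    }

    // Stand-in for one peer notification covering the whole batch.
    fn notify_peers(&mut self) {
        println!("announcing {} transactions at once", self.validated.len());
        self.validated.clear();
    }

    fn process(&mut self) {
        while let Some(tx) = self.pending.pop_front() {
            if self.validate(&tx) {
                self.validated.push(tx);
            }
            // Announce when the queue has drained, or when enough
            // transactions have piled up to justify a batch.
            if (self.pending.is_empty() || self.validated.len() >= MAX_UNANNOUNCED)
                && !self.validated.is_empty()
            {
                self.notify_peers();
            }
        }
    }
}
```

This keeps the one-notification-per-transaction behaviour when the node is idle and degrades to batched announcements only under load.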
Can we not just test #6080, which is a two-line change, instead of complicating everything and designing new things again?
#6080 is a workaround for a bad design. It can be used for doing demos and temporarily pushing up throughput, but shouldn't be considered an "alternative solution". This needs fixing properly.
I agree on the "bad design" if you're talking about #5481. I completely disagree if you're referring to our channels system. There are two problems in play:
Solutions such as "batching messages" fall into the category of "improving the performance", but improving the performance is already a work-around for #5481. #6080 is also in the category of "improving the performance", and I still don't understand why it's being blocked.
I agree that #5481 should be sorted. However, it's not clear to me that batching transactions isn't a plainly sensible optimisation. Arbitrarily increasing queue sizes much less so.
@tomaka would #6080 really help here? I thought this const would have to be increased. My main concern regarding #6080 is that it may increase memory usage significantly when there are a lot of peers and the node is overloaded with gossip messages. It is not that much of a problem for transactions, since there aren't many of them on the real network and we only re-propagate after validation, making it harder to spam the whole network through a single connection. I'm not sure this is the case for consensus gossip, though.
I'm not sure of what I'm saying (which is why I'm advocating for trying, rather than affirming that this is a fix), but I think that the small buffer size that #6080 touches causes the network worker to sleep, during which messages accumulate. I haven't mentioned that in the issue, but I'm indeed working on bypassing that network worker channel for notifications (i.e. GrandPa would push a message to a channel that is processed by the background task dedicated to the node, and wouldn't go through the network worker task at all).
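A toy illustration of that bypass, with threads standing in for async tasks and everything here (`PeerId`, the channel-per-peer map) simplified for the example; it is not the actual Substrate/libp2p design:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type PeerId = u32;
type Notification = Vec<u8>;

struct Peers {
    // One sender per peer instead of a single shared worker channel.
    senders: HashMap<PeerId, mpsc::Sender<Notification>>,
}

impl Peers {
    fn spawn_peer(&mut self, id: PeerId) {
        let (tx, rx) = mpsc::channel::<Notification>();
        self.senders.insert(id, tx);
        // Dedicated per-connection task: a slow peer only backs up its own
        // channel instead of stalling everyone behind a shared worker.
        thread::spawn(move || {
            for note in rx {
                println!("peer {id}: writing {} bytes to the socket", note.len());
            }
        });
    }

    // E.g. a gossip protocol would call this directly, skipping the
    // network worker task entirely.
    fn send(&self, id: PeerId, note: Notification) {
        if let Some(tx) = self.senders.get(&id) {
            let _ = tx.send(note);
        }
    }
}

fn main() {
    let mut peers = Peers { senders: HashMap::new() };
    peers.spawn_peer(1);
    peers.send(1, b"gossip message".to_vec());
    thread::sleep(std::time::Duration::from_millis(10));
}
```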
With a moderate load of transactions on a node, some of them are dropped and not propagated, leading to some of the rest being invalidated (since the validity of a single transaction may depend on some other transaction).
The load is not high at all, just ~2000 tx/sec, and there are only 3 peers.
These are probably the related log messages: […]