This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Some load & peer notification queue #6261

Open
NikVolf opened this issue Jun 5, 2020 · 15 comments

Comments

@NikVolf
Contributor

NikVolf commented Jun 5, 2020

With a moderate transaction load on a node, some transactions are dropped and not propagated, which in turn invalidates some of the remaining ones (since the validity of a single transaction may depend on another transaction).

The load is not particularly high, just ~2000 tx/sec with only 3 peers.

These are probably the related log messages:

2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:26 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:28 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCEKPCMakdQdJ8JFyCMKS23Uqy7MReAPuWg3TRuD3uXc2 is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
2020-06-05 11:22:29 📞 Notifications queue with peer 12D3KooWCoYzb5ivEynuCVnzycziNuWw1vfnchLHSAEsiSgjJwot is full, dropped message (protocol: [47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49])
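For reference, the protocol field in these messages is just the ASCII bytes of the protocol name. A minimal decoding sketch (purely illustrative, not part of the node code):

```rust
fn main() {
    let protocol: &[u8] = &[
        47, 115, 117, 112, 47, 116, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 115, 47, 49,
    ];
    // Prints "/sup/transactions/1", i.e. the transactions notification protocol.
    println!("{}", String::from_utf8_lossy(protocol));
}
```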
@NikVolf changed the title from "Heavy load & peer notification queue" to "Some load & peer notification queue" on Jun 5, 2020
@bkchr
Member

bkchr commented Jun 5, 2020

Probably duplicate of: #5481 ?

@arkpar
Member

arkpar commented Jun 5, 2020

The issue here is that when we try to send too many transactions per second, the notifications queue quickly becomes full. This is not limited by bandwidth, though, but rather by our architecture.
There are two possible solutions here:

  1. Fix #5481 (sc_network::NetworkService::write_notifications should have proper back-pressure).
  2. Introduce transaction batching. Right now each transaction is moved through 3 or 4 async queues on its way from the incoming socket (RPC or peer networking) to the transaction pool and on to the propagation socket. Ideally we should minimize the number of items in these queues by batching transactions together and passing them as a group at each step (see the sketch below this list). Here's a plan to make this work:
    2.1. RPC should allow submitting multiple transactions at once.
    2.2. The libp2p network handler should allow sending and receiving multiple transactions at once. The protocol already supports that, and I changed the API to accept batches in the a-fix-transaction-propagation branch.
    2.3. The transaction pool should process and send notifications for batches, rather than individual transactions.
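A minimal sketch of the interface shape this plan implies (illustrative trait and method names only, not the actual Substrate APIs): every step of the pipeline accepts a group of transactions rather than a single one.

```rust
// Illustrative only: batch-oriented counterparts of the per-transaction entry points
// described in 2.1-2.3. All names here are hypothetical.
type Extrinsic = Vec<u8>;

// 2.1. RPC: allow submitting several extrinsics in one call.
trait AuthorRpc {
    fn submit_extrinsic(&self, xt: Extrinsic);
    fn submit_extrinsic_batch(&self, xts: Vec<Extrinsic>); // hypothetical addition
}

// 2.2. Network handler: send and receive groups of transactions per notification.
trait TransactionsHandler {
    fn propagate_transactions(&self, xts: &[Extrinsic]);
    fn on_transactions_received(&self, xts: Vec<Extrinsic>);
}

// 2.3. Transaction pool: emit one notification per validated batch.
trait PoolNotifications {
    fn notify_ready(&self, hashes: Vec<[u8; 32]>);
}

fn main() {}
```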

Alternatively, we could accumulate transactions and delay sending them out. This has the downside of introducing additional latency.

cc @tomaka @tomusdrw

@tomusdrw
Contributor

tomusdrw commented Jun 5, 2020

Batching sounds good. I was also discussing batch verification with @NikVolf the other day. Currently we call into the runtime to validate a single transaction (via validate_transaction), but it should be trivial to pass a batch to the runtime as well.
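A rough sketch of that idea (simplified types, hypothetical batch method, not the actual runtime API): one runtime call validates a whole batch, amortising the cost of entering the runtime.

```rust
// Simplified, illustrative types; only the per-transaction entry point mirrors the
// existing `validate_transaction` idea, the batch variant is hypothetical.
type Extrinsic = Vec<u8>;

#[derive(Debug)]
enum TransactionValidity {
    Valid,
    Invalid,
}

trait ValidateQueue {
    // Existing style: one runtime call per transaction.
    fn validate_transaction(&self, xt: &Extrinsic) -> TransactionValidity;

    // Hypothetical addition: validate many transactions per runtime call.
    fn validate_transaction_batch(&self, xts: &[Extrinsic]) -> Vec<TransactionValidity> {
        xts.iter().map(|xt| self.validate_transaction(xt)).collect()
    }
}

fn main() {}
```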

@NikVolf
Contributor Author

NikVolf commented Jun 5, 2020

Not a fan of batching at the RPC level just to make this work. This is not how people generally use the node.

@tomusdrw
Contributor

tomusdrw commented Jun 5, 2020

AFAIU the RPC batching is just going to be an option, not a requirement. I.e. Currently you can ONLY submit transactions one-by-one and we could add one more method that allows sending a batch of transactions.

@tomaka
Contributor

tomaka commented Jun 5, 2020

I don't understand the batching solution.

Right now we have a buffer that can accept M items.

If I understand correctly, with the batching solution we would still have M items in the buffer, except that each item would be a Vec of N sub-items.
Why not just increase the size of the buffer to M*N? This is what #6080 does.

@NikVolf
Contributor Author

NikVolf commented Jun 5, 2020

AFAIU the RPC batching is just going to be an option, not a requirement. I.e. Currently you can ONLY submit transactions one-by-one and we could add one more method that allows sending a batch of transactions.

I mean, if fixing this problem requires the user to batch transactions himself before invoking the RPC, that is not a proper solution to me 🤷‍♀️

@arkpar
Member

arkpar commented Jun 5, 2020

Why not just increase the size of the buffer to M*N? This is what #6080 does.

Batching is preferable to just increasing the size of the buffer.

  1. It reduces the number of times the async task is woken up. My profiling shows that the tokio/futures overhead for each wakeup/poll is still significant, and there's a lot of overhead that we introduce ourselves as well, e.g. each transaction event is reported to telemetry (see the sketch below this list).
  2. It also reduces network traffic.
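A minimal sketch of point 1 (an assumed setup using the futures crate, not Substrate code): with an async channel, each queued item can wake the consuming task, so sending transactions one by one means up to one wakeup per transaction, while sending them as a single Vec means one wakeup for the whole group.

```rust
use futures::channel::mpsc;
use futures::{SinkExt, StreamExt};

type Transaction = Vec<u8>;

async fn demo() {
    let txs: Vec<Transaction> = vec![vec![0u8; 32]; 1000];

    // Unbatched: 1000 channel items, so up to 1000 polls of the receiving task.
    let (mut per_item_tx, mut per_item_rx) = mpsc::channel::<Transaction>(1024);
    for tx in txs.clone() {
        per_item_tx.send(tx).await.unwrap();
    }
    drop(per_item_tx);
    let mut received = 0;
    while per_item_rx.next().await.is_some() {
        received += 1; // each item is handled on a separate wakeup of this task
    }

    // Batched: a single channel item carries all 1000 transactions in one wakeup.
    let (mut batch_tx, mut batch_rx) = mpsc::channel::<Vec<Transaction>>(16);
    batch_tx.send(txs).await.unwrap();
    drop(batch_tx);
    if let Some(batch) = batch_rx.next().await {
        println!("handled {} transactions at once (vs {} one by one)", batch.len(), received);
    }
}

fn main() {
    futures::executor::block_on(demo());
}
```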

@arkpar
Member

arkpar commented Jun 5, 2020

AFAIU the RPC batching is just going to be an option, not a requirement. I.e. Currently you can ONLY submit transactions one-by-one and we could add one more method that allows sending a batch of transactions.

I mean, if fixing this problem requires the user to batch transactions himself before invoking the RPC, that is not a proper solution to me 🤷‍♀️

I guess the RPC handler could read the incoming HTTP or WS stream and batch messages internally as long as there are more calls of the same type in the stream. But I don't know whether our RPC crate allows that.

Batching could also be done at the transaction pool level. I.e. the transaction pool has a queue of transactions that are pending signature checks and validation. When a single transaction is validated, it checks whether the queue is empty before sending notifications; if not, it waits until a certain number of transactions have been validated.
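A simplified sketch of that pool-level idea (illustrative structure and names, not the real transaction pool code): hold back the "ready" notification while more transactions are still pending validation, then flush the accumulated hashes as one notification.

```rust
use std::collections::VecDeque;

type Hash = [u8; 32];

struct Pool {
    pending_validation: VecDeque<Vec<u8>>, // transactions still awaiting checks
    validated_unnotified: Vec<Hash>,       // validated but not yet announced to peers
    max_batch: usize,                      // flush threshold, illustrative value
}

impl Pool {
    fn on_validated(&mut self, hash: Hash) {
        self.validated_unnotified.push(hash);
        // Only notify peers when nothing else is queued, or the batch is big enough.
        if self.pending_validation.is_empty()
            || self.validated_unnotified.len() >= self.max_batch
        {
            self.flush_notifications();
        }
    }

    fn flush_notifications(&mut self) {
        let batch = std::mem::take(&mut self.validated_unnotified);
        println!("propagating {} transactions in one notification", batch.len());
    }
}

fn main() {
    let mut pool = Pool {
        pending_validation: VecDeque::new(),
        validated_unnotified: Vec::new(),
        max_batch: 64,
    };
    pool.on_validated([0u8; 32]); // nothing else queued, so it flushes immediately
}
```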

@tomaka
Contributor

tomaka commented Jun 5, 2020

Can we not just test #6080, which is a two-line change, instead of complicating everything and designing new things again?

@gavofyork
Member

Can we not just test #6080, which is a two-line change, instead of complicating everything and designing new things again?

#6080 is a workaround for a bad design. It can be used for doing demos and temporarily pushing up throughput, but shouldn't be considered an "alternative solution".

This needs fixing properly.

@tomaka
Contributor

tomaka commented Jun 5, 2020

I agree on the "bad design" if you're talking about #5481. I completely disagree if you're referring to our channels system.

There are two problems in play: the lack of back-pressure (#5481), and raw performance.

Solutions such as "batching messages" fall into the category of "improving performance", but improving performance is already a work-around for #5481.
Whatever we do, our code will not be fundamentally correct until #5481 is fixed.

#6080 is also in the category of "improving performance", and I still don't understand why it's being blocked.

@gavofyork
Member

I agree that #5481 should be sorted. However, it's not clear to me that batching transactions isn't a plainly sensible optimisation. Arbitrarily increasing queue sizes is much less so.

@arkpar
Member

arkpar commented Jun 5, 2020

@tomaka would #6080 really help here? I thought this const would have to be increased.

My main concern regarding #6080 is that it may increase memory usage significantly when there are a lot of peers and the node is overloaded with gossip messages.

It is not that much of a problem for transactions, as there are not many of them on the real network and we only re-propagate after validation, which makes it harder to spam the whole network through a single connection. I'm not sure this is the case for consensus gossip, though.

@tomaka
Contributor

tomaka commented Jun 5, 2020

@tomaka would #6080 really help here? I thought this const would have to be increased.

I'm not sure of what I'm saying (which is why I'm advocating for trying it, rather than affirming that this is a fix), but I think that the small buffer size that #6080 touches causes the network worker to sleep, during which messages accumulate.
In other words: if the network worker didn't sleep, messages would be pushed at a slower rate than they are processed, but because of this intermittent sleeping, the opposite happens. When the network worker wakes up, we suddenly push thousands of messages at once.

I haven't mentioned it in the issue, but I'm indeed working on bypassing that network worker channel for notifications (i.e. GrandPa would push a message to a channel that is processed by the background task dedicated to the node, and wouldn't go through the network worker task at all).
It's far from trivial for various reasons, so it's taking a bit of time, but eventually it would reduce the number of messages that go through that channel and achieve the same goal. A conceptual sketch of that topology is below.
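A conceptual sketch of the per-peer channel idea (illustrative types and names, an assumed futures-based setup, not the in-progress implementation): the protocol holds a sender that feeds the background task dedicated to a single peer, so notifications never pass through the central network worker.

```rust
use futures::channel::mpsc;
use futures::{SinkExt, StreamExt};

type Notification = Vec<u8>;

// Handle a protocol (e.g. GrandPa) keeps for one connected peer; sending through it
// does not involve the network worker task at all.
#[derive(Clone)]
struct NotificationSink(mpsc::Sender<Notification>);

impl NotificationSink {
    async fn send(&mut self, msg: Notification) {
        // Back-pressure is applied here, per peer, instead of in one shared worker queue.
        let _ = self.0.send(msg).await;
    }
}

// The background task dedicated to a single peer: it owns the substream and writes
// whatever the protocols pushed into its channel.
async fn per_peer_task(mut rx: mpsc::Receiver<Notification>) {
    while let Some(msg) = rx.next().await {
        println!("writing {} bytes to the peer's substream", msg.len());
    }
}

fn main() {
    let (tx, rx) = mpsc::channel(64);
    futures::executor::block_on(async {
        let producer = async move {
            let mut sink = NotificationSink(tx);
            sink.send(b"grandpa commit".to_vec()).await;
            // Dropping the sink closes the channel, letting the peer task finish.
        };
        futures::future::join(per_peer_task(rx), producer).await;
    });
}
```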
