RPC responses from go impl polling incomplete bytes from substream #505
Labels
Network (Libp2p and PubSub stuff)
Priority: 2 - High (Very important and should be addressed ASAP)
Type: Bug (Something isn't working)
Describe the bug
Bytes polled from the substream for BlockSync responses from the Go implementation fail to decode more often than not, because only part of the response has been pulled from the substream. This used to be very inconsistent and hard to reproduce within our implementation, but with Kademlia and varying network latency it is now much easier to reproduce.
Error will look something like:
but even when putting a long thread sleep right before the bytes are polled, the error is somewhat consistently:
and I didn't dig deep enough to find out where the 8192 cap comes from: the substream itself (in either the Go or Rust implementation), a cap on the bytes polled from the substream, or some other limit. It is too consistent to be a coincidence. Whatever the limit is, it probably means we have to either decode the response from a reader that pulls from the substream (inefficient, since that probably shouldn't happen in the poll function) or store chunks of unfinished bytes polled from the substream (also inefficient, since it keeps unnecessary bytes in memory and repeats decoding checks until the response is complete).
In any case, the main thing to find out is why the substream is polled as ready when bytes have only started to be written. I wasn't able to reproduce this from within our client, but maybe I just didn't create a large enough tipset bundle, or didn't simulate how the Go implementation writes to the substream (I believe their CBOR encoding just uses a writer, so bytes get written as they are encoded).
My guess is that the RPC module, and maybe even rust-libp2p, isn't built to handle such large messages over the network (BlockSync tipset bundles are very large in practical scenarios), so the way the RPC module is set up will probably need a refactor.
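To illustrate the partial-read behaviour described above, here is a minimal, synchronous sketch using only the standard library. It is not the RPC module's code: `ChunkedReader`, `read_once` and `read_all` are hypothetical names, the 8192 cap is only simulated (where the real cap comes from is still unknown), and it assumes the sender signals the end of the response with EOF, which has not been confirmed for the Go implementation.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a substream that hands out at most `cap`
/// bytes per read, no matter how large the caller's buffer is.
struct ChunkedReader {
    data: Vec<u8>,
    pos: usize,
    cap: usize,
}

impl ChunkedReader {
    fn new(data: Vec<u8>, cap: usize) -> Self {
        Self { data, pos: 0, cap }
    }
}

impl Read for ChunkedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = self.data.len() - self.pos;
        let n = remaining.min(self.cap).min(buf.len());
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        // Returns Ok(0) once everything has been consumed, which signals EOF.
        Ok(n)
    }
}

/// A single read: returns whatever the reader had ready, which may be
/// only a prefix of the full response.
fn read_once<R: Read>(reader: &mut R, max: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; max];
    let n = reader.read(&mut buf)?;
    buf.truncate(n);
    Ok(buf)
}

/// Keeps reading until EOF, so the caller only ever tries to decode a
/// complete response.
fn read_all<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    reader.read_to_end(&mut out)?;
    Ok(out)
}
```

With a 20,000-byte payload and a simulated cap of 8192, `read_once` returns only the first 8192 bytes even when given a 64 KiB buffer, while `read_all` returns the full payload. The real substream is asynchronous, so an actual fix would need the async equivalent of this loop, or explicit framing if the stream is not closed after each response.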
To Reproduce
Steps to reproduce the behavior:
Once #501 comes in, this can be reproduced very consistently by connecting to testnet (default bootnodes and genesis, so just running the node).
Log output
Expected behaviour
BlockSync responses should not fail to decode
Screenshots
Environment (please complete the following information):
rustc --version
Other information and links