
Inefficient response handling in Snap #25600

Closed

holiman opened this issue Aug 25, 2022 · 3 comments

holiman (Contributor) commented Aug 25, 2022

Note: this bug report is somewhat speculative; I am not 100% certain that the description here is correct.


I think I know why this PR (#25588) does not make any difference.
[Screenshot: Single Geth - Grafana machine charts]

These are the machine charts. A few observations can be made:

  • There are always 128-150 pending (outstanding) requests.
  • The Snap trienode handle times most commonly fall in the 3-7s bucket, but quite often reach up to 20 seconds.

So if all our responses have already arrived, handling them will take roughly 128 × 10s ≈ 1280 seconds, or about 21 minutes. By the time we actually get around to handling them, they will have timed out in the snap layer, even though they were fine in the p2p packet-tracker layer.
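
As a back-of-the-envelope check of that number (a minimal sketch; the ~10s per response is just a rough midpoint of the observed handle-time bucket, not a measured average):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Rough numbers read off the charts above: ~128 outstanding requests,
	// each taking on the order of 10s to handle once it reaches the front
	// of the queue.
	pending := 128
	avgHandle := 10 * time.Second

	// If every response is already sitting in the queue, the last one is
	// only handled after the whole backlog has been worked through.
	backlog := time.Duration(pending) * avgHandle
	fmt.Println(backlog) // prints 21m20s, i.e. the ~1280s mentioned above
}
```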

This is wasteful: we are making requests at a higher rate than we can handle, and thus ignoring responses and refetching stuff. We need to adjust the mechanism so that we do not request more than we can handle -- alternatively, fix the timeout management so that we do not time out deliveries which have already arrived and are just waiting in the queue.

holiman (Contributor, Author) commented Aug 25, 2022

In sync.go, every time the loop runs, we assign/issue another trienode heal request. So once an item has been lying in the queue for ~15 minutes, we issue and send out a new request. It may be served quickly by the remote peer, but it will not be handled until after it has timed out, 15-20 minutes later.

A better model in this case would be to tune the 128 pending requests down to maybe 10. The node in question has ~300 peers -- I guess this problem doesn't normally occur unless the node has a lot of peers.

Note: this is a consequence of disk IO speed, not network lag.
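
For illustration, a minimal, self-contained sketch of what such a hard cap on outstanding heal requests could look like -- the healScheduler type, its fields and the maxPendingHeal constant are invented for this example and are not the actual sync.go internals:

```go
package main

import "fmt"

// maxPendingHeal is a hypothetical hard cap on outstanding trienode heal
// requests; the real value would need tuning.
const maxPendingHeal = 10

// healScheduler is a toy stand-in for the snap syncer's heal bookkeeping;
// it only tracks which request ids are currently in flight.
type healScheduler struct {
	pending map[uint64]struct{}
	nextID  uint64
}

// assign issues new heal requests only while the number of outstanding ones
// stays below the cap, instead of issuing one per loop iteration regardless
// of how quickly responses can actually be handled.
func (h *healScheduler) assign() {
	for len(h.pending) < maxPendingHeal {
		h.nextID++
		h.pending[h.nextID] = struct{}{}
		fmt.Println("issued heal request", h.nextID)
	}
}

func main() {
	h := &healScheduler{pending: make(map[uint64]struct{})}
	h.assign() // issues exactly maxPendingHeal requests, then stops
}
```

The only point of the sketch is that the scheduling loop stops issuing work once the backlog reaches the cap, rather than issuing one request per iteration regardless of how fast responses are handled.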

holiman (Contributor, Author) commented Aug 25, 2022

Here's one hour of execution, where I have set a hard upper limit of 15 pending trie requests.
[Screenshot: Single Geth - Grafana, one hour of execution]

The Unexpected trienode heal log messages almost, but not quite, disappear.
[Screenshot: bootnode-azure-koreasouth-001 - Papertrail logs]
The code I used is on this branch: https://github.com/holiman/go-ethereum/tree/heal_timeout_fix

It places the cap statically at the point where we schedule requests. It could be applied in several different places, and could also be made dynamic -- e.g. adjusted so that maxPending goes down if the mean handle time goes up. Suggestions appreciated.
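
As one possible shape for that dynamic variant (purely a sketch: the EWMA weight, the 5s target latency and the min/max bounds are invented for the example, and nothing here is taken from the branch above):

```go
package main

import (
	"fmt"
	"time"
)

const (
	minCap        = 2               // never drop below a trickle of requests
	maxCap        = 128             // the current effective ceiling
	targetLatency = 5 * time.Second // desired per-response handle time
)

// dynamicCap shrinks or grows the allowed number of pending heal requests
// based on an exponentially weighted moving average of the handle time.
type dynamicCap struct {
	maxPending int
	avgHandle  time.Duration
}

// observe records one handle-time sample and nudges the cap: slower handling
// lowers the cap, clearly faster handling raises it again.
func (d *dynamicCap) observe(sample time.Duration) {
	// EWMA with a 0.2 weight on the newest sample.
	d.avgHandle = time.Duration(0.8*float64(d.avgHandle) + 0.2*float64(sample))

	switch {
	case d.avgHandle > targetLatency && d.maxPending > minCap:
		d.maxPending--
	case d.avgHandle < targetLatency/2 && d.maxPending < maxCap:
		d.maxPending++
	}
}

func main() {
	d := &dynamicCap{maxPending: 16, avgHandle: targetLatency}
	for _, sample := range []time.Duration{8 * time.Second, 12 * time.Second, 15 * time.Second} {
		d.observe(sample)
		fmt.Printf("avg=%v cap=%d\n", d.avgHandle, d.maxPending)
	}
}
```

Slow observed handling shrinks the cap towards minCap; consistently fast handling lets it grow back towards maxCap.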

holiman (Contributor, Author) commented Sep 1, 2022

Fix: #25651

holiman closed this as completed Sep 1, 2022