Heal timeout fix #25605

Closed · holiman wants to merge 5 commits

Conversation

holiman (Contributor) commented Aug 25, 2022

This is meant to fix #25600 (Inefficient response handling in Snap).
Work in progress.

Will run the version with a "15 cap" versus the "no cap" version (same as master, plus some additional metrics), on two Azure nodes each, in order to get some measurements on how the cap affects the overall packet lossiness.

ansible-playbook playbook.yaml -t geth -l bootnode-azure-westus-001,bootnode-azure-koreasouth-001  -e "geth_image=holiman/geth-experimental:latest" -e "geth_datadir_wipe=partial"
ansible-playbook playbook.yaml -t geth -l bootnode-azure-brazilsouth-001,bootnode-azure-australiaeast-001  -e "geth_image=holiman/geth-master:latest" -e "geth_datadir_wipe=partial"

westus, koreasouth: this PR (cap on pending)
brazilsouth, australiaeast: no cap on pending
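For context, a minimal sketch of what a "cap on pending" variant could look like in Go. This is not the actual diff from this PR; `maxPendingHeals`, `healScheduler`, and the method names are hypothetical, used only to illustrate the idea of refusing to issue new heal requests once too many are already in flight:

```go
// Package healcap sketches a cap on in-flight heal requests (illustrative only).
package healcap

import (
	"errors"
	"sync"
)

const maxPendingHeals = 15 // the "15 cap" variant under test

var errTooManyPending = errors.New("pending heal requests at cap")

type healScheduler struct {
	mu      sync.Mutex
	pending int // requests sent but not yet answered or timed out
}

// tryReserve reserves a request slot, refusing once the cap is reached, so
// that responses are less likely to sit in the inbox past their timeout.
func (s *healScheduler) tryReserve() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending >= maxPendingHeals {
		return errTooManyPending
	}
	s.pending++
	return nil
}

// release frees a slot when a response arrives or the request times out.
func (s *healScheduler) release() {
	s.mu.Lock()
	s.pending--
	s.mu.Unlock()
}
```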

holiman (Contributor, Author) commented Aug 26, 2022

Some preliminary figures. The machines are not yet in the heal phase, and the prior phase seems not to be as afflicted. The meters show the count of unexpected replies (i.e. replies that we have already timed out, most likely because they were stuck in the inbox for too long), and the count of expected replies.
[Screenshot: Single Geth dashboard, Grafana, 2022-08-26]
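For reference, a minimal sketch of how two such meters could be wired up, assuming go-ethereum's metrics package; the metric names and the onReply helper are placeholders, not necessarily what this PR uses:

```go
package healcap

import "github.com/ethereum/go-ethereum/metrics"

// Meters for replies that still had a live request ("expected") versus
// replies whose request had already timed out ("unexpected").
var (
	expectedReplyMeter   = metrics.NewRegisteredMeter("snap/heal/replies/expected", nil)
	unexpectedReplyMeter = metrics.NewRegisteredMeter("snap/heal/replies/unexpected", nil)
)

// onReply records whether a reply matched a still-pending request or arrived
// after we had already given up on it.
func onReply(stillPending bool) {
	if stillPending {
		expectedReplyMeter.Mark(1)
	} else {
		unexpectedReplyMeter.Mark(1)
	}
}
```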

The numbers are as follows, for the two nodes with no cap on pending:

>>> 118/6.6
17.87878787878788
>>> 122/9
13.555555555555555

So e.g. 122K requests were handled fine, but 9K requests timed out, which means roughly 13 OK requests for every failed one.
For the ones with a cap on pending:

>>> 82/2.3
35.652173913043484
>>> 75/2.4
31.25

So these nodes have issued far fewer requests, but have also wasted fewer responses: 30+ OK requests for every failed one.

These numbers might change when the nodes hit the state-heal phase.

holiman (Contributor, Author) commented Aug 27, 2022

All nodes are now in the heal phase, and the numbers have changed a bit:

australia east: 1.29M vs 239K ==> 5.4
brazil south: 1.66M vs 270K ==> 6.14
westus: 322K vs 28K ==> 11.5
korea-south: 307K vs 21K ==> 14

So the 'normal' (uncapped) nodes have a ratio of roughly 5-6 OK responses for each bad one, and the capped ones are better. However, the uncapped ones have managed to download about 4x as much data during this period, so that definitely outweighs any benefits of this PR.

The uncapped ones have basically thrown away almost as much as the capped ones have downloaded.

holiman closed this on Aug 30, 2022