Heal timeout fix #25605

Closed · holiman wants to merge 5 commits

Conversation

holiman (Contributor) commented Aug 25, 2022

This is meant to fix #25600 (Inefficient response handling in Snap).
Work in progress.

Will run the version with a "15 cap" versus the "no cap" version (same as master, plus some additional metrics), on two Azure nodes each, in order to get some measurements on how the cap affects the overall packet lossiness.

ansible-playbook playbook.yaml -t geth -l bootnode-azure-westus-001,bootnode-azure-koreasouth-001  -e "geth_image=holiman/geth-experimental:latest" -e "geth_datadir_wipe=partial"
ansible-playbook playbook.yaml -t geth -l bootnode-azure-brazilsouth-001,bootnode-azure-australiaeast-001  -e "geth_image=holiman/geth-master:latest" -e "geth_datadir_wipe=partial"

westus, koreasouth: this PR (cap on pending)
brazilsouth, australiaeast: no cap on pending
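For context, a minimal sketch of what a "cap on pending" variant could look like in Go. This is not the actual diff from this PR; `maxPendingHeals`, `healScheduler`, and the method names are hypothetical, used only to illustrate the idea of refusing to issue new heal requests once too many are already in flight:

```go
// Package healcap sketches a cap on in-flight heal requests (illustrative only).
package healcap

import (
	"errors"
	"sync"
)

const maxPendingHeals = 15 // the "15 cap" variant under test

var errTooManyPending = errors.New("pending heal requests at cap")

type healScheduler struct {
	mu      sync.Mutex
	pending int // requests sent but not yet answered or timed out
}

// tryReserve reserves a request slot, refusing once the cap is reached, so
// that responses are less likely to sit in the inbox past their timeout.
func (s *healScheduler) tryReserve() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending >= maxPendingHeals {
		return errTooManyPending
	}
	s.pending++
	return nil
}

// release frees a slot when a response arrives or the request times out.
func (s *healScheduler) release() {
	s.mu.Lock()
	s.pending--
	s.mu.Unlock()
}
```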

holiman (Contributor, Author) commented Aug 26, 2022

Some preliminary figures. The machines are not yet in the heal phase, and the prior phase seems not to be as afflicted. The meters show the count of unexpected replies (i.e. replies that we have already timed out, most likely because they were stuck in the inbox for too long), and the count of expected replies.
[Screenshot: Single Geth dashboard, Grafana, 2022-08-26]
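For reference, a minimal sketch of how two such meters could be wired up, assuming go-ethereum's metrics package; the metric names and the onReply helper are placeholders, not necessarily what this PR uses:

```go
package healcap

import "github.com/ethereum/go-ethereum/metrics"

// Meters for replies that still had a live request ("expected") versus
// replies whose request had already timed out ("unexpected").
var (
	expectedReplyMeter   = metrics.NewRegisteredMeter("snap/heal/replies/expected", nil)
	unexpectedReplyMeter = metrics.NewRegisteredMeter("snap/heal/replies/unexpected", nil)
)

// onReply records whether a reply matched a still-pending request or arrived
// after we had already given up on it.
func onReply(stillPending bool) {
	if stillPending {
		expectedReplyMeter.Mark(1)
	} else {
		unexpectedReplyMeter.Mark(1)
	}
}
```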

The numbers are as follows, for the two nodes with no cap on pending:

>>> 118/6.6
17.87878787878788
>>> 122/9
13.555555555555555

So e.g. 122K requests were handled fine, but 9K requests timed out, which means roughly 13 OK requests for every failed one.
For the ones with a cap on pending:

>>> 82/2.3
35.652173913043484
>>> 75/2.4
31.25

So these nodes have issued far fewer requests, but have also wasted fewer responses: 30+ OK requests for every failed one.

These numbers might change when the nodes hit the state-heal phase.

holiman (Contributor, Author) commented Aug 27, 2022

All nodes are now in the heal phase, and the numbers have changed a bit:

australia east: 1.29M vs 239K ==> 5.4
brazil south: 1.66M vs 270K ==> 6.14
westus: 322K vs 28K ==> 11.5
korea-south: 307K vs 21K ==> 14

So the 'normal' (uncapped) nodes have a ratio of roughly 5-6 OK responses for each bad one, and the capped ones are better. However, the uncapped ones have managed to download about 4x as much data during this period, so that definitely outweighs any benefits of this PR.

The uncapped ones have basically thrown away almost as much as the capped ones have downloaded.

holiman closed this on Aug 30, 2022