Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase priority for validator HTTP requests #6292

Merged
merged 1 commit into from
Aug 30, 2024

Conversation

michaelsproul
Copy link
Member

Issue Addressed

Address an issue reported by Stakely during Obol testing (for Lido) whereby charon gets stuck in a timeout loop trying to request the details of inactive validators from Lighthouse.

The error from Charon is:

14:51:30.489 ERRO vapi Validator api 5xx response: fetching non-cached validators from BN: beacon api validators: http request timeout: context deadline exceeded {"status_code": 500, "message": "Internal server error", "duration": "2.001067392s", "label": "validators", "url": "http://beaconnode:5051/eth/v1/beacon/states/head/validators?id=XXX", "method": "Get", "vapi_endpoint": "get_validator"}
app/eth2wrap/eth2wrap.go:206 .wrapError
app/eth2wrap/eth2wrap_gen.go:648 .Validators
core/validatorapi/validatorapi.go:1021 .Validators
core/validatorapi/router.go:1372 .getValidatorsByID
core/validatorapi/router.go:388 .func7
core/validatorapi/router.go:311 .func1
app/app.go:954 .func2

Proposed Changes

This PR increases the priority for validator info requests from P1 to P0, until we can do the hard work of overhauling the priority system:

This is a low-risk change, but will require a new LH release before it can be used in production by Obol users. I will try to get Stakely to run it on testnets.

Additional Info

There is a workaround that achieves a similar effect: --http-enable-beacon-processor false. However this comes with the downside of opening Lighthouse's HTTP API up to accidental DoS. With no limit to the number of concurrent requests, it is easy to overwhelm Lighthouse with requests and cause OOMs and other slowness.

@michaelsproul michaelsproul added ready-for-review The code is ready for review optimization Something to make Lighthouse run more efficiently. v6.0.0 New major release for hierarchical state diffs val-client Relates to the validator client binary HTTP-API labels Aug 22, 2024
@michaelsproul michaelsproul added the low-hanging-fruit Easy to resolve, get it before someone else does! label Aug 30, 2024
Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this is a low risk change. LGTM.

How often does charon query the BN for validators? My concern is if there is a large number of inactive validators, it could overwhelm the BN and making it P0 could make things worse.

We've had performance issue with this endpoint before - and changed our poll frequency to once per epoch. Our VC also avoid polling on the first slot of the epoch (change made in #5628)

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Aug 30, 2024
@michaelsproul
Copy link
Member Author

I don't think Charon sets the poll frequency because it is just middleware between the VC and the BN. So Lighthouse VC is in control of the polling schedule and should only be making requests once per epoch as of v5.2.0 (when #5628 was included). I've asked Stakely to confirm which version of Lighthouse VC they were running. If they were running v5.1.3 or earlier then it will have been spamming every slot and making the problem worse.

I also don't think the API performance was too bad here. We quite regularly hit 2s+ response times for P1 requests on Holesky, and this is usually fine for non-critical requests. The problem is that Charon times out after 2s and spews error logs. Because the validators are not active, Lighthouse VC (through Charon) will keep requesting, and even if some of these requests succeed, the error logs can come back the next time a request times out. So my original description of a "timeout loop" is not quite accurate: charon is likely just timing out repeatedly.

Stakely reported that the issue disappeared completely when running with --http-enable-beacon-processor false, which to me also points to a scheduling issue around P1 more than a general performance issue. I think this also implies that the BN can handle these requests at P0 which is essentially what --http-enable-beacon-processor false does (makes every request P0 with no rate-limiting).

@michaelsproul
Copy link
Member Author

@mergify queue

Copy link

mergify bot commented Aug 30, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at ae83901

mergify bot added a commit that referenced this pull request Aug 30, 2024
@mergify mergify bot merged commit ae83901 into sigp:unstable Aug 30, 2024
28 checks passed
AgeManning pushed a commit to AgeManning/lighthouse that referenced this pull request Sep 3, 2024
* Increase priority for validator HTTP requests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HTTP-API low-hanging-fruit Easy to resolve, get it before someone else does! optimization Something to make Lighthouse run more efficiently. ready-for-merge This PR is ready to merge. v6.0.0 New major release for hierarchical state diffs val-client Relates to the validator client binary
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants