Increase priority for validator HTTP requests #6292

michaelsproul · 2024-08-22T00:30:09Z

Issue Addressed

Address an issue reported by Stakely during Obol testing (for Lido) whereby Charon gets stuck in a timeout loop trying to request the details of inactive validators from Lighthouse.

The error from Charon is:

14:51:30.489 ERRO vapi Validator api 5xx response: fetching non-cached validators from BN: beacon api validators: http request timeout: context deadline exceeded {"status_code": 500, "message": "Internal server error", "duration": "2.001067392s", "label": "validators", "url": "http://beaconnode:5051/eth/v1/beacon/states/head/validators?id=XXX", "method": "Get", "vapi_endpoint": "get_validator"}
app/eth2wrap/eth2wrap.go:206 .wrapError
app/eth2wrap/eth2wrap_gen.go:648 .Validators
core/validatorapi/validatorapi.go:1021 .Validators
core/validatorapi/router.go:1372 .getValidatorsByID
core/validatorapi/router.go:388 .func7
core/validatorapi/router.go:311 .func1
app/app.go:954 .func2

Proposed Changes

This PR increases the priority for validator info requests from P1 to P0, until we can do the hard work of overhauling the priority system:

Improve beacon processor scheduling #6291
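To illustrate why moving a request class from P1 to P0 helps under load, here is a minimal sketch of a two-level scheduler. It assumes strict priority between the two levels (P0 always drains before P1), which is a simplification and not a description of Lighthouse's actual beacon processor. The `Scheduler` type, its methods, and the request names are hypothetical.

```rust
use std::collections::VecDeque;

/// Hypothetical priority levels, named after the P0/P1 terminology used in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    P0,
    P1,
}

/// Toy strict-priority scheduler (assumption: NOT Lighthouse's real beacon processor).
#[derive(Default)]
struct Scheduler {
    p0: VecDeque<&'static str>,
    p1: VecDeque<&'static str>,
}

impl Scheduler {
    /// Queue a request at the given priority level.
    fn submit(&mut self, priority: Priority, request: &'static str) {
        match priority {
            Priority::P0 => self.p0.push_back(request),
            Priority::P1 => self.p1.push_back(request),
        }
    }

    /// Pop the next request: any queued P0 work runs before any queued P1 work.
    fn next_request(&mut self) -> Option<&'static str> {
        self.p0.pop_front().or_else(|| self.p1.pop_front())
    }
}

/// Drain the scheduler and return the order in which requests would run.
fn drain_order(mut scheduler: Scheduler) -> Vec<&'static str> {
    let mut order = Vec::new();
    while let Some(request) = scheduler.next_request() {
        order.push(request);
    }
    order
}
```

In this toy model, a validator info request submitted at P0 runs ahead of an existing P1 backlog, whereas at P1 it would wait behind it.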

This is a low-risk change, but will require a new LH release before it can be used in production by Obol users. I will try to get Stakely to run it on testnets.

Additional Info

There is a workaround that achieves a similar effect: --http-enable-beacon-processor false. However, this comes with the downside of opening Lighthouse's HTTP API up to accidental DoS. With no limit on the number of concurrent requests, it is easy to overwhelm Lighthouse and cause OOMs and general slowness.
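The trade-off can be sketched with a toy admission counter. This is an illustration only: the `Admission` type is hypothetical, and the sketch rejects requests over the limit, whereas how Lighthouse treats excess requests (queueing, dropping, or otherwise) is not stated here.

```rust
/// Toy admission control for concurrent requests (hypothetical, not Lighthouse code).
struct Admission {
    /// Requests currently being served.
    in_flight: usize,
    /// `Some(n)` caps concurrency at `n`; `None` models running with no limit.
    limit: Option<usize>,
}

impl Admission {
    fn new(limit: Option<usize>) -> Self {
        Admission { in_flight: 0, limit }
    }

    /// Try to start serving one more request; returns false if the cap is reached.
    fn try_admit(&mut self) -> bool {
        match self.limit {
            Some(limit) if self.in_flight >= limit => false,
            _ => {
                self.in_flight += 1;
                true
            }
        }
    }
}
```

With `limit: None`, every request in a burst is admitted, so the number in flight grows with the size of the burst. That unbounded growth is the accidental-DoS risk described above.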

jimmygchen

Agree this is a low risk change. LGTM.

How often does Charon query the BN for validators? My concern is that if there is a large number of inactive validators, it could overwhelm the BN, and making it P0 could make things worse.

We've had performance issues with this endpoint before, and changed our poll frequency to once per epoch. Our VC also avoids polling on the first slot of the epoch (change made in #5628).

michaelsproul · 2024-08-30T04:15:51Z

I don't think Charon sets the poll frequency because it is just middleware between the VC and the BN. So Lighthouse VC is in control of the polling schedule and should only be making requests once per epoch as of v5.2.0 (when #5628 was included). I've asked Stakely to confirm which version of Lighthouse VC they were running. If they were running v5.1.3 or earlier then it will have been spamming every slot and making the problem worse.
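As a rough illustration of the difference in load, the sketch below counts polls over a range of slots under the two policies. It assumes mainnet's 32 slots per epoch. The constant `POLL_SLOT_IN_EPOCH` is a hypothetical choice that merely avoids the first slot of the epoch; the slot Lighthouse VC actually polls in is not stated here.

```rust
/// Slots per epoch on Ethereum mainnet.
const SLOTS_PER_EPOCH: u64 = 32;

/// Hypothetical slot offset within the epoch at which to poll (assumption:
/// any non-zero offset avoids the first slot of the epoch).
const POLL_SLOT_IN_EPOCH: u64 = 1;

/// Older behaviour as described above: poll on every slot.
fn polls_every_slot(_slot: u64) -> bool {
    true
}

/// Newer behaviour as described above: poll once per epoch, never on its first slot.
fn polls_once_per_epoch(slot: u64) -> bool {
    slot % SLOTS_PER_EPOCH == POLL_SLOT_IN_EPOCH
}

/// Count how many polls a policy issues over `num_slots` slots starting at `first_slot`.
fn count_polls(first_slot: u64, num_slots: u64, policy: fn(u64) -> bool) -> usize {
    (first_slot..first_slot + num_slots)
        .filter(|&slot| policy(slot))
        .count()
}
```

Over two epochs (64 slots), the per-slot policy issues 64 polls while the per-epoch policy issues 2.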

I also don't think the API performance was too bad here. We quite regularly hit 2s+ response times for P1 requests on Holesky, and this is usually fine for non-critical requests. The problem is that Charon times out after 2s and spews error logs. Because the validators are not active, Lighthouse VC (through Charon) will keep requesting, and even if some of these requests succeed, the error logs can come back the next time a request times out. So my original description of a "timeout loop" is not quite accurate: Charon is likely just timing out repeatedly.
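The failure mode described here, a response that is slow but would eventually succeed and is still reported as an error by a caller with a fixed deadline, can be sketched as follows. This is a generic illustration using only the Rust standard library, not Charon's or Lighthouse's code. `call_with_timeout` and `simulated_response` are hypothetical helpers, and the durations are scaled down from the 2s in the discussion.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a background thread and wait at most `timeout` for its result.
/// Returns an error if the deadline passes first, even though the work itself
/// keeps running and may still complete successfully.
fn call_with_timeout<T: Send + 'static>(
    timeout: Duration,
    work: impl FnOnce() -> T + Send + 'static,
) -> Result<T, mpsc::RecvTimeoutError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the caller may already have given up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout)
}

/// Stand-in for a beacon node response with the given latency (hypothetical).
fn simulated_response(latency: Duration) -> u16 {
    thread::sleep(latency);
    200
}
```

A call whose simulated latency is below the deadline returns `Ok(200)`; one whose latency exceeds it returns `Err(RecvTimeoutError::Timeout)`, although the underlying work would have produced a 200.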

Stakely reported that the issue disappeared completely when running with --http-enable-beacon-processor false, which to me also points to a scheduling issue around P1 more than a general performance issue. I think this also implies that the BN can handle these requests at P0, which is essentially what --http-enable-beacon-processor false does (it makes every request P0 with no rate-limiting).

michaelsproul · 2024-08-30T04:34:16Z

@mergify queue

mergify · 2024-08-30T04:34:29Z

queue

The pull request has been merged automatically at ae83901

* Increase priority for validator HTTP requests

Increase priority for validator HTTP requests

23d33c6

michaelsproul added the ready-for-review (The code is ready for review), optimization (Something to make Lighthouse run more efficiently.), v6.0.0 (New major release for hierarchical state diffs), val-client (Relates to the validator client binary) and HTTP-API labels on Aug 22, 2024

michaelsproul mentioned this pull request Aug 22, 2024

Tweak timeouts for fetching validators from BN ObolNetwork/charon#3237

Open

michaelsproul added the low-hanging-fruit (Easy to resolve, get it before someone else does!) label on Aug 30, 2024

jimmygchen approved these changes Aug 30, 2024

jimmygchen added the ready-for-merge (This PR is ready to merge.) label and removed the ready-for-review (The code is ready for review) label on Aug 30, 2024

mergify bot added a commit that referenced this pull request Aug 30, 2024

Merge of #6292

6a65593

This was referenced Aug 30, 2024

merge queue: embarking unstable (100f33a) and [#6292 + #6331] together #6333

Closed

Return HTTP 404 for pruned blob requests #6331

Merged

mergify bot merged commit ae83901 into sigp:unstable Aug 30, 2024
28 checks passed

AgeManning pushed a commit to AgeManning/lighthouse that referenced this pull request Sep 3, 2024

Increase priority for validator HTTP requests (sigp#6292)

3930a73

* Increase priority for validator HTTP requests

chong-he pushed a commit to chong-he/lighthouse that referenced this pull request Nov 26, 2024

Increase priority for validator HTTP requests (sigp#6292)

21f7c32

* Increase priority for validator HTTP requests
