Tweak timeouts for fetching validators from BN #3237

michaelsproul · 2024-08-22T01:51:23Z

🐞 Bug Report

Description

Based on reports from Stakely, Charon sometimes times out when requesting validator info from Lighthouse:

14:51:30.489 ERRO vapi Validator api 5xx response: fetching non-cached validators from BN: beacon api validators: http request timeout: context deadline exceeded {"status_code": 500, "message": "Internal server error", "duration": "2.001067392s", "label": "validators", "url": "http://beaconnode:5051/eth/v1/beacon/states/head/validators?id=XXX", "method": "Get", "vapi_endpoint": "get_validator"}
app/eth2wrap/eth2wrap.go:206 .wrapError
app/eth2wrap/eth2wrap_gen.go:648 .Validators
core/validatorapi/validatorapi.go:1021 .Validators
core/validatorapi/router.go:1372 .getValidatorsByID
core/validatorapi/router.go:388 .func7
core/validatorapi/router.go:311 .func1
app/app.go:954 .func2

Part of the reason for this is that Lighthouse considers these requests low-priority. I have opened a PR on Lighthouse to change this:

Increase priority for validator HTTP requests sigp/lighthouse#6292

However, in the meantime I think there are probably some changes charon could make to handle this more reliably.

The error log seems to show a request for /eth/v1/beacon/states/head/validators?id=XXX with a single ID. It's possible that timeouts could be avoided by batching multiple IDs in one request. Based on my reading of go-eth2-client, it already has the ability to use the more efficient POST method which can handle an unbounded number of pubkey requests in one go:

https://github.com/attestantio/go-eth2-client/blob/490d07a8e0c258f4528d3039109696679d79787d/http/validators.go#L81
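
To illustrate the batching idea, here is a minimal sketch using only the Go standard library. It assumes the Beacon API `POST /eth/v1/beacon/states/{state_id}/validators` endpoint, which takes the IDs in a JSON body under an `ids` field. The helper names `encodeValidatorIDs` and `buildBatchedRequest` are hypothetical and are not part of charon or go-eth2-client; charon would presumably go through go-eth2-client rather than building requests by hand.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// validatorsRequest is the JSON body for the POST variant of the
// validators endpoint. Each ID may be a pubkey or a validator index.
type validatorsRequest struct {
	IDs []string `json:"ids"`
}

// encodeValidatorIDs (hypothetical helper) packs all IDs into one body,
// so a single request replaces one GET per validator.
func encodeValidatorIDs(ids []string) ([]byte, error) {
	return json.Marshal(validatorsRequest{IDs: ids})
}

// buildBatchedRequest (hypothetical helper) builds one POST request
// against the head state that carries every requested ID.
func buildBatchedRequest(ctx context.Context, baseURL string, ids []string) (*http.Request, error) {
	body, err := encodeValidatorIDs(ids)
	if err != nil {
		return nil, fmt.Errorf("encode ids: %w", err)
	}
	url := baseURL + "/eth/v1/beacon/states/head/validators"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The 2s deadline mirrors the duration seen in the error log above.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Validator indices are used here purely as example IDs.
	req, err := buildBatchedRequest(ctx, "http://beaconnode:5051", []string{"1", "2", "3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

The point of the sketch is only that the number of round trips no longer grows with the number of validators, which should matter most in the 1000+ key case.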

Further, it could be good to give charon users the ability to adjust the timeouts used for communicating with the beacon node. I couldn't find in the charon code where the timeout is set, but it seems to be 2s based on the error. On beacon nodes that are struggling under load (or heavily deprioritising charon's requests as in the case of Lighthouse) a fixed timeout that is too short is just going to lead to indefinitely repeating requests. Giving users the ability to lengthen this timeout could mitigate this. Dynamic timeouts a la exponential backoff could also be an option, but are more complicated to implement.
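
A rough sketch of what a user-adjustable, growing timeout could look like is below. Everything in it is an assumption for illustration: `nextTimeout` and `fetchWithRetry` are hypothetical names, the base and maximum durations stand in for user-supplied configuration, and I have not checked how charon actually wires its timeouts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nextTimeout (hypothetical) doubles the base timeout once per failed
// attempt and never returns more than max.
func nextTimeout(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// fetchWithRetry (hypothetical) calls fetch with a fresh deadline on each
// attempt. It only retries when the attempt hit its deadline; any other
// error is returned immediately.
func fetchWithRetry(ctx context.Context, base, max time.Duration, attempts int,
	fetch func(context.Context) error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for attempt := 0; attempt < attempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, nextTimeout(base, max, attempt))
		err = fetch(attemptCtx)
		cancel()
		if err == nil || !errors.Is(err, context.DeadlineExceeded) {
			return err
		}
	}
	return fmt.Errorf("all %d attempts timed out: %w", attempts, err)
}

func main() {
	// Stand-in for a beacon node call that takes about 30ms to answer.
	slow := func(ctx context.Context) error {
		select {
		case <-time.After(30 * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	// Short durations keep the demo fast; real values would be in seconds.
	err := fetchWithRetry(context.Background(), 10*time.Millisecond, 100*time.Millisecond, 4, slow)
	fmt.Println("error:", err)
}
```

With a fixed timeout the simulated call above would fail on every attempt, whereas a growing timeout eventually gives a slow beacon node enough time to respond. A plain configurable timeout, without the doubling, would be the simpler first step.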

Has this worked before in a previous version?

Not sure.

🔬 Minimal Reproduction

Run charon with Lighthouse and 1000+ inactive validator keys.

🔥 Error

See above.

🌍 Your Environment

Not sure. I can check with Stakely.


github-actions bot added the protocol Protocol Team tickets label Aug 22, 2024
