-
-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve jsonRPC error UX for eth1 + execution #5949
Conversation
Performance Report✔️ no performance regression detected Full benchmark results
|
e3c106a
to
cbcce56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall looks really good to me ❤️
could you generate some example logs by testing a few scenarios and how how they look before i approve
The log format does not change. it still shows the error exactly like before.
The only change would be the error will be shown once connection have any issue. |
can we have the "eth1 communication restored/online" etc kind of messages like how you did in the engine when state changes |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! great cleanup ❤️ and hopefully clean UX
} | ||
|
||
private async fetchJson<R, T = unknown>(json: T, opts?: ReqOpts): Promise<R> { | ||
if (this.urls.length === 0) throw Error("No url provided"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I previously removed a similar check like this in #5884 as we already check the URLs in the constructor
lodestar/packages/beacon-node/src/eth1/provider/jsonRpcHttpClient.ts
Lines 115 to 117 in 3cfa9cd
if (urls.length === 0) { | |
throw Error("No urls provided to JsonRpcHttpClient"); | |
} |
I don't think this error is reachable but if it is we should make sure to validate the URLs earlier.
@@ -35,3 +36,29 @@ describe("waitFor", () => { | |||
await expect(waitFor(() => true, {interval, timeout, signal: controller.signal})).to.be.rejectedWith(ErrorAborted); | |||
}); | |||
}); | |||
|
|||
describe("waitForElapsedTime", () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test case name does not match function name, looks like this was missed during renaming
🎉 This PR is included in v1.12.0 🎉 |
// only log error if state switched from online to some other state | ||
} else if (!isErrorAborted(e)) { | ||
this.logger.error("Error updating eth1 chain cache", {}, e as Error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nazarhussain can you please explain this change, does the comment relate to the removed log?
And is the log on line 183 supposed to be replaced by the error log added in the eth1 provider?
"Eth1Provider faced error", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this log line should be replaced with eth1Provider error handling, but because we log the error once per minute so it could be the case you don't see the errors often.
} | ||
|
||
if (this.state !== Eth1ProviderState.ONLINE) { | ||
if (isOneMinutePassed()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nazarhussain can you please explain how this should behave
- should it invoke the error log at most once a minute
- or should it just invoke the error log once and never again, until there is a period of 1 minute where there are no errors?
See issue reported on discord, we are not actually printing out any errors and just spam the logs with info: Eth1Provider is back online oldState=ERROR, newState=ONLINE
.
I pushed a few changes that might partially solve this by fixing createElapsedTimeTracker
function, see unstable...nflaig/eth1-error-logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue was reported to this, see discord. Does not seem to happen if EL is still synincing but needs more investigation as they use openexecution and not directly connect to EL client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When ever JsonRpcHttpClient
encounter an error, it bubbles up to this function. We log the the error once per minute because once the rpc encounter error it's usually because the RPC node is offline or some other issue then it's stay that way very long we end up tons of error logs with same error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We log the the error once per minute
that's what it should do but the function has a bug, fixed in unstable...nflaig/eth1-error-logging
The other issue I see is how we do the status logging, as single failed request will set the status to ERROR
while a single success request sets it to ONLINE
.
If some request succeed and some fail which is the cause if nethermind can't serve historical logs but it can serve other requests then what ends up happning is that the logs are spammed with
info: Eth1Provider is back online oldState=ERROR, newState=ONLINE.
It more of an issue of the intended behavior vs. the acutal behavior, if we fix the implementation it should be fine. I added more details in the issue #6469.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over the discussions I had for it's implementation, that was the inspiration or this change:
also maybe every minute log the error message as well if the eth1 is still down ... ( i.e. if we make the request and its still error and the last error display message > 60 seconds , log the message and update last error display time)
So yes there is a bug in the implementation, but not in the createElapsedTimeTracker
. That one seems to be good as what is intended for, it's just tracking time since first invocation.
For eth1
provider we should do via rate limiting helpers for rate limit error per minute.
Motivation
Manage the state transition for the eth1 provider to log the errors with right context.
Description
Closes #4348
Steps to test or reproduce