-
-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: track syncing status and fetch duties on resynced #6995
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #6995 +/- ##
============================================
+ Coverage 49.10% 49.23% +0.12%
============================================
Files 577 578 +1
Lines 37336 37426 +90
Branches 2139 2162 +23
============================================
+ Hits 18334 18425 +91
+ Misses 18963 18961 -2
- Partials 39 40 +1 |
Performance Report✔️ no performance regression detected Full benchmark results
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach here looks good.
I wonder if it can or should be tied to beacon health call in some way.
this.logger.debug("Connected beacon node is synced", {slot, ...syncingStatus}); | ||
} | ||
} catch (e) { | ||
this.logger.error("Failed to check syncing status", {}, e as Error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the UX of this (and can it be improved)? Most of the time we'll get two logs here, right? both health log and this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we shouldn't have two logs, that seems wrong, need to review
I think a good UX would be
verbose
: once a slot to print sync status ifisSyncing=false
warn
: once a slot ifisSyncing=true
error
: once a slot if failed to check status (e.g. due to node offline)
but in any case, just one of those, never multiple logs per slot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both health log and this one?
ah you mean a log related to polling health status api? There is no log like this as far as I know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining ux question for me right now is how to deal with existing "Node is syncing" logs which are printed out due to other api failures, like polling duties and are already throttled to once per slot.
I aligned the logs to be somewhat similar
but it might look a bit redundant because getProposerDuties
is called every slot right now. I am still not sure if removing those altogether is good because it still provides information about polling failures + includes error message of beacon node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the UX is clean now
info
log to inform user about sync status just once on startup or on resyncedwarn
every slot if not synced with details about head slot / sync distanceverbose
dump node syncing status every slot, including all data retrieved from api
There might be additional "Node is syncing" warnings due to duty polling attempts running into 503 errors but those are slightly different in nature as (1) they include error details (hence important) and (2) depend on lifecycle of imported keys and might not even fire but I think it's good to inform the user about the node sync status in any case (think pending validator).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could further consider the case if beacon node is synced but connected EL client is syncing or offline and emit a warning every slot. Might be out of scope though since status tracker service is primarily designed to trigger duties polling.
The syncing status call is more useful as the health status as it provides more detailed info instead of just the status code. Honestly we could consider removing the health status calls completely, the metric is unused and to observe health status of all nodes the approach to use the URL score seems better to me, and we already display this information on the validator client dashboard #6415. The health status api is mostly meant to be used by tooling like liveness / readiness probes in k8s or other health monitoring systems. |
would be in favor of that. it would also resolve the UX concern mentioned ^ |
Do we care about removing the metric, is anybody using it? I can't tell but likely not, I can migrate the logic to set the metric to syncing status tracker, it has all the information needed to set it as before |
Afaik its only used on the beacon side, not on the validator side (because of complications wrt maintaining state of backup beacon nodes). Would be in favor of removal in that case. |
I wonder if the infra team uses it for alerts @Faithtosin? It's mentioned here https://github.com/ChainSafe/lodestar-ansible-development/issues/57, and Lion tagged this as high prio issue #4637 but I fail to see how it was ever used, other than manually via exploring metrics via grafana or /metrics. |
@nflaig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Will you remove the health check + metric?
Have some stuff locally I need to push (mostly tests), was thinking about moving the metrics to syncing status tracker as we anyways have the data there already and we could add this to validator client dashboard, if we remove the metric there is no way from metrics to see if node is syncing. |
Added this panel to validator client dashboard now, I guess this was the original idea behind #4939 + could use this for alerting if infra team is interestd (cc @Faithtosin). It might be useful to detect if a validator client is only connected to a beacon node that is not ready to fulfill duties. But only helpful if there are no active keys yet on the validator client, otherwise should probably prefer to use failed duties as an alert trigger. |
🎉 This PR is included in v1.21.0 🎉 |
* feat: track syncing status and fetch duties on synced * Rename scheduling function to runOnResynced * Consider prev offline and syncing to trigger resynced event handlers * Add comment to error handler * Add note about el offline and sycning not considered * Align syncing status logs with existing node is syncing logs * Cleanup * Add ssz support to syncing status api * Align beacon node code to return proper types * Keep track of error in prev syncing status * Print slot in error log * Skip on first slot of epoch since tasks are already scheduled * Update api test data * Fix endpoint tests * await scheduled tasks, mostly relevant for testing * Add unit tests * Move beacon heath metric to syncing status tracker * Add beacon health panel to validator client dashboard * Formatting * Improve info called once assertion * Reset mocks after each test
Motivation
Connected beacon node might be syncing or is offline on validator client startup or on start of new epoch, in both cases, we wait until the next epoch to fetch validator duties again. This is not ideal as we can miss attestation and sync committee duties for a whole epoch (does not apply to block duties as we fetch those each slot).
Since all duty apis will return a 503 if node is syncing, we can keep track of syncing status and when it changes from syncing to synced we can fetch validator duties again as the request is very likely to succeed if previous syncing status check had no issues.
Tracking the syncing status of connected beacon node via
GET /eth/v1/node/syncing
is practically free and can be called once a slot, similar to tracking beacon health (see #4939).Description
getSyncingStatus
apiSame reasoning from #4939 applies
In a multi node setup, we will be tracking the syncing status of the first node that returns a successful response which in case of the getSyncingStatus should be most of the time as the only error condition is due to network errors, e.g. due to node being offline. In my opinion, this behavior is good enough considering how our fallback logic works and in a multi node setup, it is much less likely that all connected nodes are syncing and obtaining duties is much more reliable.
Closes #6962