You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Having spent some time by now browsing through logs, it got me thinking about the following.
The problem:
Duty is uniquely defined by slot, duty-type and validator
when troubleshooting the most important question we typically want to answer is what happened to a particular duty execution (why it failed, what took too long, ...). The current "reliable" approach seems to be to search logs by slot and then manually trim the result down. We can filter additionally by duty-type and validator (which can be represented either by index or by pubkey) but that becomes less "reliable" because with the current implementation different fields are added to logger in different places - it's hard to guarantee that every log line related to a particular duty contains all 3 ( slot+duty-type+validator(index || pubkey) ). Plus it's easy to make a mistake when composing such filter(s). And we might also find false-positive unrelated lines this way.
For committee duty, for example, this 3-filter pattern doesn't seem to work because we don't attach validator identifiers (index or pubkey) to logger (we only log committee validator(s) identifiers in a few select places). We can try to add it ofc, but implementing a single point of management for this instead of further spreading it out across whole code base (as well as do it at the earliest possible moment - which is the duty start / p2p message arrival - so we "mark" all the logs we can) seems like better long-term solution.
Suggested solution (general idea):
duty-trace-id is a unique duty identifier constructed as slot+duty-type+validator(index) (it's pretty similar to duty_id which is just a strictly "weaker" version of duty-trace-id, see more on it below)
duty-trace-id comes from 2 different sources: 1) locally started duty processing flow 2) messages received from peers as soon as msg is successfully parsed (but before msg validation kicks in)
duty-trace-id needs to be added/logged for every duty-related log (can be done by adding it to logger object via logger.With) but only for duty execution (we probably don't want to add duty-trace-id to duty fetching flow, but we can)
(optional, see alternative below) duty-trace-id needs to be added to every p2p duty-related message (so it can be immediately parsed and logged on the receiving side), it is is extra data to pass around in p2p network (although it's just 1 small field - so shouldn't make much difference) which is undesirable
Suggested solution (details + concerns):
committee duty doesn't 100% fit into the design outlined above for two reasons: 1) duty-trace-id must become duty-trace-ids to include all committee validators 2) we probably don't want to pass 13 duty-trace-ids in every duty-related message for committee duty, so the alternative would be implement a mechanism to re-construct duty-trace-ids from duty-related messages (perhaps not only specifically for committee duty but other duty types as well - to fully get rid of duty-trace-id in duty-related message(s)) immediately after such message is received and successfully parsed (or at least partially parsed so we can re-construct duty-trace-id)
we need to find a simple mechanism to pass duty-trace-id around the code to every place that needs it: 1) logger for every line logged 2) all external message senders IF we need to send duty-trace-id in p2p message(s); the most straightforward and futureproof way to do 2 is to create a wrapper for zap.Logger that will proxy every method call we need from its API but also expose a getter(s) for querying values previously set via logger.With (to fetch duty-trace-id wherever we need it)
this mechanism won't help with debugging lost p2p messages (still, it will probably be an improvement on what we currently have with respect to linking successfully received message with the relevant duty in logs at the earliest possible moment)
validator can be identified either by index or by pubkey, we probably want to use validator index for duty-trace-id (but can also look up corresponding pubkey and add it to logger at the same time duty-trace-id is added to have both present in logs at all times - currently pubkey log-field is playing this role)
if we want to differentiate local-originating and peer-originating duty execution events we can add suffix(s) to duty-trace-id (_local, _peer), but that's probably unnecessary given the amount of logs per unique duty is not that high; or maybe even better - we can just do logger.With("event_origin", "local") / logger.With("event_origin", "external") at the same time we do logger.With("duty_trace_id", dutyTraceID)
Additional note, there is already a field called duty_id (eg. ATTESTER-e88635-s2836320-v1675642) initialized by trySetDutyID func, but
it gets added to logger quite late in message processing cycle (in ProcessMessage func), so it doesn't show the full picture when we filter logs by duty_id (log messages like "❗ could not handle message" won't be found)
plus duty_id is not set for local node events, and I think we would benefit from linking these too with duty-trace-id
for committee duty it doesn't include validator index (eg. COMMITTEE-272_274_290_305-e88647-s2836722), so it's much less useful
So what we probably want to do is to rework (to extend) this duty_id concept in the way I described above.
The text was updated successfully, but these errors were encountered:
Having spent some time by now browsing through logs, it got me thinking about the following.
The problem:
Duty is uniquely defined by
slot
,duty-type
andvalidator
when troubleshooting the most important question we typically want to answer is what happened to a particular duty execution (why it failed, what took too long, ...). The current "reliable" approach seems to be to search logs by
slot
and then manually trim the result down. We can filter additionally byduty-type
andvalidator
(which can be represented either by index or by pubkey) but that becomes less "reliable" because with the current implementation different fields are added tologger
in different places - it's hard to guarantee that every log line related to a particular duty contains all 3 (slot
+duty-type
+validator(index || pubkey)
). Plus it's easy to make a mistake when composing such filter(s). And we might also find false-positive unrelated lines this way.For committee duty, for example, this 3-filter pattern doesn't seem to work because we don't attach
validator
identifiers (index or pubkey) tologger
(we only log committee validator(s) identifiers in a few select places). We can try to add it ofc, but implementing a single point of management for this instead of further spreading it out across whole code base (as well as do it at the earliest possible moment - which is the duty start / p2p message arrival - so we "mark" all the logs we can) seems like better long-term solution.Suggested solution (general idea):
duty-trace-id
is a unique duty identifier constructed asslot
+duty-type
+validator(index)
(it's pretty similar toduty_id
which is just a strictly "weaker" version ofduty-trace-id
, see more on it below)duty-trace-id
comes from 2 different sources: 1) locally started duty processing flow 2) messages received from peers as soon as msg is successfully parsed (but before msg validation kicks in)duty-trace-id
needs to be added/logged for every duty-related log (can be done by adding it tologger
object vialogger.With
) but only for duty execution (we probably don't want to addduty-trace-id
to duty fetching flow, but we can)duty-trace-id
needs to be added to every p2p duty-related message (so it can be immediately parsed and logged on the receiving side), it is is extra data to pass around in p2p network (although it's just 1 small field - so shouldn't make much difference) which is undesirableSuggested solution (details + concerns):
duty-trace-id
must becomeduty-trace-ids
to include all committee validators 2) we probably don't want to pass 13duty-trace-ids
in every duty-related message for committee duty, so the alternative would be implement a mechanism to re-constructduty-trace-ids
from duty-related messages (perhaps not only specifically for committee duty but other duty types as well - to fully get rid ofduty-trace-id
in duty-related message(s)) immediately after such message is received and successfully parsed (or at least partially parsed so we can re-constructduty-trace-id
)duty-trace-id
around the code to every place that needs it: 1)logger
for every line logged 2) all external message senders IF we need to sendduty-trace-id
in p2p message(s); the most straightforward and futureproof way to do 2 is to create a wrapper forzap.Logger
that will proxy every method call we need from its API but also expose a getter(s) for querying values previously set vialogger.With
(to fetchduty-trace-id
wherever we need it)validator
can be identified either by index or by pubkey, we probably want to use validator index forduty-trace-id
(but can also look up corresponding pubkey and add it tologger
at the same timeduty-trace-id
is added to have both present in logs at all times - currentlypubkey
log-field is playing this role)duty-trace-id
(_local
,_peer
), but that's probably unnecessary given the amount of logs per unique duty is not that high; or maybe even better - we can just dologger.With("event_origin", "local")
/logger.With("event_origin", "external")
at the same time we dologger.With("duty_trace_id", dutyTraceID)
Additional note, there is already a field called
duty_id
(eg.ATTESTER-e88635-s2836320-v1675642
) initialized bytrySetDutyID
func, butlogger
quite late in message processing cycle (inProcessMessage
func), so it doesn't show the full picture when we filter logs byduty_id
(log messages like"❗ could not handle message"
won't be found)duty_id
is not set for local node events, and I think we would benefit from linking these too withduty-trace-id
validator
index (eg.COMMITTEE-272_274_290_305-e88647-s2836722
), so it's much less usefulSo what we probably want to do is to rework (to extend) this
duty_id
concept in the way I described above.The text was updated successfully, but these errors were encountered: