logs: duty_trace_id #1828

iurii-ssv · 2024-10-29T14:05:44Z

Having spent some time by now browsing through logs, it got me thinking about the following.

The problem:

Duty is uniquely defined by slot, duty-type and validator

when troubleshooting the most important question we typically want to answer is what happened to a particular duty execution (why it failed, what took too long, ...). The current "reliable" approach seems to be to search logs by slot and then manually trim the result down. We can filter additionally by duty-type and validator (which can be represented either by index or by pubkey) but that becomes less "reliable" because with the current implementation different fields are added to logger in different places - it's hard to guarantee that every log line related to a particular duty contains all 3 ( slot+duty-type+validator(index || pubkey) ). Plus it's easy to make a mistake when composing such filter(s). And we might also find false-positive unrelated lines this way.

For committee duty, for example, this 3-filter pattern doesn't seem to work because we don't attach validator identifiers (index or pubkey) to logger (we only log committee validator(s) identifiers in a few select places). We can try to add it ofc, but implementing a single point of management for this instead of further spreading it out across whole code base (as well as do it at the earliest possible moment - which is the duty start / p2p message arrival - so we "mark" all the logs we can) seems like better long-term solution.

Suggested solution (general idea):

duty-trace-id is a unique duty identifier constructed as slot+duty-type+validator(index) (it's pretty similar to duty_id which is just a strictly "weaker" version of duty-trace-id, see more on it below)
duty-trace-id comes from 2 different sources: 1) locally started duty processing flow 2) messages received from peers as soon as msg is successfully parsed (but before msg validation kicks in)
duty-trace-id needs to be added/logged for every duty-related log (can be done by adding it to logger object via logger.With) but only for duty execution (we probably don't want to add duty-trace-id to duty fetching flow, but we can)
(optional, see alternative below) duty-trace-id needs to be added to every p2p duty-related message (so it can be immediately parsed and logged on the receiving side), it is is extra data to pass around in p2p network (although it's just 1 small field - so shouldn't make much difference) which is undesirable

Suggested solution (details + concerns):

committee duty doesn't 100% fit into the design outlined above for two reasons: 1) duty-trace-id must become duty-trace-ids to include all committee validators 2) we probably don't want to pass 13 duty-trace-ids in every duty-related message for committee duty, so the alternative would be implement a mechanism to re-construct duty-trace-ids from duty-related messages (perhaps not only specifically for committee duty but other duty types as well - to fully get rid of duty-trace-id in duty-related message(s)) immediately after such message is received and successfully parsed (or at least partially parsed so we can re-construct duty-trace-id)
we need to find a simple mechanism to pass duty-trace-id around the code to every place that needs it: 1) logger for every line logged 2) all external message senders IF we need to send duty-trace-id in p2p message(s); the most straightforward and futureproof way to do 2 is to create a wrapper for zap.Logger that will proxy every method call we need from its API but also expose a getter(s) for querying values previously set via logger.With (to fetch duty-trace-id wherever we need it)
this mechanism won't help with debugging lost p2p messages (still, it will probably be an improvement on what we currently have with respect to linking successfully received message with the relevant duty in logs at the earliest possible moment)
validator can be identified either by index or by pubkey, we probably want to use validator index for duty-trace-id (but can also look up corresponding pubkey and add it to logger at the same time duty-trace-id is added to have both present in logs at all times - currently pubkey log-field is playing this role)
if we want to differentiate local-originating and peer-originating duty execution events we can add suffix(s) to duty-trace-id (_local, _peer), but that's probably unnecessary given the amount of logs per unique duty is not that high; or maybe even better - we can just do logger.With("event_origin", "local") / logger.With("event_origin", "external") at the same time we do logger.With("duty_trace_id", dutyTraceID)

Additional note, there is already a field called duty_id (eg. ATTESTER-e88635-s2836320-v1675642) initialized by trySetDutyID func, but

it gets added to logger quite late in message processing cycle (in ProcessMessage func), so it doesn't show the full picture when we filter logs by duty_id (log messages like "❗ could not handle message" won't be found)
plus duty_id is not set for local node events, and I think we would benefit from linking these too with duty-trace-id
for committee duty it doesn't include validator index (eg. COMMITTEE-272_274_290_305-e88647-s2836722), so it's much less useful

So what we probably want to do is to rework (to extend) this duty_id concept in the way I described above.

The text was updated successfully, but these errors were encountered:

iurii-ssv added the enhancement New feature or request label Oct 29, 2024

zktaiga assigned oleg-ssvlabs Oct 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logs: duty_trace_id #1828

logs: duty_trace_id #1828

iurii-ssv commented Oct 29, 2024 •

edited

Loading

logs: duty_trace_id #1828

logs: duty_trace_id #1828

Comments

iurii-ssv commented Oct 29, 2024 • edited Loading

iurii-ssv commented Oct 29, 2024 •

edited

Loading