Increase tracing within hcsshim #1280

helsaawy · 2022-01-19T19:31:37Z

Portions of hcsshim are un-instrumented in terms of debug logs and spans.

This PR increases the logging and tracing, and ties span generation to logging level.

Several opinions are taken within this PR:

Spans are always generated and samples for major operations, even if logging level is Warning or higher. However, they are exported at log level Info. These events include:
a. ttrpc spans of the form containerd.task.v2.Task.*
b. containerd-shim-runhcs-v1 service spans of the form Create, State, etc...
Spans are also created on public or long running methods and go routines (such as waitBackground) for code in internal/, but are only sampled if the logging level is higher than Trace.
a. Other functions may use a .Trace( logging call to denote that a particular function is called, if they do not run long enough or are not deemed important enough to warrant the overhead of creating a span.
Log entries are now stored within contexts, using the log.S() and log.WithContext(). They can be extracted with log.FromContext(). log.G() has also been updated to look for an entry stored in the context.
a. This allows setting fields that will be present for all logs in the same or derived contexts.
Most other logs generated in functions are at the Debug level, and are used to log what parameters or options are being used, that a particular object was created, or an action was finished
Additionally, functions the send or receive requests over the bridge/guest connection should label their spans with kinds SpanKindClient or SpanKindServer, respectively. This is done automatically for TTRPC and gRPC servers.

Functions that are called from functions with sampled spans from containerd-shim-runhcs-v1 explicitly use StartTraceSpan to force their sampling to be tied to the logging level. Functions that are only called from internal code use the stock trace.StartSpace and inherit their sampler.

A nuance of this is that if a span is started with the LoggingLevelSampler or via oc.StartTraceSpan and the logging level is less than Trace, it will not be logged even if it registers an error and has a non-zero return status, since the span was disabled and will not reach the exporter stage.

Logging level can be passed in by the LogLevel setting in the containerd.toml file, so tracing can be enabled there.

cmd/containerd-shim-runhcs-v1/exec_clone.go

jterry75 · 2022-02-10T18:14:09Z

I havent thought about this in ages so asking a few questions. For spans, don't we inherit the baggage for the trace statements? Is it better to have a param as baggage or as a specific trace? IE: PodId, TaskId, ContainerId, Pid seem like they should be on the span given they help to correlate all data. Or are we wanting them on the trace and use the spanid,traceid and look back to correlate?

helsaawy · 2022-02-10T20:07:46Z

I havent thought about this in ages so asking a few questions. For spans, don't we inherit the baggage for the trace statements? Is it better to have a param as baggage or as a specific trace? IE: PodId, TaskId, ContainerId, Pid seem like they should be on the span given they help to correlate all data. Or are we wanting them on the trace and use the spanid,traceid and look back to correlate?

I was toying with that, especially if we put parameters only in the Spans, then debug statements from that same function will have no information if the spans are not printed. I was thinking about modifying log.G(ctx) to use the attributes added to its parent span to the returned log entry, but then (1) if the Span has a very large string attribute (ie, a whole json spec as a string), that will be printed for all logs in that span, and (2) if the Span is not sampled, adding attributes no-ops, so log statements wont have those fields.

I have no idea what the convention is, but the only reason I had for putting params solely in the spans and then correlating them back, was for performance reason, but I have no idea what the performance hit logging/tracing add.

jterry75 · 2022-02-10T20:40:03Z

My 2c, span's should have identifying attributes, added by the method itself as not to assume its caller and be created on all entry and exit scopes (ie: the proto api, or maybe a method call out to a kernel or opengcs we dont control).

IE: a method CreateContainer(ctx, id, config) (string, error) this span should have identifying information about the container, the id that is passed in from the caller, but still inherit any baggage from the caller if present. Some of this context came from the calling API, but the ContainerId is created in this method and thus is not part of the span. It should be added to the baggage of the calling span on return.

However, the config of the container should not be on the span. It's not identifying attributes. It should be a log statement upon entry to the method about the entry args and actions being taken. However, I get the problem with sampling so this can create its own host of issues.

But sometimes this doesnt exactly work. For example, containerd has CRI id -> Task Id -> Container ID -> Container Pid. From the calling span (CRI) we only know exactly one of the id's. From the gcs we know them all if they were forwarded. So correlation of a pid to a cri id is only possible via the span/trace id's. But the lowest level traces its very easy to correlate them. Make sense?

helsaawy · 2022-02-10T21:23:51Z

My 2c, span's should have identifying attributes, added by the method itself as not to assume its caller and be created on all entry and exit scopes (ie: the proto api, or maybe a method call out to a kernel or opengcs we dont control).

IE: a method CreateContainer(ctx, id, config) (string, error) this span should have identifying information about the container, the id that is passed in from the caller, but still inherit any baggage from the caller if present. Some of this context came from the calling API, but the ContainerId is created in this method and thus is not part of the span. It should be added to the baggage of the calling span on return.

However, the config of the container should not be on the span. It's not identifying attributes. It should be a log statement upon entry to the method about the entry args and actions being taken. However, I get the problem with sampling so this can create its own host of issues.

But sometimes this doesnt exactly work. For example, containerd has CRI id -> Task Id -> Container ID -> Container Pid. From the calling span (CRI) we only know exactly one of the id's. From the gcs we know them all if they were forwarded. So correlation of a pid to a cri id is only possible via the span/trace id's. But the lowest level traces its very easy to correlate them. Make sense?

I think so, yeah. I'll try that out in the next week or so, hopefully.

Updated span export to: * include span kind * log if span attributes were dropped * format time as string instead Added `log.S()` to set the log entry stored in the context with provided fields. `log.G()` now checks the context for a stored context. Added `log.Copy()` to add log entry and trace span from source context to destination, allowing for duplicating contexts but not cancellation. Added `log.U()` to update the context an entry (in the context) points to, allowing it to reference the latest span and other information. Added `oc.StartSpan` to set the context log-entries reference to the newly created valaue. Added helper `log.Format*` functions to format Time and other structs to JSON, but only if the logging level is high enough that the information would be logged. Set span kind to client/server, as required, for bridge RPC calls. Updated internal/cmd to use spans within `.Start()` and `.Wait()` Reduced the number of `INFO` level logs printed, downgrading to `DEBUG`. Moved traces and spans from inside `Once.Do()`, so that they are always logged, even if the `Once` does not execute. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>

Removing unnecessary spans in the codebase and replacing with trace logs Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>

helsaawy · 2022-05-09T19:37:14Z

Broken out into:

helsaawy force-pushed the he/traces branch 6 times, most recently from a556667 to 8bd4fe6 Compare January 26, 2022 23:50

helsaawy force-pushed the he/traces branch 6 times, most recently from 59a8385 to 6d49325 Compare February 8, 2022 00:01

helsaawy force-pushed the he/traces branch 2 times, most recently from 04678d4 to d0a82a0 Compare February 8, 2022 23:31

jterry75 reviewed Feb 10, 2022

View reviewed changes

cmd/containerd-shim-runhcs-v1/exec_clone.go Outdated Show resolved Hide resolved

helsaawy force-pushed the he/traces branch 7 times, most recently from 064cc68 to 366ed05 Compare February 16, 2022 22:06

helsaawy marked this pull request as ready for review February 16, 2022 22:13

helsaawy requested a review from a team as a code owner February 16, 2022 22:13

helsaawy marked this pull request as draft February 22, 2022 18:18

helsaawy force-pushed the he/traces branch from 366ed05 to 5083326 Compare April 20, 2022 01:54

helsaawy force-pushed the he/traces branch from 5083326 to 472d211 Compare April 20, 2022 21:21

Removing spans from guest and other areas

d67f083

Removing unnecessary spans in the codebase and replacing with trace logs Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>

helsaawy force-pushed the he/traces branch from fc5dc5e to d67f083 Compare April 25, 2022 17:22

helsaawy closed this May 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase tracing within hcsshim #1280

Increase tracing within hcsshim #1280

helsaawy commented Jan 19, 2022 •

edited

Loading

jterry75 commented Feb 10, 2022

helsaawy commented Feb 10, 2022

jterry75 commented Feb 10, 2022

helsaawy commented Feb 10, 2022

helsaawy commented May 9, 2022

Increase tracing within hcsshim #1280

Increase tracing within hcsshim #1280

Conversation

helsaawy commented Jan 19, 2022 • edited Loading

jterry75 commented Feb 10, 2022

helsaawy commented Feb 10, 2022

jterry75 commented Feb 10, 2022

helsaawy commented Feb 10, 2022

helsaawy commented May 9, 2022

helsaawy commented Jan 19, 2022 •

edited

Loading