Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase tracing within hcsshim #1280

Closed
wants to merge 2 commits into from
Closed

Conversation

helsaawy
Copy link
Contributor

@helsaawy helsaawy commented Jan 19, 2022

Portions of hcsshim are un-instrumented in terms of debug logs and spans.

This PR increases the logging and tracing, and ties span generation to logging level.

Several opinions are taken within this PR:

  1. Spans are always generated and samples for major operations, even if logging level is Warning or higher. However, they are exported at log level Info. These events include:
    a. ttrpc spans of the form containerd.task.v2.Task.*
    b. containerd-shim-runhcs-v1 service spans of the form Create, State, etc...
  2. Spans are also created on public or long running methods and go routines (such as waitBackground) for code in internal/, but are only sampled if the logging level is higher than Trace.
    a. Other functions may use a .Trace( logging call to denote that a particular function is called, if they do not run long enough or are not deemed important enough to warrant the overhead of creating a span.
  3. Log entries are now stored within contexts, using the log.S() and log.WithContext(). They can be extracted with log.FromContext(). log.G() has also been updated to look for an entry stored in the context.
    a. This allows setting fields that will be present for all logs in the same or derived contexts.
  4. Most other logs generated in functions are at the Debug level, and are used to log what parameters or options are being used, that a particular object was created, or an action was finished
  5. Additionally, functions the send or receive requests over the bridge/guest connection should label their spans with kinds SpanKindClient or SpanKindServer, respectively. This is done automatically for TTRPC and gRPC servers.

Functions that are called from functions with sampled spans from containerd-shim-runhcs-v1 explicitly use StartTraceSpan to force their sampling to be tied to the logging level. Functions that are only called from internal code use the stock trace.StartSpace and inherit their sampler.

A nuance of this is that if a span is started with the LoggingLevelSampler or via oc.StartTraceSpan and the logging level is less than Trace, it will not be logged even if it registers an error and has a non-zero return status, since the span was disabled and will not reach the exporter stage.

Logging level can be passed in by the LogLevel setting in the containerd.toml file, so tracing can be enabled there.

@helsaawy helsaawy force-pushed the he/traces branch 6 times, most recently from a556667 to 8bd4fe6 Compare January 26, 2022 23:50
@helsaawy helsaawy force-pushed the he/traces branch 6 times, most recently from 59a8385 to 6d49325 Compare February 8, 2022 00:01
@helsaawy helsaawy force-pushed the he/traces branch 2 times, most recently from 04678d4 to d0a82a0 Compare February 8, 2022 23:31
@jterry75
Copy link
Contributor

I havent thought about this in ages so asking a few questions. For spans, don't we inherit the baggage for the trace statements? Is it better to have a param as baggage or as a specific trace? IE: PodId, TaskId, ContainerId, Pid seem like they should be on the span given they help to correlate all data. Or are we wanting them on the trace and use the spanid,traceid and look back to correlate?

@helsaawy
Copy link
Contributor Author

I havent thought about this in ages so asking a few questions. For spans, don't we inherit the baggage for the trace statements? Is it better to have a param as baggage or as a specific trace? IE: PodId, TaskId, ContainerId, Pid seem like they should be on the span given they help to correlate all data. Or are we wanting them on the trace and use the spanid,traceid and look back to correlate?

I was toying with that, especially if we put parameters only in the Spans, then debug statements from that same function will have no information if the spans are not printed. I was thinking about modifying log.G(ctx) to use the attributes added to its parent span to the returned log entry, but then (1) if the Span has a very large string attribute (ie, a whole json spec as a string), that will be printed for all logs in that span, and (2) if the Span is not sampled, adding attributes no-ops, so log statements wont have those fields.

I have no idea what the convention is, but the only reason I had for putting params solely in the spans and then correlating them back, was for performance reason, but I have no idea what the performance hit logging/tracing add.

@jterry75
Copy link
Contributor

My 2c, span's should have identifying attributes, added by the method itself as not to assume its caller and be created on all entry and exit scopes (ie: the proto api, or maybe a method call out to a kernel or opengcs we dont control).

IE: a method CreateContainer(ctx, id, config) (string, error) this span should have identifying information about the container, the id that is passed in from the caller, but still inherit any baggage from the caller if present. Some of this context came from the calling API, but the ContainerId is created in this method and thus is not part of the span. It should be added to the baggage of the calling span on return.

However, the config of the container should not be on the span. It's not identifying attributes. It should be a log statement upon entry to the method about the entry args and actions being taken. However, I get the problem with sampling so this can create its own host of issues.

But sometimes this doesnt exactly work. For example, containerd has CRI id -> Task Id -> Container ID -> Container Pid. From the calling span (CRI) we only know exactly one of the id's. From the gcs we know them all if they were forwarded. So correlation of a pid to a cri id is only possible via the span/trace id's. But the lowest level traces its very easy to correlate them. Make sense?

@helsaawy
Copy link
Contributor Author

My 2c, span's should have identifying attributes, added by the method itself as not to assume its caller and be created on all entry and exit scopes (ie: the proto api, or maybe a method call out to a kernel or opengcs we dont control).

IE: a method CreateContainer(ctx, id, config) (string, error) this span should have identifying information about the container, the id that is passed in from the caller, but still inherit any baggage from the caller if present. Some of this context came from the calling API, but the ContainerId is created in this method and thus is not part of the span. It should be added to the baggage of the calling span on return.

However, the config of the container should not be on the span. It's not identifying attributes. It should be a log statement upon entry to the method about the entry args and actions being taken. However, I get the problem with sampling so this can create its own host of issues.

But sometimes this doesnt exactly work. For example, containerd has CRI id -> Task Id -> Container ID -> Container Pid. From the calling span (CRI) we only know exactly one of the id's. From the gcs we know them all if they were forwarded. So correlation of a pid to a cri id is only possible via the span/trace id's. But the lowest level traces its very easy to correlate them. Make sense?

I think so, yeah. I'll try that out in the next week or so, hopefully.

@helsaawy helsaawy force-pushed the he/traces branch 7 times, most recently from 064cc68 to 366ed05 Compare February 16, 2022 22:06
@helsaawy helsaawy marked this pull request as ready for review February 16, 2022 22:13
@helsaawy helsaawy requested a review from a team as a code owner February 16, 2022 22:13
@helsaawy helsaawy marked this pull request as draft February 22, 2022 18:18
Updated span export to:
* include span kind
* log if span attributes were dropped
* format time as string instead

Added `log.S()` to set the log entry stored in the context with
provided fields. `log.G()` now checks the context for a stored context.
Added `log.Copy()` to add log entry and trace span from source context
to destination, allowing for duplicating contexts but not cancellation.
Added `log.U()` to update the context an entry (in the context) points
to, allowing it to reference the latest span and other information.
Added `oc.StartSpan` to set the context log-entries reference to the
newly created valaue.

Added helper `log.Format*` functions to format Time and other structs
to JSON, but only if the logging level is high enough that the
information would be logged.

Set span kind to client/server, as required, for bridge RPC calls.

Updated internal/cmd to use spans within `.Start()` and `.Wait()`

Reduced the number of `INFO` level logs printed, downgrading to
`DEBUG`.

Moved traces and spans from inside `Once.Do()`, so that they are
always logged, even if the `Once` does not execute.

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Removing unnecessary spans in the codebase and replacing with trace logs

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
@helsaawy helsaawy closed this May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants