Enhancement: tracing support in Prow jobs #30010
Comments
I wrote up more about shell tracing. Not directly related but likely to be used in parallel: https://blog.howardjohn.info/posts/shell-tracing/
/sig testing
/assign @cjwagner @petr-muller
This will be used in conjunction with kubernetes/test-infra#30010 and https://blog.howardjohn.info/posts/shell-tracing/ to give us tracing of job execution, to help understand and analyze job/test execution better. The tool is 18MB, so pretty low cost.
Sorry for not getting to this sooner; things are hard to follow in summer, between vacations and catching up with the backlog after coming back from vacation. I am very fond of the proposed feature in general, but will need to read the proposal more closely to discuss the details - I will get to that this week.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What would you like to be added: Support for distributed tracing in Prow. More details on what this means are below.
Why is this needed: To give visibility into job execution, both in a single job and in aggregate.
The end result we are looking for is to be able to generate a trace roughly like the following:
This was done via a POC; I think the real one can have more information.
Prior Art
https://gitlab.com/gitlab-org/gitlab/-/issues/338943
https://buildkite.com/docs/agent/v3/tracing
https://plugins.jenkins.io/opentelemetry/
https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md
Implementation
Prow job tracing primarily involves two parts: the infrastructure components and the actual test logic. These should be formed into a single cohesive trace (see the picture above; test logic is in yellow).
Test Logic
For the most part, how a test handles tracing is outside the scope of Prow - it is the job author's responsibility. However, one aspect that needs care is ensuring that spans reported by the test attach to the same trace as the infrastructure spans.
This is done by propagation. In distributed systems this is typically handled via HTTP headers (`traceparent`), which doesn't work here. While there is no ratified standard for propagation outside of HTTP, there is a growing de-facto standard (see the prior art above) of using a `TRACEPARENT` environment variable (open-telemetry/opentelemetry-specification#740). This seems well suited. The environment variable will need to be passed into the Pod environment and respected when the job sends traces.

Sending traces from the job is fairly straightforward from that point on. Job authors will need to configure the job to send to the same tracing backend, of course, but otherwise can just send traces as normal. One issue may be that many jobs are largely bash; https://github.com/equinix-labs/otel-cli seems well suited to handle those cases.
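As an illustration only (not part of the proposal), here is a minimal sketch of how a Go-based test binary could join the infrastructure trace via `TRACEPARENT`, assuming the OpenTelemetry Go SDK. The instrumentation and span names are made up, and exporter setup is omitted (see the Configuration section below):

```go
package main

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Extract the parent trace context from the TRACEPARENT environment
	// variable; it carries the same W3C value normally sent in the
	// "traceparent" HTTP header.
	carrier := propagation.MapCarrier{"traceparent": os.Getenv("TRACEPARENT")}
	ctx := propagation.TraceContext{}.Extract(context.Background(), carrier)

	// Any span started from ctx is now a child of the infrastructure trace.
	tracer := otel.Tracer("prow-test-logic") // made-up instrumentation name
	ctx, span := tracer.Start(ctx, "run-integration-tests")
	defer span.End()

	runTests(ctx) // placeholder for the job's real test logic
}

func runTests(ctx context.Context) {}
```

For bash-based jobs, otel-cli plays the equivalent role, so the same propagation idea should carry over without custom code.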
Prow Infra
For the infra side, we will need to report spans about a variety of things. I think some interesting things to measure are:
`git` operations within `clonerefs`
I think there are two main approaches to this:

The first is a `tracing` reporter. This can look at the ProwJob and maybe other artifacts (`clone-records.json`) and compute the spans after the fact (it's perfectly fine to send spans out of order and in the past). This is POCed in https://github.com/howardjohn/prow-tracing (as a standalone binary that is pointed at a historic job). This approach seems the least invasive to me, and is pretty effective I think. One concern is that since we are creating the spans after the job runs, we cannot set the `TRACEPARENT` environment variable on the job. There are a few options here. Either we do a bit of the next option and add just the root span outside of the reporter, or we can abuse the fact that trace IDs are globally unique 16 bytes -- just like the prowjob build UID. Using this fact, we can always create the root span with an ID derived from the build, and test execution can use `PROW_BUILD_ID` when `TRACEPARENT` is not set (or that variable can be set automatically by Prow). This approach is taken in the POC above; see the sketch after these two options.

The second, rather than retroactive analysis, is to do 'proper' tracing and integrate it throughout Prow. This would allow us to generate extremely fine-grained traces about whatever we want. The risk is that it permeates the entire codebase, unlike the reporter mode, which is completely standalone.
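To make the retroactive approach concrete, here is a rough sketch (not the POC's actual code) of a reporter that derives the trace ID from the build ID and emits spans with historical timestamps, assuming the OpenTelemetry Go SDK. The hash-based mapping, names, and times are invented for illustration, and exporter setup is omitted:

```go
package main

import (
	"context"
	"crypto/sha256"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// deterministicContext builds a parent span context whose trace ID (and a
// placeholder root span ID) are derived by hashing the Prow build ID, so
// spans reported after the fact and spans sent by the test itself can land
// on the same trace. The exact mapping is an open design choice.
func deterministicContext(buildID string) context.Context {
	sum := sha256.Sum256([]byte(buildID))
	var tid trace.TraceID
	var sid trace.SpanID
	copy(tid[:], sum[:16])
	copy(sid[:], sum[16:24])
	parent := trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    tid,
		SpanID:     sid,
		TraceFlags: trace.FlagsSampled,
		Remote:     true,
	})
	return trace.ContextWithSpanContext(context.Background(), parent)
}

// reportSpan emits a span for an already-finished step, using the recorded
// start and end times instead of "now" -- spans can be sent out of order
// and in the past.
func reportSpan(ctx context.Context, name string, start, end time.Time) {
	tracer := otel.Tracer("prow-tracing-reporter") // made-up instrumentation name
	_, span := tracer.Start(ctx, name, trace.WithTimestamp(start))
	span.End(trace.WithTimestamp(end))
}

func main() {
	ctx := deterministicContext(os.Getenv("PROW_BUILD_ID"))

	// In a real reporter these times would come from the ProwJob status and
	// artifacts such as clone-records.json; here they are invented.
	end := time.Now()
	start := end.Add(-30 * time.Second)
	reportSpan(ctx, "clonerefs: kubernetes/test-infra", start, end)
}
```

Because the trace ID here is a pure function of the build ID, the test side can reconstruct the same parent even though `TRACEPARENT` was never set on the Pod.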
Configuration
I propose this only supports OpenTelemetry, which is the only recommended option these days. Within OTel, though, a variety of "exporters" are allowed. The primary one is OTLP, a common protocol implemented by many vendors. In addition, OTel offers a collector which accepts OTLP and can do a variety of things, including exporting to virtually any backend.
One notable vendor that does not support OTLP is GCP tracing. I think most Prow users are using GCP, so this is a natural backend to use.
We could support OTLP + GCP, or just OTLP, and GCP users can deploy a collector.
So overall, I think we will only need a couple of config items: the collector endpoint and maybe a few others.
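For a sense of how small that configuration surface could be, here is a hedged sketch of the wiring using the OpenTelemetry Go SDK's OTLP gRPC exporter. This is not a proposed Prow config format; the endpoint and service name are placeholders:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracing sends spans to an OTLP endpoint. The endpoint (and whether to
// use TLS) is essentially the whole configuration surface.
func initTracing(ctx context.Context, endpoint string) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "otel-collector:4317" (placeholder)
		otlptracegrpc.WithInsecure(),         // a real deployment would likely use TLS
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("prow"), // made-up service name
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracing(ctx, "otel-collector:4317")
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)
	// ... create spans via otel.Tracer(...) as usual ...
}
```

GCP users could point such an endpoint at an OpenTelemetry Collector that re-exports to their tracing backend, which is the "just OTLP" option above.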