Enhancement: tracing support in Prow jobs #30010
Comments
I wrote up more about shell tracing. Not directly related but likely to be used in parallel: https://blog.howardjohn.info/posts/shell-tracing/
/sig testing
/assign @cjwagner @petr-muller
This will be used in conjunction with kubernetes/test-infra#30010 and https://blog.howardjohn.info/posts/shell-tracing/ to give us tracing of job execution, to help understand and analyze job/test execution better. The tool is 18MB, so pretty low cost.
Sorry for not getting to this sooner; things are hard to follow in summer, between vacations and catching up with the backlog after coming back from vacation. I am very fond of the proposed feature in general, but will need to read the proposal more closely to discuss the details - I will get to that this week.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What would you like to be added: Support for distributed tracing in Prow. More details on what this means are below.
Why is this needed: To give visibility into job execution, both in a single job and in aggregate.
The end result we are looking for is to be able to generate a trace roughly like the following:
This was done via a POC; I think the real one can have more information.
Prior Art
https://gitlab.com/gitlab-org/gitlab/-/issues/338943
https://buildkite.com/docs/agent/v3/tracing
https://plugins.jenkins.io/opentelemetry/
https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md
Implementation
Prow job tracing primarily involves two parts: the infrastructure components and the actual test logic. These should be formed into a single cohesive trace (see the picture above; test logic is in yellow).
Test Logic
For the most part, how a test handles tracing is outside the scope of Prow - it is the job author's responsibility. However, one aspect that needs care is ensuring that spans reported by the test attach to the same trace as the infrastructure spans.
This is done by propagation. In distributed systems this is typically handled via HTTP headers (`traceparent`), which doesn't work here. While there is no ratified standard for propagation outside of HTTP, there is a growing de-facto standard (see the prior art above) of using a `TRACEPARENT` environment variable (open-telemetry/opentelemetry-specification#740). This seems well suited. The environment variable will need to be passed into the Pod environment and respected when the job sends traces.

Sending traces from the job is fairly straightforward from that point on. Job authors will need to configure the job to send to the same tracing backend, of course, but otherwise can just send traces as normal. One issue may be that many jobs are largely bash; https://github.com/equinix-labs/otel-cli seems well suited to handle those cases.
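As an illustration only (not part of the proposal), here is a minimal sketch of how a Go-based test binary could join the infrastructure trace via `TRACEPARENT`, assuming the OpenTelemetry Go SDK. The instrumentation and span names are made up, and exporter setup is omitted (see the Configuration section below):

```go
package main

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Extract the parent trace context from the TRACEPARENT environment
	// variable; it carries the same W3C value normally sent in the
	// "traceparent" HTTP header.
	carrier := propagation.MapCarrier{"traceparent": os.Getenv("TRACEPARENT")}
	ctx := propagation.TraceContext{}.Extract(context.Background(), carrier)

	// Any span started from ctx is now a child of the infrastructure trace.
	tracer := otel.Tracer("prow-test-logic") // made-up instrumentation name
	ctx, span := tracer.Start(ctx, "run-integration-tests")
	defer span.End()

	runTests(ctx) // placeholder for the job's real test logic
}

func runTests(ctx context.Context) {}
```

For bash-based jobs, otel-cli plays the equivalent role, so the same propagation idea should carry over without custom code.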
Prow Infra
For the infra side, we will need to report spans about a variety of things. I think some interesting things to measure are:
`git` operations within `clonerefs`
I think there are two main approaches to this:

The first is a `tracing` reporter. This can look at the ProwJob and maybe other artifacts (`clone-records.json`) and compute the spans after the fact (it's perfectly fine to send spans out of order and in the past). This is POCed in https://github.com/howardjohn/prow-tracing (as a standalone binary that is pointed at a historic job). This approach seems the least invasive to me, and is pretty effective I think. One concern is that since we are creating the spans after the job runs, we cannot set the `TRACEPARENT` environment variable on the job. There are a few options here. Either we do a bit of the next option and add just the root span outside of the reporter, or we can abuse the fact that trace IDs are globally unique 16 bytes -- just like the prowjob build UID. Using this fact, we can always create the root span with an ID derived from the build, and test execution can use `PROW_BUILD_ID` when `TRACEPARENT` is not set (or that variable can be set automatically by Prow). This approach is taken in the POC above; see the sketch after these two options.

The second, rather than retroactive analysis, is to do 'proper' tracing and integrate it throughout Prow. This would allow us to generate extremely fine-grained traces about whatever we want. The risk is that it permeates the entire codebase, unlike the reporter mode, which is completely standalone.
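To make the retroactive approach concrete, here is a rough sketch (not the POC's actual code) of a reporter that derives the trace ID from the build ID and emits spans with historical timestamps, assuming the OpenTelemetry Go SDK. The hash-based mapping, names, and times are invented for illustration, and exporter setup is omitted:

```go
package main

import (
	"context"
	"crypto/sha256"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// deterministicContext builds a parent span context whose trace ID (and a
// placeholder root span ID) are derived by hashing the Prow build ID, so
// spans reported after the fact and spans sent by the test itself can land
// on the same trace. The exact mapping is an open design choice.
func deterministicContext(buildID string) context.Context {
	sum := sha256.Sum256([]byte(buildID))
	var tid trace.TraceID
	var sid trace.SpanID
	copy(tid[:], sum[:16])
	copy(sid[:], sum[16:24])
	parent := trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    tid,
		SpanID:     sid,
		TraceFlags: trace.FlagsSampled,
		Remote:     true,
	})
	return trace.ContextWithSpanContext(context.Background(), parent)
}

// reportSpan emits a span for an already-finished step, using the recorded
// start and end times instead of "now" -- spans can be sent out of order
// and in the past.
func reportSpan(ctx context.Context, name string, start, end time.Time) {
	tracer := otel.Tracer("prow-tracing-reporter") // made-up instrumentation name
	_, span := tracer.Start(ctx, name, trace.WithTimestamp(start))
	span.End(trace.WithTimestamp(end))
}

func main() {
	ctx := deterministicContext(os.Getenv("PROW_BUILD_ID"))

	// In a real reporter these times would come from the ProwJob status and
	// artifacts such as clone-records.json; here they are invented.
	end := time.Now()
	start := end.Add(-30 * time.Second)
	reportSpan(ctx, "clonerefs: kubernetes/test-infra", start, end)
}
```

Because the trace ID here is a pure function of the build ID, the test side can reconstruct the same parent even though `TRACEPARENT` was never set on the Pod.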
Configuration
I propose this only supports OpenTelemetry, which is the only recommended option these days. Within OTel, though, a variety of "exporters" are allowed. The primary one is OTLP, a common protocol implemented by many vendors. In addition, OTel offers a collector which accepts OTLP and can do a variety of things, including exporting to virtually any backend.
One notable vendor that does not support OTLP is GCP tracing. I think most Prow users are using GCP, so this is a natural backend to use.
We could support OTLP + GCP, or just OTLP, and GCP users can deploy a collector.
So overall, I think we will only need a couple of config items: the collector endpoint and maybe a few others.
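For a sense of how small that configuration surface could be, here is a hedged sketch of the wiring using the OpenTelemetry Go SDK's OTLP gRPC exporter. This is not a proposed Prow config format; the endpoint and service name are placeholders:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracing sends spans to an OTLP endpoint. The endpoint (and whether to
// use TLS) is essentially the whole configuration surface.
func initTracing(ctx context.Context, endpoint string) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "otel-collector:4317" (placeholder)
		otlptracegrpc.WithInsecure(),         // a real deployment would likely use TLS
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("prow"), // made-up service name
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracing(ctx, "otel-collector:4317")
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)
	// ... create spans via otel.Tracer(...) as usual ...
}
```

GCP users could point such an endpoint at an OpenTelemetry Collector that re-exports to their tracing backend, which is the "just OTLP" option above.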