Feature Request: add skaffold trace information (design + plumbing) #5756
Labels
area/performance
kind/design discussion
kind/feature-request
kind/todo (implementation task/epic for the skaffold team)
planning/Q2-21
priority/p1 (high-impact feature/bug)
Milestone
Comments
aaron-prindle added a commit to aaron-prindle/skaffold that referenced this issue on Apr 29, 2021
What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Updating our libs to the latest otel version adds useful functionality for tracing.
Why is this the best approach? This approach uses go mod (updated via "go get <pkg>") plus minor changes to our otel API usage.
What other approaches did you consider? N/A
What side effects will this approach have? There shouldn't be any side effects with this approach; the changes to otel's API were renames/moves of packages, functions, etc. The only option removed was stdout quantile aggregation (stdout.WithQuantiles), but I do not think this will have side effects. See open-telemetry/opentelemetry-go@49f699d#diff-2b283a7fb9f9b66e31a2b51a9ae9cad3599650a633f02fea9a956c4f6a714c6c
What future work remains to be done? N/A
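For context, the tracing API surface that the updated otel libs expose is roughly the following. This is a minimal sketch of standard opentelemetry-go usage, not skaffold's own code; the tracer name "skaffold" and the span names are purely illustrative.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func main() {
	ctx := context.Background()

	// Acquire a named tracer from the globally registered provider.
	// If no provider has been configured, this returns a no-op tracer.
	tracer := otel.Tracer("skaffold")

	// Wrap a unit of work in a span; End() records its duration.
	ctx, span := tracer.Start(ctx, "build")
	defer span.End()

	hashFiles(ctx) // pass ctx down so nested operations become child spans
}

func hashFiles(ctx context.Context) {
	_, span := otel.Tracer("skaffold").Start(ctx, "build/hash-files")
	defer span.End()
	// ...actual work would go here...
}
```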
trying to make triage-party happy :)
aaron-prindle added a commit that referenced this issue on May 2, 2021
aaron-prindle changed the title from "Feature Request: Skaffold Performance Observability - skaffold trace information" to "Feature Request: skaffold trace information (design + plumbing)" on May 5, 2021
aaron-prindle changed the title from "Feature Request: skaffold trace information (design + plumbing)" to "Feature Request: add skaffold trace information (design + plumbing)" on May 5, 2021
aaron-prindle added a commit to aaron-prindle/skaffold that referenced this issue on May 17, 2021
…rters
What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance-critical skaffold functions (identified in go/cloud-trace-skaffold). Also added four trace exporters: gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env-var-based enabling/disabling of tracing, for simplicity and to keep it hidden from users for now.
Why is this the best approach? Using opentelemetry tracing is the obvious choice, as we already use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and integrating the flag setup later was considered optimal, since skaffold tracing will currently be used for benchmarking and bottleneck identification in select use cases while the user-facing UX with jaeger, etc. is still being worked out. Additionally, there was the possibility of building tracing directly into skaffold events, but given the current wrapper setup in pkg/skaffold/instrumentation/trace.go (with the minimal code required) and the fact that many trace locations will not be event locations (e.g., how long it takes to hash a file), it makes sense not to integrate them.
What other approaches did you consider? N/A
What side effects will this approach have? There shouldn't be any side effects with this approach, as the default "off" for tracing and the minimal user visibility for now should mean it is used only experimentally for select use cases. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold.
What future work remains to be done? Future work includes wiring a --trace flag through dev, build, deploy, etc. and working out how skaffold might do distributed tracing with other tools (minikube, buildpacks, etc.).
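As a rough illustration of the env-var-gated approach described above, a wrapper along the following lines could install a real exporter only when SKAFFOLD_TRACE is set and otherwise leave otel's default no-op provider in place. This is a hypothetical sketch, not the actual contents of pkg/skaffold/instrumentation/trace.go: the function names are invented, only the stdout exporter case is shown, and the stdouttrace import path comes from a newer otel release than the one pinned at the time.

```go
package instrumentation

import (
	"context"
	"fmt"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// InitTracerFromEnv (hypothetical name) installs a real tracer provider only
// when SKAFFOLD_TRACE is set; when it is unset, spans stay no-ops and add no cost.
func InitTracerFromEnv() error {
	switch os.Getenv("SKAFFOLD_TRACE") {
	case "":
		return nil // tracing disabled
	case "stdout":
		exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
		if err != nil {
			return err
		}
		otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)))
		return nil
	default:
		// The gcp-skaffold, gcp-adc, and jaeger exporters would be wired up similarly.
		return fmt.Errorf("unsupported SKAFFOLD_TRACE value %q", os.Getenv("SKAFFOLD_TRACE"))
	}
}

// StartTrace keeps call sites in instrumented functions to a single line.
func StartTrace(ctx context.Context, name string) (context.Context, trace.Span) {
	return otel.Tracer("skaffold").Start(ctx, name)
}
```

A performance-critical function would then simply call StartTrace(ctx, "deploy/apply") and defer span.End(), paying essentially nothing when tracing is off.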
aaron-prindle added a commit to aaron-prindle/skaffold that referenced this issue on May 24, 2021
…rters
What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance-critical skaffold functions (identified in go/cloud-trace-skaffold). Also added four trace exporters: gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env-var-based enabling/disabling of tracing, for simplicity and to keep it hidden from users for now.
Why is this the best approach? Using opentelemetry tracing is the obvious choice, as we already use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and integrating the flag setup later was considered optimal, since skaffold tracing will currently be used for benchmarking and bottleneck identification in select use cases while the user-facing UX with jaeger, etc. is still being worked out.
What other approaches did you consider? There was the possibility of building tracing directly into skaffold events, but given the current wrapper setup in pkg/skaffold/instrumentation/trace.go (with the minimal code required) and the fact that many trace locations will not be event locations (e.g., how long it takes to hash a file), it makes sense not to integrate them.
What side effects will this approach have? There shouldn't be any side effects with this approach, as the default "off" for tracing and the minimal user visibility for now should mean it is used only experimentally for select use cases. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold.
What future work remains to be done? Future work includes wiring a --trace flag through dev, build, deploy, etc. and working out how skaffold might do distributed tracing with other tools (minikube, buildpacks, etc.). Additionally, the ability to allow more sporadic sampling (vs. AlwaysSample) should be added. Some future work mentioned in PR review: OTEL_TRACES_EXPORTER=* support (vs. SKAFFOLD_TRACE).
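To make these future-work items concrete, the sketch below shows how a provider could honor the standard OTEL_TRACES_EXPORTER variable and sample only a fraction of traces instead of using AlwaysSample. This is a speculative illustration of the suggested follow-ups, not implemented skaffold behavior; only the stdout exporter is handled and the 10% sampling ratio is arbitrary.

```go
package main

import (
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Hypothetical: key off the standard OTEL_TRACES_EXPORTER env var rather
	// than a skaffold-specific SKAFFOLD_TRACE variable.
	if os.Getenv("OTEL_TRACES_EXPORTER") != "stdout" {
		return // leave the default no-op provider in place
	}

	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		// Sample ~10% of traces instead of sdktrace.AlwaysSample().
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
	)
	otel.SetTracerProvider(tp)
}
```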
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 24, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 25, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 26, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 26, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 26, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
aaron-prindle
added a commit
to aaron-prindle/skaffold
that referenced
this issue
May 26, 2021
…rters What is the problem being solved? Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly. Why is this the best approach? Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out. What other approaches did you consider? There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them. What side effects will this approach have? There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold. What future work remains to be done? Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included: - OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
Fixed with #5854
For getting more information about skaffold's performance, and to ensure that skaffold does not have performance degradation over time, skaffold should add trace information to skaffold commands. Currently skaffold metrics can track the length of a skaffold dev session, but there is no "child" trace information showing how long specific actions took - yaml parsing, builds, deploys, etc. (a sketch of this kind of child-span instrumentation follows the list below). The steps that I have outlined for this work include:
- update the opentelemetry (otel) libs (v0.13.0 -> v0.20.0) for skaffold (to use updated APIs & trace implementation) - update otel libs from v0.13.0 -> v0.20.0 #5757
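As a rough illustration of the "child" trace information the issue asks for, the sketch below wraps one dev-loop iteration in a parent span and each step in a child span. The function and span names are invented for this example and are not skaffold's actual instrumentation.

```go
// Sketch: parent/child spans around the steps of one dev-loop iteration.
package example

import (
	"context"

	"go.opentelemetry.io/otel"
)

func runDevIteration(ctx context.Context) error {
	tracer := otel.Tracer("skaffold")

	// Parent span covering one dev-loop iteration.
	ctx, iteration := tracer.Start(ctx, "dev_iteration")
	defer iteration.End()

	// Child spans record how long each step took: parsing, building, deploying.
	for _, step := range []string{"parse_yaml", "build_artifacts", "deploy"} {
		_, span := tracer.Start(ctx, step)
		doStep(step)
		span.End()
	}
	return nil
}

// doStep is a placeholder; in skaffold each step would be the real
// parse/build/deploy logic whose duration the child span captures.
func doStep(name string) {}
```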