Improving observability of Spin #2293

calebschoepp · 2024-02-23T23:34:39Z

Observability is critical for a great developer experience. We should work to improve the observability of Spin, but that is a very vague statement. What exactly are we improving the observability of? Spin itself? Spin apps?

This issue is meant to act as a meta-issue that clarifies what we mean by "improving observability of Spin". It will provide a lay of the land by describing the different levels of observability within Spin that we want to improve. Other issues, SIPs, and PRs will be used to track the actual work of improving the observability and they can backlink to this meta-issue.

Before we dive in I want to note that OpenTelemetry has become the industry standard for observability data and is the standard we would want to conform to.

Types of observability in Spin

I propose that there are four types of observability in Spin that we want to enable. They exist on a spectrum from host-focused to guest-focused.

1) Runtime observability — observing the Spin runtime itself

Developers operating Spin in a production environment want observability into the state of the Spin process itself. This would include among other things:

Emit spans for any critical background work e.g. garbage collection.
Emit metrics on things like connection pool sizes.
Potentially emit Spin logs in an OTEL format?

Some notable non-requirements include:

Emitting metrics around CPU/memory usage. These should be collected via an agent on the node.
Emitting metrics around aggregate stats of things like request or error count. These should be created by aggregating the request observability data downstream.
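As a rough illustration of the two runtime signals listed above (a span around background work and a gauge for something like a connection pool size), here is a minimal std-only sketch. `Gauge`, `FinishedSpan`, and `in_span` are hypothetical names made up for this example; they are not Spin or OpenTelemetry APIs, and a real implementation would hand the data to an OTEL exporter instead of returning it.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical gauge for a value such as the number of open pooled connections.
pub struct Gauge {
    name: &'static str,
    value: AtomicI64,
}

impl Gauge {
    pub const fn new(name: &'static str) -> Self {
        Self { name, value: AtomicI64::new(0) }
    }

    /// Adjusts the gauge, e.g. +1 when a connection opens and -1 when it closes.
    pub fn add(&self, delta: i64) {
        self.value.fetch_add(delta, Ordering::Relaxed);
    }

    pub fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }

    pub fn name(&self) -> &'static str {
        self.name
    }
}

/// Hypothetical record of one completed span.
pub struct FinishedSpan {
    pub name: String,
    pub duration: Duration,
}

/// Runs `work` and returns its result together with a span describing how long it took.
pub fn in_span<T>(name: &str, work: impl FnOnce() -> T) -> (T, FinishedSpan) {
    let start = Instant::now();
    let out = work();
    let span = FinishedSpan { name: name.to_string(), duration: start.elapsed() };
    (out, span)
}
```

Wrapping a background task would then look like `in_span("garbage_collection", || run_gc())`, where `run_gc` stands in for whatever work the runtime is doing.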

2) Trigger observability — observing the requests made to Spin applications

Developers want observability into the requests that are made to their Spin application. This would include among other things:

Emit span when a Spin application is triggered with metadata about the trigger event.
Support trace context propagation: extract the trace context from incoming request headers and pass it along on outbound calls. Provide configuration in spin.toml to enable or disable this for security reasons.
Emit metrics about Spin application trigger events.
Potentially emit Spin application logs in an OTEL format?
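To make the propagation item above more concrete, here is a minimal sketch of what it could involve, assuming the W3C Trace Context `traceparent` header (version `00`) is the format used. The `TraceParent` type, the `outbound_traceparent` helper, and the `propagate` flag (standing in for whatever setting spin.toml would expose) are hypothetical and not part of Spin.

```rust
/// Parsed `traceparent` header, version 00: `00-<trace-id>-<parent-id>-<flags>`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,  // 32 lowercase hex characters
    pub parent_id: String, // 16 lowercase hex characters
    pub flags: u8,
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl TraceParent {
    /// Returns `None` for anything that is not a well-formed version-00 header.
    pub fn parse(header: &str) -> Option<Self> {
        let parts: Vec<&str> = header.trim().split('-').collect();
        if parts.len() != 4 || parts[0] != "00" {
            return None;
        }
        let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
        if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
            return None;
        }
        // All-zero IDs are invalid in the W3C format.
        if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
            return None;
        }
        Some(Self {
            trace_id: trace_id.to_string(),
            parent_id: parent_id.to_string(),
            flags: u8::from_str_radix(flags, 16).ok()?,
        })
    }

    /// Builds the header for an outbound call: same trace, new (non-zero) span ID as parent.
    pub fn child_header(&self, span_id: u64) -> String {
        format!("00-{}-{:016x}-{:02x}", self.trace_id, span_id, self.flags)
    }
}

/// Hypothetical gate: only forward trace context when propagation is enabled.
pub fn outbound_traceparent(propagate: bool, incoming: Option<&str>, span_id: u64) -> Option<String> {
    if !propagate {
        return None;
    }
    incoming.and_then(TraceParent::parse).map(|tp| tp.child_header(span_id))
}
```

Dropping malformed headers instead of forwarding them keeps untrusted input from leaking into outbound requests, which is one reason the enable/disable switch matters.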

3) Component observability — observing the interaction between composed components

Developers will create their Spin applications from a composition of components. Ideally we can automatically emit spans as the component composition graph is traversed and components are executed. This would include among other things:

Emit span when each Wasm component is executed.
Support propagating trace context between components.

This would require upstream modifications in Wasmtime.
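For a feel of what "a span per component" could produce, here is a small sketch that records parent/child span relationships as a composition graph is walked. It is purely illustrative: `Tracer`, `SpanRecord`, and `in_component` are hypothetical, and nothing here reflects how Wasmtime would actually expose component boundaries.

```rust
/// Hypothetical record of one component execution.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanRecord {
    pub id: u64,
    pub parent: Option<u64>,
    pub name: String,
}

/// Hypothetical tracer that keeps a stack of the spans currently open.
#[derive(Default)]
pub struct Tracer {
    next_id: u64,
    stack: Vec<u64>,
    pub finished: Vec<SpanRecord>,
}

impl Tracer {
    /// Opens a span for `name`, runs `body` (which may call into further
    /// components), then records the span with whichever span was open
    /// at entry as its parent.
    pub fn in_component<T>(&mut self, name: &str, body: impl FnOnce(&mut Tracer) -> T) -> T {
        self.next_id += 1;
        let id = self.next_id;
        let parent = self.stack.last().copied();
        self.stack.push(id);
        let out = body(self);
        self.stack.pop();
        self.finished.push(SpanRecord { id, parent, name: name.to_string() });
        out
    }
}
```

Spans are recorded when they end, so children appear in `finished` before their parent, the same order in which a tracing backend would typically receive them.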

4) Guest observability — observing the code within the guest module

Developers want to be able to instrument their own guest code. This allows them to emit telemetry with spans, metadata, and metrics unique to their own use case. We are reliant on the upstream WASI Observe proposal to make this happen. The upstream proposal has the clearest definition of requirements, but briefly for Spin to act as a host implementation we would require:

A host component that satisfies the WASI Observe WIT interface.
Potentially modifications to the Spin SDKs to support emitting telemetry (metrics, spans, etc.)?

Other observability related things

Here are some other observability related things we might want to do to make the experience better in Spin.

Streamline the process of collecting and viewing the observability data

The four types of observability outlined in the above section all just emit telemetry and expect that there is a collector running somewhere to collect the data. It would be good to clearly document the process of running a collector for any users who don't already use a specific collector in their environment.

If we really wanted to streamline the experience, we could take this one step further and build this collector into Spin (or a plugin or an app like KV explorer).
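One piece of that documentation would be how Spin finds the collector. A minimal sketch, assuming Spin follows the standard OpenTelemetry exporter environment variables (`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` used as-is, otherwise `OTEL_EXPORTER_OTLP_ENDPOINT` with `/v1/traces` appended, defaulting to the OTLP/HTTP address `http://localhost:4318`). The function names are hypothetical and whether Spin resolves the endpoint this way is an assumption.

```rust
/// Resolves the OTLP/HTTP traces endpoint from a variable lookup.
/// `get` is injected so the logic can be exercised without touching the real environment.
pub fn traces_endpoint(get: impl Fn(&str) -> Option<String>) -> String {
    // The signal-specific variable takes precedence and is used unchanged.
    if let Some(url) = get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT") {
        return url;
    }
    // Otherwise start from the generic endpoint (or the default) and add the traces path.
    let base = get("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|| "http://localhost:4318".to_string());
    format!("{}/v1/traces", base.trim_end_matches('/'))
}

/// Convenience wrapper that reads the process environment.
pub fn traces_endpoint_from_env() -> String {
    traces_endpoint(|key| std::env::var(key).ok())
}
```

With nothing set, this points at a collector listening on the local OTLP/HTTP port, which keeps the "just run a collector next to Spin" case free of configuration.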

Create an observability standard that other Spin runtimes can match

Spin is not the only Spin runtime. Observability should be implemented in Spin in a way that lets other Spin runtimes follow suit.

Prior art

SIP: Opentelemetry integration proposal #655 is a SIP that proposed observability at both level 2 (trigger observability) and level 4 (guest observability).

calebschoepp · 2024-02-23T23:48:23Z

Here is an example of what a trace might look like when levels 2 through 4 are combined.

calebschoepp · 2024-02-23T23:49:38Z

Trigger observability seems like the most tractable and immediate problem, so I'm going to get started on a SIP for how we could implement it.

macolso · 2024-02-27T17:28:22Z

Question for my own understanding: is CPU / memory utilization considered a runtime or guest metric? For example, Azure Application Insights emits a metric called Process CPU, which shows how much of the total processor capacity is consumed by the process that is hosting your monitored app. I would consider Application Insights a tool for guest observability so this seems like a grey area.

calebschoepp · 2024-02-27T17:43:14Z

> Question for my own understanding: is CPU / memory utilization considered a runtime or guest metric? For example, Azure Application Insights emits a metric called Process CPU, which shows how much of the total processor capacity is consumed by the process that is hosting your monitored app. I would consider Application Insights a tool for guest observability so this seems like a grey area.

I suppose it could be considered both. We might want to emit CPU/Memory utilization from the trigger observability i.e. how much CPU/Memory did an invocation of an app use. This would be considered guest metrics. Someone could also use an agent on the node to collect the CPU/Memory utilization of Spin itself and this would be a runtime metric.

I'm not really sure if this answers your question though because your question seems specific to the semantics of App Insights which I'm not really familiar with.

calebschoepp · 2024-02-27T18:21:17Z

@rylev had a good suggestion that we should make sure to clearly document our patterns around spans, e.g. how we name them, what metadata they carry, and when we emit them. That way the traces that get created can be more consistent and useful.

https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/trace/api.md#span
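As one example of the kind of pattern that could be written down, here is a sketch of a naming helper for HTTP trigger spans. It assumes we follow the OpenTelemetry HTTP convention of naming server spans `{method} {route}` with the route template (not the raw path) to keep span names low-cardinality; `http_trigger_span_name` is a hypothetical helper, not an existing Spin function.

```rust
/// Builds a low-cardinality span name for an HTTP trigger.
/// `route` should be the matched route template (e.g. "/users/:id"),
/// never the concrete request path.
pub fn http_trigger_span_name(method: &str, route: Option<&str>) -> String {
    let method = method.to_ascii_uppercase();
    match route {
        Some(r) => format!("{} {}", method, r),
        // Without a matched route, fall back to the method alone.
        None => method,
    }
}
```

Centralizing the naming in one helper like this would make it harder for individual triggers to drift from the documented pattern.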

calebschoepp · 2024-03-13T16:15:08Z

Seeing as this is a meta-issue tracking a lot of work I'm wondering if it shouldn't be set in progress. @vdice what do you think?

vdice · 2024-03-13T17:00:14Z

@calebschoepp 👍 Sounds good. Thanks!

agardnerIT · 2024-03-29T21:15:38Z

As someone with Observability experience and a CNCF ambassador, please LMK if I can assist here. I am happy to act in a vendor-neutral consultant role.

lann · 2024-03-29T21:29:52Z

@agardnerIT Thanks! The most recent work in progress is at #2398 if you are interested in following.

calebschoepp · 2024-06-14T20:45:47Z

This work is sufficiently far along that I'm closing this initial ticket.

lann mentioned this issue Feb 26, 2024

Service chaining SIP #2290

Merged

calebschoepp mentioned this issue Feb 26, 2024

Create a Grafana dashboard(s) to present metrics generated by a SpinApp and Spin Operator spinframework/spin-operator#16

Open

This was referenced Mar 11, 2024

SIP: Opentelemetry integration proposal #655

Closed

feat(*): Implement the skeleton of an OTEL observability system #2348

Merged

vdice added this to Spin Triage Mar 13, 2024

vdice moved this to 🆕 Triage Needed in Spin Triage Mar 13, 2024

vdice assigned calebschoepp Mar 13, 2024

vdice moved this from 🆕 Triage Needed to 🏗 In progress in Spin Triage Mar 13, 2024

vdice moved this from 🏗 In progress to 🔖 Backlog in Spin Triage Mar 13, 2024

vdice unassigned calebschoepp Mar 13, 2024

calebschoepp closed this as completed Jun 14, 2024

github-project-automation bot moved this from 🔖 Backlog to ✅ Done in Spin Triage Jun 14, 2024
