Define FaaS Metric Semantics #1052

kolanos · 2020-10-05T04:30:44Z

Closes #1013

Changes

This pull request attempts to define the FaaS (Function as a Service) metrics semantics.

Related issues #

#1013

specification/metrics/semantic_conventions/faas-metrics.md

kolanos · 2020-10-07T00:05:53Z

Some questions I have:

Can we define metrics with multiple instruments? Such as having a metric be either a Counter or SumObserver and leave it up to the implementor?
I link to the FaaS trace semantics, such as for the Function Trigger Types. Do the metrics semantics need to define these types separately?

specification/metrics/semantic_conventions/faas-metrics.md

justinfoote

Thanks for this submission @kolanos!
I have questions, some of which are definitely because of my limited understanding of FaaS.
And I've suggested some refactors.

But overall, this is awesome!

justinfoote · 2020-10-08T15:23:33Z

specification/metrics/semantic_conventions/faas-metrics.md

+
+| Name | Instrument | Units | Description |
+|------|------------|-------|-------------|
+| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |


In all of these metrics, I'd like to see the span kind included in the namespacing. So for inbound invocations, I'd expect a name like faas.server.invoke_duration, or to align with the naming of HTTP and database metrics, faas.server.duration.

@justinfoote I went with faas.invoke_duration because the trace semantics already use faas.invoked_*. Also not sure if "server" makes sense in a FaaS context, but open to hearing what others think.

See: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md

I see your point about the word SERVER. (I mean, it's serverless, right? What's this talk about a server?)
But for incoming invocations, the tracing semantic convention uses the span.kind of SERVER: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md#incoming-invocations
And I'd like to keep as much correlation as possible between trace and metric data.

As the other semantic conventions are shaping up, we'll have duration recorded like this elsewhere:

http.server.duration

http.client.duration

db.client.duration

messaging.producer.duration

messaging.consumer.duration

And hopefully for grpc:

grpc.server.duration

grpc.client.duration

Then for FaaS:

faas.invoke_duration

@justinfoote I can revise to match this convention. But yes, "server" doesn't have a lot of meaning in a FaaS context, as the server is abstracted away. My reading of the trace semantics' use of "server" was in an HTTP context, where a function is acting as either a server or a client. But HTTP is only one of many use cases for FaaS, so I was hoping for a more generalized term like "invoke" or "invocation".
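
A minimal sketch of the naming pattern under discussion, assuming names are composed as `<namespace>.<span kind>.duration`. The helper `duration_metric_name` is hypothetical and only illustrates how a FaaS name would line up with the HTTP and database examples listed above.

```python
# Hypothetical helper, not part of any OpenTelemetry package.
def duration_metric_name(namespace: str, span_kind: str) -> str:
    """Return '<namespace>.<span kind>.duration' with a lowercase span kind."""
    return f"{namespace}.{span_kind.lower()}.duration"


names = [
    duration_metric_name("http", "SERVER"),
    duration_metric_name("db", "CLIENT"),
    duration_metric_name("faas", "SERVER"),
]
print(names)
```

Under this pattern, the inbound FaaS metric would share its span kind segment with the SERVER span that the trace conventions use for incoming invocations.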

justinfoote · 2020-10-08T15:25:51Z

specification/metrics/semantic_conventions/faas-metrics.md

+| Name | Instrument | Units | Description |
+|------|------------|-------|-------------|
+| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |
+| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |


I'm not really an expert in FaaS, so maybe this is obvious to some of you, but I don't know the relationship between invoke_duration and init_duration. Is init duration included in invoke duration?
Or does total duration = invoke_duration + init_duration?

@justinfoote Good question. AWS Lambda considers an invocation's duration inclusive of any initialization (cold starts). So a faas.invoke_duration could be considerably longer during a cold start than for subsequent invocations post-initialization. But cold starts do have a real effect in a FaaS context, so I wanted to make sure faas.invoke_duration reflected this reality. I would welcome opinions on whether duration metrics should be exclusive or inclusive.
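
To make the inclusive versus exclusive question concrete, here is a small sketch. The `Invocation` type and the millisecond values are made up for illustration; they are not taken from the PR or from any provider.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    init_ms: float     # initialization time; 0 when the instance is already warm
    handler_ms: float  # time the function spent processing the event


def inclusive_duration(inv: Invocation) -> float:
    """Invoke duration that includes initialization (cold start) time."""
    return inv.init_ms + inv.handler_ms


def exclusive_duration(inv: Invocation) -> float:
    """Invoke duration that leaves initialization to a separate init metric."""
    return inv.handler_ms


cold = Invocation(init_ms=800.0, handler_ms=120.0)
warm = Invocation(init_ms=0.0, handler_ms=120.0)
print(inclusive_duration(cold), exclusive_duration(cold), inclusive_duration(warm))
```

With the inclusive definition, the cold invocation reports a much longer duration than the warm one even though the handler did the same work. With the exclusive definition, the two match and the difference shows up only in the init duration.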

This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.

I'm curious about how this is represented in tracing. I know that there's a top-level SERVER span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?

specification/metrics/semantic_conventions/faas-metrics.md

justinfoote · 2020-10-08T15:27:56Z

specification/metrics/semantic_conventions/faas-metrics.md

+| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |
+| `faas.coldstarts` | Counter | number of cold starts | Number of invocation cold starts. |
+| `faas.errors` | Counter | number of errors | Number of invocation errors. |
+| `faas.executions` | counter | number of invocations | number of successful invocations. |


It should be possible to derive this count from the faas.server.duration ValueRecorder.

@justinfoote That's true. I was unclear on what the preference is on capturing counts for efficient aggregation on the client side. If the preference is to tally these aggregates on the metric consumer side then I can remove this redundant metric.
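
A sketch of why the separate count may be redundant, assuming the duration recorder is aggregated into a summary that keeps a count, such as min/max/sum/count. The `MinMaxSumCount` class below is a toy stand-in written for this example, not an SDK class, and the durations are made up.

```python
class MinMaxSumCount:
    """Toy min/max/sum/count summary; a stand-in, not an SDK class."""

    def __init__(self) -> None:
        self.min = float("inf")
        self.max = float("-inf")
        self.sum = 0.0
        self.count = 0

    def record(self, value: float) -> None:
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        self.sum += value
        self.count += 1


duration = MinMaxSumCount()  # plays the role of the duration ValueRecorder
for elapsed_ms in (920.0, 120.0, 135.0):  # made-up invocation durations
    duration.record(elapsed_ms)

# The number of recorded invocations falls out of the aggregation itself.
print(duration.count, duration.sum)
```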

justinfoote · 2020-10-08T15:30:20Z

specification/metrics/semantic_conventions/faas-metrics.md

+| `faas.errors` | Counter | number of errors | Number of invocation errors. |
+| `faas.executions` | counter | number of invocations | number of successful invocations. |
+| `faas.timeouts` | counter | number of timeouts | number of invocation timeouts. A timeout is an execution that reaches or exceeds configured execution time limits. |
+| `faas.throttles` | counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. |


Are timeouts and throttles represented in code as exceptions? What if we included exception.type on the duration instrument? This would allow us to count each exception type (if using NRQL, something like SELECT count(faas.server.duration) FACET exception.type)
This would also let us exclude timeouts from our duration metric, or measure the duration of throttled requests.
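
The idea can be sketched in plain Python, assuming each duration recording carries an `exception.type` label. The recordings and the exception type names below are made up for illustration.

```python
from collections import Counter

# (duration in ms, exception.type label or None); all values are made up.
recordings = [
    (120.0, None),
    (135.0, None),
    (3000.0, "TimeoutError"),
    (15.0, "ValueError"),
]

# Count per exception type, analogous to faceting the duration count by label.
counts_by_type = Counter(exc_type for _, exc_type in recordings)

# Average duration with timeouts excluded.
non_timeout = [ms for ms, exc_type in recordings if exc_type != "TimeoutError"]
average_without_timeouts = sum(non_timeout) / len(non_timeout)

print(counts_by_type, average_without_timeouts)
```

This only works for failures that surface inside an invocation, since a recording has to exist for the label to be attached to it.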

@justinfoote Timeouts and throttles are both FaaS provider and runtime specific. For example, in AWS Lambda a timeout is detected within an invocation via a SIGTERM. It is also possible to synthesize a timeout in AWS Lambda via log messages Lambda writes when a timeout occurs.

Throttles are more complicated, as they represent an exhaustion of concurrency and, as a result, an invocation does not occur in the first place. Metrics collection for throttles is also FaaS provider specific. Currently in AWS Lambda, for example, I believe the only way to get this metric is via CloudWatch metrics.

specification/metrics/semantic_conventions/faas-metrics.md

github-actions · 2020-10-30T03:15:05Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

justinfoote

Yikes! This PR is getting marked stale because we've sat on it for so long. I'm hoping these comments get discussion flowing again, so we can get this wrapped up.

specification/metrics/semantic_conventions/faas-metrics.md

justinfoote · 2020-10-30T16:37:10Z

specification/metrics/semantic_conventions/faas-metrics.md

+| Name | Instrument | Units | Description |
+|------|------------|-------|-------------|
+| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |
+| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |


This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.

I'm curious about how this is represented in tracing. I know that there's a top-level SERVER span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?

specification/metrics/semantic_conventions/faas-metrics.md

kolanos · 2020-11-04T17:07:23Z

@thisthat Would love your feedback since you were involved with the equivalent trace spec.

kolanos · 2020-11-04T18:04:28Z

@justinfoote I removed the faas.coldstarts, faas.errors and faas.executions Counter instruments per your suggestion.

CHANGELOG.md

thisthat · 2020-11-05T07:24:19Z

semantic_conventions/metrics/faas.yaml

+        - ref: faas.invoked_name
+          required: always
+        - ref: faas.invoked_provider
+          required: always
+        - ref: faas.invoked_region
+          required: always
+        - ref: faas.coldstart
+          required: always


coldstart is for an incoming lambda and invoked_* are for an outgoing lambda. I don't think it is correct to require all of them.

Agree about coldstarts. Can add a condition there.

For invoked_*, doesn't this apply for incoming as well? Maybe I misunderstand the trace spec here.

To my understanding, the invoked_* attributes specify which function is invoked from a client and not the execution of such a function. For a function that is being executed, the same attributes are available as resource attributes. Namely, the mapping is:
invoked_name -> faas.name
invoked_provider -> cloud.provider
invoked_region -> cloud.region
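
The mapping above can be written down as a small lookup. This is only a sketch: the `resource_view` helper and the sample attribute values are hypothetical, while the attribute names come from the mapping in this comment.

```python
# Client-side invoked_* attribute -> resource attribute on the executing function.
INVOKED_TO_RESOURCE = {
    "faas.invoked_name": "faas.name",
    "faas.invoked_provider": "cloud.provider",
    "faas.invoked_region": "cloud.region",
}


def resource_view(client_attributes: dict) -> dict:
    """Rename invoked_* keys to their resource equivalents; keep other keys."""
    return {INVOKED_TO_RESOURCE.get(k, k): v for k, v in client_attributes.items()}


# Sample values are made up for illustration.
outgoing = {
    "faas.invoked_name": "my-function",
    "faas.invoked_provider": "aws",
    "faas.invoked_region": "us-east-1",
}
print(resource_view(outgoing))
```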

thisthat · 2020-11-05T07:34:05Z

specification/metrics/semantic_conventions/faas-metrics.md

+
+Naming conventions follow [FaaS Trace Semantics](/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md) wherever possible.
+
+| Name | Recommended | Notes and examples |


You can generate this table with:

 

and then you can use the semantic convention generator :)

...not just yet. I'll have a PR to add metric semantic convention generation soon, and then we can update all the metric semantic conventions with generated tables in a single PR later.

This table looks identical to a semantic convention table. How do you plan to change how metrics are rendered? But I guess this discussion does not belong in this PR; I will wait for your PR to update the tool :)

semantic_conventions/metrics/faas.yaml

thisthat · 2020-11-05T07:40:41Z

specification/metrics/semantic_conventions/faas-metrics.md

+| `faas.throttles` | Counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. |
+| `faas.concurrent_executions` | UpDownCounter | number of concurrent executions | The current number of function instances that are processing events. |


I am not sure about these two. I don't think an instrumented function can report these values. I see these as metrics that a backend can compute aggregating the data it receives.

Are these metrics limited to only values that can be collected from within a function? Every FaaS platform that I researched has a way to extract these metrics, however in most cases it is via an API external to the function itself.

That makes sense to me, but if the metrics are collected using an API, I think they'll need to use an asynchronous instrument. I think maybe this means it should be an UpDownSumObserver.

I think you are right. We should not specify only data that can be collected within a function! :)
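
A rough sketch of the asynchronous pattern discussed in this thread, where the value comes from a callback that is polled at collection time instead of being recorded from inside the function. `ObserverSketch` and `fetch_concurrent_executions` are hypothetical stand-ins written for this example; this is not the OpenTelemetry API, and the returned value is made up.

```python
from typing import Callable


class ObserverSketch:
    """Stand-in for an asynchronous instrument; not the OpenTelemetry API."""

    def __init__(self, name: str, callback: Callable[[], int]) -> None:
        self.name = name
        self._callback = callback

    def collect(self) -> int:
        # Called by the collection loop, not by the function being measured.
        return self._callback()


def fetch_concurrent_executions() -> int:
    # Hypothetical placeholder for a request to the provider's metrics API.
    return 7


observer = ObserverSketch("faas.concurrent_executions", fetch_concurrent_executions)
print(observer.name, observer.collect())
```

The design point is that the callback can live outside the function's own process, which fits metrics that are only available from an external provider API.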

Closes open-telemetry#1013

Co-authored-by: Giovanni Liva <giovanni.liva@dynatrace.com>

rhuss · 2020-11-05T21:30:52Z

semantic_conventions/metrics/faas.yaml

+          required: always
+        - ref: faas.invoked_provider
+          required: always
+        - ref: faas.invoked_region


Is it mandatory to have a region? What is the target for this convention: public cloud 'faas' only, or also 'faas' running on-premise or somewhere else (like OpenFaaS or Knative, which probably would also fall into that realm, even when it is not 'faas' but just 'serverless')? Tbh, faas as the metrics namespace here feels too narrow, as these metrics can also cover serverless deployments that are not 'faas' (e.g. not centered around functions as a programming model, but containers that are operated in a serverless manner).

github-actions · 2020-11-14T03:19:08Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2020-11-21T03:20:53Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

jgals · 2020-12-04T16:14:39Z

@kolanos and @jmacd, I believe this PR was closed prematurely by the bot.

kolanos force-pushed the issue/1013 branch 2 times, most recently from a0bda83 to ba19150 Compare October 5, 2020 16:03

zivsalzmanappd reviewed Oct 6, 2020

kolanos force-pushed the issue/1013 branch 3 times, most recently from 553cbe6 to d5602f1 Compare October 6, 2020 23:51

kolanos marked this pull request as ready for review October 7, 2020 00:02

kolanos requested review from a team October 7, 2020 00:02

arminru reviewed Oct 7, 2020

justinfoote reviewed Oct 8, 2020

justinfoote mentioned this pull request Oct 9, 2020

Update Metrics Semantic Conventions README #1084

Open

kolanos force-pushed the issue/1013 branch 3 times, most recently from f94c184 to 06e2f37 Compare October 15, 2020 07:56

jmacd approved these changes Oct 15, 2020

kolanos force-pushed the issue/1013 branch from 06e2f37 to e93a1a3 Compare October 22, 2020 15:54

github-actions bot added the Stale label Oct 30, 2020

kolanos force-pushed the issue/1013 branch from b0850b5 to f23990a Compare October 30, 2020 11:10

justinfoote reviewed Oct 30, 2020

justinfoote mentioned this pull request Nov 3, 2020

SMI Metrics and OpenTelemetry servicemeshinterface/smi-spec#199

Closed

kolanos force-pushed the issue/1013 branch from f23990a to f769881 Compare November 4, 2020 16:40

github-actions bot removed the Stale label Nov 5, 2020

thisthat reviewed Nov 5, 2020

Define FaaS Metric Semantics

0bd0ef3

Closes open-telemetry#1013

kolanos and others added 6 commits November 5, 2020 07:32

Add concurrent executions, references to trace semantics

44d9225

Add YAML spec, update changelog

d666a30

Update YAML references

864e40a

Fix counter capitalization

7e8c294

Removing coldstart, error and execution counter instruments

1219aab

Apply suggestions from code review

b6519a3

Co-authored-by: Giovanni Liva <giovanni.liva@dynatrace.com>

kolanos force-pushed the issue/1013 branch from 08c5afe to b6519a3 Compare November 5, 2020 15:35

rhuss reviewed Nov 5, 2020

github-actions bot added the Stale label Nov 14, 2020

github-actions bot closed this Nov 21, 2020
