Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define FaaS Metric Semantics #1052

Closed
wants to merge 7 commits into from
Closed

Conversation

kolanos
Copy link
Contributor

@kolanos kolanos commented Oct 5, 2020

Closes #1013

Changes

This pull request attempts to define the FaaS (Function as a Service) metrics semantics.

Related issues #

#1013

@kolanos kolanos force-pushed the issue/1013 branch 2 times, most recently from a0bda83 to ba19150 Compare October 5, 2020 16:03
@kolanos kolanos force-pushed the issue/1013 branch 3 times, most recently from 553cbe6 to d5602f1 Compare October 6, 2020 23:51
@kolanos kolanos marked this pull request as ready for review October 7, 2020 00:02
@kolanos kolanos requested review from a team October 7, 2020 00:02
@kolanos
Copy link
Contributor Author

kolanos commented Oct 7, 2020

Some questions I have:

  • Can we define metrics with multiple instruments? Such as having a metric be either a Counter or SumObserver and leave it up to the implementor?
  • I link to the FaaS trace semantics, such as for the Function Trigger Types. Do the metrics semantics need to define these types separately?

Copy link
Member

@justinfoote justinfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this submission @kolanos!
I have questions, some of which are definitely because of my limited understanding of faas.
And I've suggested some refactors,

But overall, this is awesome!


| Name | Instrument | Units | Description |
|------|------------|-------|-------------|
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all of these metrics, I'd like to see the span kind included in the namespacing. So for inbound invocations, I'd expect a name like faas.server.invoke_duration, or to align with the naming of HTTP and database metrics, faas.server.duration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinfoote I went with faas.invoke_duration because the trace semantics already use faas.invoked_*. Also not sure if "server" makes sense in a FaaS context, but open to hearing what others think.

See: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point about the word SERVER. (I mean, it's serverless, right? What's this talk about a server?)
But for incoming invocations, the tracing semantic convention uses the span.kind of SERVER: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md#incoming-invocations
And I'd like to keep as much correlation as possible between trace and metric data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the other semantic conventions are shaping up, we'll have duration recorded like this elsewhere:

  • http.server.duration
  • http.client.duration
  • db.client.duration
  • messaging.producer.duration
  • messaging.consumer.duration

And hopefully for grpc:

  • grpc.server.duration
  • grpc.client.duration

Then for FaaS:

  • faas.invoke_duration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinfoote I can revise to match this convention. But yes, "server" doesn't have a lot of meaning in a FaaS context as the server is abstracted away. My reading of the trace semantics use of "server" was in an HTTP context where a function is acting as either a server or a client. But HTTP is only one of many use cases for FaaS, so was hoping for a more generalized term like "invoke" or "invocation".

| Name | Instrument | Units | Description |
|------|------------|-------|-------------|
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really an expert in FAAS, so maybe this is obvious to some of you, but I don't know the relationship between invoke_duration and init_duration. Is init duration included in invoke duration?
Or does total duration = invoke_duration + init_duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinfoote Good question. AWS Lambda considers an invocation's duration inclusive of any initialization (cold starts). So a faas.invoke_duration could be considerably longer during a cold start than for subsequent invocations post-initialization. But cold starts do have a real effect in a FaaS context. So wanted to make sure faas.invoke_duration reflected this reality. But would welcome opinions on whether duration metrics should be exclusive or inclusive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.

I'm curious about how this is represented in tracing. I know that there's a top-level SERVER span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?

specification/metrics/semantic_conventions/faas-metrics.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/faas-metrics.md Outdated Show resolved Hide resolved
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |
| `faas.coldstarts` | Counter | number of cold starts | Number of invocation cold starts. |
| `faas.errors` | Counter | number of errors | Number of invocation errors. |
| `faas.executions` | counter | number of invocations | number of successful invocations. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to derive this count from the faas.server.duration ValueRecorder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinfoote That's true. I was unclear on what the preference is on capturing counts for efficient aggregation on the client side. If the preference is to tally these aggregates on the metric consumer side then I can remove this redundant metric.

| `faas.errors` | Counter | number of errors | Number of invocation errors. |
| `faas.executions` | counter | number of invocations | number of successful invocations. |
| `faas.timeouts` | counter | number of timeouts | number of invocation timeouts. A timeout is an execution that reaches or exceeds configured execution time limits. |
| `faas.throttles` | counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are timeouts and throttles represented in code as exceptions? What if we included exception.type on the duration instrument? This would allow us to count each exception type (if using NRQL, something like SELECT count(faas.server.duration) FACET exception.type)
This would also let us exclude timeouts from our duration metric, or to measure the duration of throttled requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinfoote Timeouts and throttles are both FaaS provider and runtime specific. For example, in AWS Lambda a timeout is detected within an invocation via a SIGTERM. It is also possible to synthesize a timeout in AWS Lambda via log messages Lambda writes when a timeout occurs.

Throttles are more complicated as they represent an exhaustion of concurrency and as a result an invocation does not occur in the first place. Metrics collection for throttles are also FaaS provider specific. Currently in AWS Lambda, for example, I believe the only way to get this metric is via CloudWatch metrics.

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Copy link
Member

@justinfoote justinfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes! This PR is getting marked stale because we've sat on it for so long. I'm hoping these comments get discussion flowing again, so we can get this wrapped up.

specification/metrics/semantic_conventions/faas-metrics.md Outdated Show resolved Hide resolved
| Name | Instrument | Units | Description |
|------|------------|-------|-------------|
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. |
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.

I'm curious about how this is represented in tracing. I know that there's a top-level SERVER span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?

specification/metrics/semantic_conventions/faas-metrics.md Outdated Show resolved Hide resolved
@kolanos
Copy link
Contributor Author

kolanos commented Nov 4, 2020

@thisthat Would love your feedback since you were involved with the equivalent trace spec.

@kolanos
Copy link
Contributor Author

kolanos commented Nov 4, 2020

@justinfoote I removed the faas.coldstarts, faas.errors and faas.executions Counter instruments per your suggestion.

@github-actions github-actions bot removed the Stale label Nov 5, 2020
CHANGELOG.md Outdated Show resolved Hide resolved
Comment on lines +10 to +17
- ref: faas.invoked_name
required: always
- ref: faas.invoked_provider
required: always
- ref: faas.invoked_region
required: always
- ref: faas.coldstart
required: always
Copy link
Member

@thisthat thisthat Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coldstart is for an incoming lambda and invoked_* are for an outgoing lambda. I don't think it is correct to request all of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree about coldstarts. Can add a condition there.

For invoked_*, doesn't this apply for incoming as well? Maybe I misunderstand the trace spec here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding, the invoked_* attributes specify which function is invoked from a client and not the execution of such a function. For a function that is being executed, the same attributes are available as resource attributes. Namely, the mapping is:
invoked_name -> faas.name
invoked_provider -> cloud.provider
invoked_region -> cloud.region


Naming conventions follow [FaaS Trace Semantics](/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md) wherever possible.

| Name | Recommended | Notes and examples |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can generate this table with:

<!-- semconv faas-metrics -->
<!-- endsemconv -->

and then you can use the semantic convention generator :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...not just yet. I'll have a PR to add metric semantic convention generation soon, and then we can update all the metric semantic conventions with generated tables in a single PR later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table looks identical to a semantic convention table. How do you plan to change the render of metric? But I guess this discussion does not belong to this PR, I will wait for your PR to update the tool :)

semantic_conventions/metrics/faas.yaml Outdated Show resolved Hide resolved
Comment on lines +25 to +26
| `faas.throttles` | Counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. |
| `faas.concurrent_executions` | UpDownCounter | number of concurrent executions | The current number of function instances that are processing events. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about these two. I don't think an instrumented function can report these values. I see these as metrics that a backend can compute aggregating the data it receives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these metrics limited to only values that can be collected from within a function? Every FaaS platform that I researched has a way to extract these metrics, however in most cases it is via an API external to the function itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me, but if the metrics are collected using an API, I think they'll need to use an asynchronous instrument. I think maybe this means it should be a UpDownSumObserver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. We should not specify only data that can be collected within a function! :)

required: always
- ref: faas.invoked_provider
required: always
- ref: faas.invoked_region
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it mandatory to have a region ? What is the target for that convention, public cloud 'faas' only or also 'faas' running on-premise or somewhere else (like OpenFaaS or Knative which probably also would fall into that realm, even when no 'faas' but just 'serverless'. Tbh, faas as metrics here feels too narrow as you can cover with this metrics also serverless deployments, that are not a 'faas' (e.g. not centered around functions as a programming model but e.g containers that are operated in a serverless manner).

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 14, 2020
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 21, 2020
@jgals
Copy link

jgals commented Dec 4, 2020

@kolanos and @jmacd, I believe this PR was close prematurely by the bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define metric semantic conventions for FaaS
8 participants