-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define FaaS Metric Semantics #1052
Conversation
a0bda83
to
ba19150
Compare
553cbe6
to
d5602f1
Compare
Some questions I have:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this submission @kolanos!
I have questions, some of which are definitely because of my limited understanding of faas.
And I've suggested some refactors,
But overall, this is awesome!
|
||
| Name | Instrument | Units | Description | | ||
|------|------------|-------|-------------| | ||
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In all of these metrics, I'd like to see the span kind included in the namespacing. So for inbound invocations, I'd expect a name like faas.server.invoke_duration
, or to align with the naming of HTTP and database metrics, faas.server.duration
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinfoote I went with faas.invoke_duration
because the trace semantics already use faas.invoked_*
. Also not sure if "server" makes sense in a FaaS context, but open to hearing what others think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point about the word SERVER
. (I mean, it's serverless, right? What's this talk about a server?)
But for incoming invocations, the tracing semantic convention uses the span.kind of SERVER
: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md#incoming-invocations
And I'd like to keep as much correlation as possible between trace and metric data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the other semantic conventions are shaping up, we'll have duration recorded like this elsewhere:
http.server.duration
http.client.duration
db.client.duration
messaging.producer.duration
messaging.consumer.duration
And hopefully for grpc:
grpc.server.duration
grpc.client.duration
Then for FaaS:
faas.invoke_duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinfoote I can revise to match this convention. But yes, "server" doesn't have a lot of meaning in a FaaS context as the server is abstracted away. My reading of the trace semantics use of "server" was in an HTTP context where a function is acting as either a server or a client. But HTTP is only one of many use cases for FaaS, so was hoping for a more generalized term like "invoke" or "invocation".
| Name | Instrument | Units | Description | | ||
|------|------------|-------|-------------| | ||
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. | | ||
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really an expert in FAAS, so maybe this is obvious to some of you, but I don't know the relationship between invoke_duration and init_duration. Is init duration included in invoke duration?
Or does total duration = invoke_duration + init_duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinfoote Good question. AWS Lambda considers an invocation's duration inclusive of any initialization (cold starts). So a faas.invoke_duration
could be considerably longer during a cold start than for subsequent invocations post-initialization. But cold starts do have a real effect in a FaaS context. So wanted to make sure faas.invoke_duration
reflected this reality. But would welcome opinions on whether duration metrics should be exclusive or inclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.
I'm curious about how this is represented in tracing. I know that there's a top-level SERVER
span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start | | ||
| `faas.coldstarts` | Counter | number of cold starts | Number of invocation cold starts. | | ||
| `faas.errors` | Counter | number of errors | Number of invocation errors. | | ||
| `faas.executions` | counter | number of invocations | number of successful invocations. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to derive this count from the faas.server.duration
ValueRecorder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinfoote That's true. I was unclear on what the preference is on capturing counts for efficient aggregation on the client side. If the preference is to tally these aggregates on the metric consumer side then I can remove this redundant metric.
| `faas.errors` | Counter | number of errors | Number of invocation errors. | | ||
| `faas.executions` | counter | number of invocations | number of successful invocations. | | ||
| `faas.timeouts` | counter | number of timeouts | number of invocation timeouts. A timeout is an execution that reaches or exceeds configured execution time limits. | | ||
| `faas.throttles` | counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are timeouts and throttles represented in code as exceptions? What if we included exception.type
on the duration instrument? This would allow us to count each exception type (if using NRQL, something like SELECT count(faas.server.duration) FACET exception.type
)
This would also let us exclude timeouts from our duration metric, or to measure the duration of throttled requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinfoote Timeouts and throttles are both FaaS provider and runtime specific. For example, in AWS Lambda a timeout is detected within an invocation via a SIGTERM. It is also possible to synthesize a timeout in AWS Lambda via log messages Lambda writes when a timeout occurs.
Throttles are more complicated as they represent an exhaustion of concurrency and as a result an invocation does not occur in the first place. Metrics collection for throttles are also FaaS provider specific. Currently in AWS Lambda, for example, I believe the only way to get this metric is via CloudWatch metrics.
f94c184
to
06e2f37
Compare
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes! This PR is getting marked stale because we've sat on it for so long. I'm hoping these comments get discussion flowing again, so we can get this wrapped up.
| Name | Instrument | Units | Description | | ||
|------|------------|-------|-------------| | ||
| `faas.invoke_duration` | ValueRecorder | milliseconds | Measures the duration of the invocation, the time the function spent processing an event. | | ||
| `faas.init_duration` | ValueRecorder | milliseconds | Measures the duration of the function's initialization, such as a cold start | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tough question, and I'd love for someone with more experience with serverless architecture in general to weigh in.
I'm curious about how this is represented in tracing. I know that there's a top-level SERVER
span defined for faas invocations. Is there a nested span within that that represents the initialization? If so, how is that span identified as initialization time?
@thisthat Would love your feedback since you were involved with the equivalent trace spec. |
@justinfoote I removed the |
- ref: faas.invoked_name | ||
required: always | ||
- ref: faas.invoked_provider | ||
required: always | ||
- ref: faas.invoked_region | ||
required: always | ||
- ref: faas.coldstart | ||
required: always |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coldstart
is for an incoming lambda and invoked_*
are for an outgoing lambda. I don't think it is correct to request all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree about coldstarts. Can add a condition there.
For invoked_*, doesn't this apply for incoming as well? Maybe I misunderstand the trace spec here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding, the invoked_*
attributes specify which function is invoked from a client and not the execution of such a function. For a function that is being executed, the same attributes are available as resource attributes. Namely, the mapping is:
invoked_name -> faas.name
invoked_provider -> cloud.provider
invoked_region -> cloud.region
|
||
Naming conventions follow [FaaS Trace Semantics](/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/faas.md) wherever possible. | ||
|
||
| Name | Recommended | Notes and examples | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can generate this table with:
<!-- semconv faas-metrics -->
<!-- endsemconv -->
and then you can use the semantic convention generator :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...not just yet. I'll have a PR to add metric semantic convention generation soon, and then we can update all the metric semantic conventions with generated tables in a single PR later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table looks identical to a semantic convention table. How do you plan to change the render of metric? But I guess this discussion does not belong to this PR, I will wait for your PR to update the tool :)
| `faas.throttles` | Counter | number of throttles | number of invocation throttles. A throttle is an invocation rejected when concurrrency limits are reached or exceeded. | | ||
| `faas.concurrent_executions` | UpDownCounter | number of concurrent executions | The current number of function instances that are processing events. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about these two. I don't think an instrumented function can report these values. I see these as metrics that a backend can compute aggregating the data it receives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these metrics limited to only values that can be collected from within a function? Every FaaS platform that I researched has a way to extract these metrics, however in most cases it is via an API external to the function itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me, but if the metrics are collected using an API, I think they'll need to use an asynchronous instrument. I think maybe this means it should be a UpDownSumObserver
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. We should not specify only data that can be collected within a function! :)
Co-authored-by: Giovanni Liva <giovanni.liva@dynatrace.com>
required: always | ||
- ref: faas.invoked_provider | ||
required: always | ||
- ref: faas.invoked_region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it mandatory to have a region ? What is the target for that convention, public cloud 'faas' only or also 'faas' running on-premise or somewhere else (like OpenFaaS or Knative which probably also would fall into that realm, even when no 'faas' but just 'serverless'. Tbh, faas as metrics here feels too narrow as you can cover with this metrics also serverless deployments, that are not a 'faas' (e.g. not centered around functions as a programming model but e.g containers that are operated in a serverless manner).
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Closes #1013
Changes
This pull request attempts to define the FaaS (Function as a Service) metrics semantics.
Related issues #
#1013