
cloud.resource_id and AWS ECS #677

Open
mmanciop opened this issue Jan 30, 2024 · 24 comments
@mmanciop

mmanciop commented Jan 30, 2024

I am wondering what the right way is to implement support for cloud.resource_id in SDK resource detectors for AWS ECS. Specifically, it's unclear to me whether it should be the Task ARN or the Container ARN. No ECS detector has implemented cloud.resource_id so far, so we have no precedent.

The semantic conventions state about cloud.resource_id:

Cloud provider-specific native identifier of the monitored cloud resource (e.g. an ARN on AWS, a fully qualified resource ID on Azure, a full resource name on GCP)

Given the fact that SDKs "live" in containers within the task, one could see the Container ARN as right.

However, I am leaning towards the Task ARN: the container is a non-independent, not self-contained part of a task, and we already have container.name and container.id, which could technically be used to reconstruct the Container ARN starting from the Task ARN, while the other direction is not possible.

One could also argue that we already set the Task ARN as the aws.ecs.task.arn resource attribute, but I find that a rather unconvincing argument.

For reference, using the metadata v4 examples by AWS:

Docker ID (Container ID): "ea32192c8553fbff06c9340478a2ff089b2bb5646fb718b4ee206641c9086d66"
Container ARN: "arn:aws:ecs:us-west-2:111122223333:container/0206b271-b33f-47ab-86c6-a0ba208a70a9"
Task ARN: "arn:aws:ecs:us-west-2:111122223333:task/default/8f03e41243824aea923aca126495f665"

The Docker ID is already stored by the various implementations of the ECS detectors under the container.id resource attribute.
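For illustration, here is a minimal Python sketch (standard library only; the helper names are mine, not from any existing detector) of how a detector could pull these identifiers from the metadata v4 endpoint, whose base URL ECS injects via the ECS_CONTAINER_METADATA_URI_V4 environment variable:

```python
import json
import os
import urllib.request


def extract_ecs_ids(container_md: dict, task_md: dict) -> dict:
    """Map ECS metadata v4 payloads onto resource attributes (sketch)."""
    return {
        "container.id": container_md.get("DockerId"),
        "aws.ecs.container.arn": container_md.get("ContainerARN"),
        "aws.ecs.task.arn": task_md.get("TaskARN"),
        # "Cluster" may be a bare name or a full ARN depending on launch type.
        "aws.ecs.cluster.arn": task_md.get("Cluster"),
    }


def detect() -> dict:
    # ECS injects the per-container metadata endpoint via this env var.
    base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(base) as r:
        container_md = json.load(r)
    with urllib.request.urlopen(base + "/task") as r:
        task_md = json.load(r)
    return extract_ecs_ids(container_md, task_md)
```

The open question in this issue is then simply which of the two ARNs above to copy into cloud.resource_id.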

What's the consensus? As I am anyhow touching up the various ECS detectors to support more cloud.* attributes (cloud.account.id, cloud.availability_zone and cloud.region specifically), I would have PRs up for all languages with ECS detectors (Go, .NET, Java, Python, PHP, Node.js) available in a matter of hours.

@mmanciop

@mhausenblas wdyt?

@andrew-rowson-lseg

I was originally thinking that the most granular ARN should go into cloud.resource_id, as it most precisely identifies the "Cloud Resource" that's publishing OTel data.

I certainly see the argument that, with the inclusion of container.name / container.id, we keep that granularity even whilst putting the Task ARN into cloud.resource_id.

One pause for thought is the phrase "monitored cloud resource" in the definition. Some SDKs will be collecting OTel data about the ECS cluster itself, which raises the question of whether cloud.resource_id should actually be the ECS Cluster ARN. I think the phrase "monitored cloud resource" fits better onto cluster or container resources than onto a "Task", which is simply a scheduling artefact.

Do we have any data from other container scheduling technologies (e.g. K8s) about how this is handled there? I assume there's a similar issue, where (for example) a container runs within a Pod, which is scheduled by a Deployment onto a specific Node of a Cluster. Which of those entities fits into cloud.resource_id?

I'll put some more thought into it.

@mmanciop

mmanciop commented Jan 31, 2024

Some SDKs will be collecting otel data about the ECS cluster itself

Are you thinking of the way the Datadog agent is scheduled on ECS to monitor ECS? That is IMO not a resource-detector use case, as the Datadog agent effectively acts as a "receiver" for telemetry about something external. In that case, it's the receiver's job to annotate the aws.ecs.*, container.*, cloud.*, etc. resource attributes.

(Besides, all ECS detectors out there already set the aws.ecs.cluster.arn resource attribute, so that data is readily available somewhere; just not in a way that is easy to "group by" across container orchestrations.)

Do we have any data from other container scheduling technologies (e.g. K8s) about how this is handled there?

Not really. In general, pods on K8s can learn far less about their environment than tasks on ECS: there is effectively no equivalent to the ECS metadata endpoint. It is adventurous even to be sure you are running inside a pod on K8s. Unless the pod spec propagates data from the Downward API as env vars (which is IMO not a general enough mechanism to support out of the box in a detector), you won't even know for sure what scheduled the pod (a Deployment? a "naked" ReplicaSet? a StatefulSet? a Job? a CronJob? a standalone Pod?). Knowing the scheduling resource requires inspecting the owner reference and, to the best of my knowledge, that data is not available inside pods unless explicitly injected through the pod spec (and, again, that is not a general enough mechanism to support out of the box in a detector).

And it's even worse than that: a K8s cluster has no notion of identity (not even a name, much less a unique one, although people set the UID of the kube-system namespace as k8s.cluster.uid). To get the EKS ARN for the aws.eks.cluster.arn resource attribute, one needs to do some very dirty work, looking up the EC2 instance IDs of nodes or OIDC providers and correlating them with data from other AWS APIs (Lumigo did it first AFAIK, Splunk followed).

On K8S, the addition of labels like cloud.resource_id is the kind of stuff that a processor like k8sattributeprocessor does, and AFAIK it has no support to add cloud.* attributes, only k8s.* and container.* ones.

I am not aware of any container orchestration platform currently better supported than K8S or EKS by OTel resource detectors btw.

@mmanciop

mmanciop commented Jan 31, 2024

In terms of use cases, I think the most important thing to consider here is "group by" scenarios; specifically: what is the most useful abstraction level for cloud.resource_id to be grouped by when comparing with other cloud resources?

The Cluster ARN IMO fails this test, as Fargate tasks on the same cluster are pretty unrelated to one another in terms of performance. The Task ARN and Container ARN both pass it, and I personally still like the Task ARN more because of the resource sharing that goes on between containers in the same task.

Also, when comparing with the other types of cloud resources where detectors do implement cloud.resource_id (I am personally aware only of the AWS EC2 detectors and the deprecation of faas.id in favour of cloud.resource_id), it seems to me that the commonality is that they are "units of deployment": you create a VM in the cloud, or deploy a function. In ECS, the Task is the corresponding unit of deployment. The same logic would of course apply to Kubernetes Pods, where their ARN on EKS, or Pod UID elsewhere, would be the "natural" (and I know I am biased here :D) cloud.resource_id value.

@andrew-rowson-lseg

In ECS, the Task is the corresponding unit of deployment.

I like this, and find it convincing. If this is the case, it might be worth updating the definition of cloud.resource_id to be a bit more explicit?

@kaiyan-sheng

When specifying this field cloud.resource_id, I will go look at what is the resource that these monitoring metrics are describing. For example, if I'm looking at the CPU utilization for a cluster, then the cluster ARN should be the cloud.resource_id. But if I’m looking at CPU utilization for a specific service, then the service ARN should be the cloud.resource_id. I guess I'm going the most-granular ARN route that @andrew-rowson-lseg mentioned. If I rely on AWS CloudWatch for monitoring metrics, then the cloud.resource_id would be the most granular ARN/identifier from the metric dimensions.

@arminru arminru changed the title cloud.resource_id and ECS cloud.resource_id and AWS ECS Feb 5, 2024
@mmanciop

mmanciop commented Feb 6, 2024

When specifying this field cloud.resource_id, I will go look at what is the resource that these monitoring metrics are describing. For example, if I'm looking at the CPU utilization for a cluster, then the cluster ARN should be the cloud.resource_id. But if I’m looking at CPU utilization for a specific service, then the service ARN should be the cloud.resource_id. I guess I'm going the most-granular ARN route that @andrew-rowson-lseg mentioned. If I rely on AWS CloudWatch for monitoring metrics, then the cloud.resource_id would be the most granular ARN/identifier from the metric dimensions.

Well, that is pretty much the benefit that having cloud.resource_id implemented consistently as a resource attribute would give you. AFAIK, it has rarely been implemented so far (I checked all resource detectors for SDKs and found it used only in the AWS ones), and I am not aware of it ever being used in metrics yet.

@mmanciop

mmanciop commented Feb 7, 2024

@jsuereth in the Semantic Conventions meeting of Feb. 5th, 2024, you said that the cloud.resource_id should strive to represent the highest logical entity from the point of view of the user that has a unique identifier (sorry I forgot the specific quote, but this is how I internalised it). But also it seemed we agreed that that was likely the Container ARN in this case, and that it did not feel quite right. I think you mentioned bringing it up with @tigrannajaryan. I'd love to be involved in that discussion, if possible :-)

@tigrannajaryan

I wasn't in the semconv meeting, but I am guessing @jsuereth refers to the following topic that he and I have been exploring for a while.

What is the purpose of recording cloud.resource_id? Typically the purpose is to be able to uniquely identify the entity that is producing the telemetry (if that's not the purpose, ignore this entire comment). Take for example "Container Image" vs "Container Instance". The image may be run multiple times, each execution is an instance and each instance can produce its own telemetry, so we choose "Container Instance" as the entity we record attributes in the Resource.

Similarly, a "Task" is a definition that can be executed multiple times, each execution becomes a running container instance. I cannot associate telemetry with a "Task", that would create ambiguity, I would not know which running container instance that telemetry is associated with. The right choice I think is to record attributes which uniquely identify the running container instance (Container ARN).

This is generally the litmus test I use when deciding which entity to record in the Resource: if telemetry can be associated with that entity without also being required to be associated with something else then it is the Entity we want.

Applying this litmus test, here are the entities I think we want to choose to associate with telemetry:

  • Container Instance
  • Host
  • Process
  • Pod
  • K8s Cluster
  • etc

Entities that can't be directly associated with telemetry without additional qualification:

  • Container Image
  • Task Definition
  • Operating System
  • etc.

This latter list is an example of "things" that are not enough for identification purposes, so we should prefer the first list, which does uniquely identify the telemetry's source.

@mmanciop

mmanciop commented Feb 7, 2024

@tigrannajaryan the problem we have here is that on ECS there are TWO valid candidates for resource detectors in SDK-instrumented apps: ECS Task and ECS Container. Both are theoretically valid according to your litmus test. And we need to pick one. I have a favorite (see above), but in any case the definition of cloud.resource_id could use disambiguation.

@tigrannajaryan

the problem we have here is that on ECS there are TWO valid candidates for resource detectors in SDK-instrumented apps: ECS Task and ECS Container. Both are theoretically valid according to your litmus test

I may be misunderstanding, but isn't ECS Task the "definition" of what to run? You can run many container instances using the same task definition, right? In that case, ECS Task is not unique enough to associate, for example, a metric data point with it. So I don't think ECS Task passes the litmus test; you would additionally need to say which ECS Container instance the metric data point is coming from.

Please correct me if I am wrong on what ECS Task means, this is based on just a cursory read of ECS docs.

@basti1302

basti1302 commented Feb 7, 2024

I may be misunderstanding, but isn't ECS Task the "definition" of what to run?

I think you are confusing the task definition with an actual running task. The discussion is not about using the task definition's ARN, the discussion is about running task instances vs containers.

See here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definitions.html

After you create a task definition, you can run the task definition as a task or a service.

A task is the instantiation of a task definition within a cluster. After you create a task definition for your application within Amazon ECS, you can specify the number of tasks to run on your cluster.

(emphasis mine)

@tigrannajaryan

tigrannajaryan commented Feb 8, 2024

I think you are confusing the task definition with an actual running task.

Indeed I am. I was incorrectly assuming that running a task definition creates a container instance. I have never used ECS so don't know the details of how it works.

Can you clarify what the difference is between a task instance and a container instance and how they are related?

@mmanciop

mmanciop commented Feb 8, 2024

To draw a comparison between K8s and ECS:

ECS Task -> K8S Pod
ECS Container -> K8S Container

But not only Tasks get their own ARNs, Containers get their own too.

The overwhelming majority of the ECS API deals with tasks. They are scheduled either singly with RunTask (used also by other ECS-based services like AWS Batch) or by the ECS equivalent of a K8s Deployment: the Service.

Even APIs like SubmitContainerStateChange use the Task ARN and the container name.

I honestly don’t know why ECS containers get their own ARN, in my experience it feels useless from a user standpoint. But it is there, and it is making my life hard :-)

@andrew-rowson-lseg

Typically the purpose is to be able to uniquely identify the entity that is producing the telemetry

I suspect the problem here is that "entity" is vague. In a process-SDK context, you could argue that the "entity producing the telemetry" is actually the process running within the container that's part of a running task, but I don't think we're advocating using any form of process identifier for this field.

K8s is a useful comparison here though, because I think the same problem exists. For example a deployment can have multiple Pods, which can have multiple containers, which can have multiple processes. What's the "Resource Id" of the entity that's emitting telemetry?

I honestly don’t know why ECS containers get their own ARN, in my experience it feels useless from a user standpoint. But it is there, and it is making my life hard :-)

I think it's because there needs to be some way of uniquely identifying a container within a running task (which, as you say, can have an arbitrary number of running containers that may vary during the task's lifecycle).

--
I have another proposal:

Part of this pickle is that we're trying to work out which value (of many) to put into a scalar field. What if the field were actually a list (cloud.resource.ids), with the contents determined by the context?

E.g. In an ECS context, cloud.resource.ids would contain the cluster ARN, the Task ARN, the container ARN etc. This would allow both correlation of telemetry that belonged to the same container, and the grouping of telemetry on the same higher-level resource (e.g. task, cluster etc.)

@mmanciop

mmanciop commented Feb 8, 2024

I honestly don’t know why ECS containers get their own ARN, in my experience it feels useless from a user standpoint. But it is there, and it is making my life hard :-)

I think it's because there needs to be some way of uniquely identifying a container within a running task (which, as you say, can have an arbitrary number of running containers that may vary during the task's lifecycle).

Task ARN + container id (as in DockerId) is unique enough. Container name is not unique enough, I think, because, IIRC, containers can restart.

@mmanciop

mmanciop commented Feb 8, 2024

Part of this pickle is that we're trying to work out which value (of many) to put into a scalar field. What if the field were actually a list (cloud.resource.ids), with the contents determined by the context?

E.g. In an ECS context, cloud.resource.ids would contain the cluster ARN, the Task ARN, the container ARN etc. This would allow both correlation of telemetry that belonged to the same container, and the grouping of telemetry on the same higher-level resource (e.g. task, cluster etc.)

Lists are typically pretty bad to work with in terms of grouping and filtering in most query languages. I would really like to avoid that if at all possible.

@tigrannajaryan

Thanks for explaining how ECS works. K8s analogy helps.

Pod, Task, Container Instance and Process are all potentially valid telemetry producing entities. The choice of which entity to record in the Resource depends on what telemetry you are producing. This is where I apply the litmus test, to make this choice.

For example, if you are measuring the CPU usage of the entire Pod (or the entire Task), then you want to record an identifier of the Pod (or of the Task), so you would put k8s.pod.uid in the Resource (or the Task ARN in cloud.resource_id). If you are measuring the CPU usage of just one container in the Pod, you would put both k8s.pod.uid and container.name in the Resource so that you can identify the Container Instance (or put the Container ARN in cloud.resource_id). Similarly, if you want to record the Process CPU usage, you would put k8s.pod.uid, container.name and process.pid in the Resource to uniquely identify the Process (for ECS it would be the Container ARN in cloud.resource_id plus process.pid).
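As an illustration (mine, not from the thread; all attribute values are hypothetical placeholders), the litmus test amounts to picking the minimal attribute set that uniquely identifies the entity being measured:

```python
def identifying_attributes(entity: str) -> dict:
    """Minimal attribute set that uniquely identifies each measured entity,
    per the litmus test above. All values are hypothetical placeholders."""
    pod = {"k8s.pod.uid": "11111111-2222-3333-4444-555555555555"}
    container = {**pod, "container.name": "app"}
    process = {**container, "process.pid": 42}
    return {"pod": pod, "container": container, "process": process}[entity]
```

Each step down the hierarchy adds exactly one qualifying attribute; a single opaque identifier (like a Container ARN) can replace the whole prefix.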

Which set of attributes to choose (or which value to put in cloud.resource_id) depends on what entity you want to associate the telemetry with.

@mmanciop

mmanciop commented Feb 8, 2024

@tigrannajaryan so, since this discussion is grounded in ECS resource detectors (as running in SDKs within applications inside containers inside tasks), would you use the Container ARN in cloud.resource_id, as opposed to having the Task ARN in cloud.resource_id, the Docker container ID in container.id, the container name in container.name and the PID in process.pid?

@tigrannajaryan

Can one Task contain multiple Containers? If that is the case then we must include whatever data is necessary to uniquely identify the Container in telemetry. If we want to use just a single attribute then we seem to have no choice but to record Container ARN in cloud.resource_id otherwise it is impossible to differentiate telemetry coming from each container. Alternatively we can record Task ARN in cloud.resource_id but in that case we are also required to record an additional attribute to identify the container in the Task. We can use container.id or container.name as that second attribute (assuming they are unique within the Task).

Both approaches are valid (using just cloud.resource_id equal to Container ARN or a pair of cloud.resource_id + container.id attributes). It should be valid to use any set of attributes that uniquely identifies the entity we want to describe and it is normal that there may be more than one way to describe that same entity.

The question then is which set of attributes to choose. One argument I can bring is that using a smaller number of attributes should be preferable, which leads to using cloud.resource_id equal to Container ARN.

Additionally, if we expect multiple processes in one Container, we must also include process.pid in the Resource, otherwise the telemetry will be ambiguous.

So, in this particular use case of an application instrumented by an OTel SDK, running as a process inside a Container in an ECS Task, I think the right set of attributes is this:

  • cloud.resource_id set to the Container ARN
  • process.pid set to the process PID.
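A minimal sketch (in Python; the function name is mine, and the Container ARN is assumed to have already been fetched from the metadata endpoint) of what that final attribute set could look like in a detector:

```python
import os


def ecs_sdk_resource_attributes(container_arn: str) -> dict:
    """Attribute set discussed above: Container ARN + process PID (sketch)."""
    return {
        "cloud.resource_id": container_arn,
        "process.pid": os.getpid(),
    }
```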

@mmanciop

mmanciop commented Feb 8, 2024

A task can contain multiple containers (though it's somewhat less common than in K8s, because ECS has nothing comparable to admission controllers' mutating webhooks, AFAIK).

We can use container.id or container.name as that second attribute (assuming they are unique within the Task).

container.id is already implemented in every resource detector for ECS that I know of. container.name, as in the name set in the Task definition, is not implemented anywhere as it does not seem to be available through the ECS metadata endpoint.

The question then is which set of attributes to choose. One argument I can bring is that using a smaller number of attributes should be preferable, which leads to using cloud.resource_id equal to Container ARN.

Alright. Anyhow, both the Task and Container ARNs are available through the aws.ecs.* semantic conventions and implemented by all ECS resource detectors, so all kinds of filtering and grouping are alternatively possible within ECS telemetry using those attributes.

I’ll open PRs accordingly against the detectors.

@tigrannajaryan

One other related thing that we should probably discuss is how detectors should affect the Service, particularly the service.instance.id attribute.

There is an open PR that defines an algorithm for generating the service.instance.id and it essentially tries to detect where the service is running. That algorithm is a sort-of built-in, default detector.

I think other detectors (like the ECS detector) should also try to populate service.instance.id with a relevant value (if the default Resource is used, which produces a Service entity). The ECS detector, for example, could set service.instance.id to something like $ContainerARN:$ProcessID.
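A sketch of that composition (hypothetical helper; the ARN-plus-PID scheme is just the suggestion above, not an agreed convention):

```python
def service_instance_id(container_arn: str, pid: int) -> str:
    """Compose a service.instance.id from the container identity and PID."""
    return f"{container_arn}:{pid}"
```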

Interestingly, some resource detectors in the Collector already do this, for example the Elastic Beanstalk detector.

I think we should open a separate issue and discuss it separately.

@mmanciop

@tigrannajaryan @jsuereth as far as I am concerned, this question has been answered. Should I close the issue, or do you want to keep it around to add clarifications to the wording of cloud.resource_id?

@mmanciop

Btw all the PRs are open. I wish the AWS people paid more attention to PRs against their detectors :-(
