
[processor/resourcedetection] The ec2 detector does not retry on initial failure #35936

Open
atoulme opened this issue Oct 22, 2024 · 11 comments
Labels
bug (Something isn't working), processor/resourcedetection (Resource detection processor)

Comments

@atoulme
Contributor

atoulme commented Oct 22, 2024

Component(s)

processor/resourcedetection/internal/aws/ec2

What happened?

Description

On collector startup, the ec2 resource detector attempts to connect to the AWS metadata endpoint to retrieve EC2 instance information. If that connection fails, the detector gives up. In some cases the collector starts before the machine's network is ready, so it fails to retrieve the metadata and never recovers.

Steps to Reproduce

1. Shut down the network interface of the AWS EC2 instance.
2. Start the collector on the EC2 instance (a minimal processor configuration is sketched below).
3. Notice that the ec2 detector reports an error in the logs.
4. Start the network interface of the EC2 instance again.
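
For context, a minimal resourcedetection processor configuration that exercises the ec2 detector might look like the following (receivers, exporters, and pipelines are omitted, and the `resourcedetection/ec2` name is just a convention):

```yaml
processors:
  resourcedetection/ec2:
    # Only the ec2 detector is enabled, so the resource attributes
    # depend entirely on the AWS metadata endpoint being reachable
    # when the collector starts.
    detectors: ["ec2"]
```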

Expected Result

The collector should eventually add resource attributes with ec2 metadata

Actual Result

The collector never changes the output

Collector version

0.112.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@atoulme added the "bug" and "needs triage" labels and removed the "needs triage" label on Oct 22, 2024
@github-actions bot added the "processor/resourcedetection" label on Oct 22, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

Is this specific to ec2, or does this apply to other resource detectors as well?

@dmitryax
Member

We have only seen this with ec2, but I assume it could potentially affect other detectors as well.

The problem is that we do not assume every configured detector will provide data: detectors that are not applicable are allowed to fail silently at start time.

I believe the solution would be an additional backoff-retry configuration option. When enabled, the processor would keep retrying. Maybe we should also block the collector from starting until detection succeeds or the retries are exhausted.
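
As a rough illustration only (these option names are hypothetical and not what was ultimately implemented; see the PRs later in this thread), such an opt-in retry setting could look like:

```yaml
processors:
  resourcedetection/ec2:
    detectors: ["ec2"]
    ec2:
      # Hypothetical option names, shown only to illustrate the idea.
      retry:
        enabled: true          # keep retrying instead of failing silently
        max_elapsed_time: 5m   # stop retrying (or fail startup) after this long
```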

@atoulme
Contributor Author

atoulme commented Nov 3, 2024

Specifically, the code triggered in this case is:

```go
// The detector swallows the error and returns an empty resource,
// so a transient metadata failure is never retried.
if _, err = d.metadataProvider.InstanceID(ctx); err != nil {
	d.logger.Debug("EC2 metadata unavailable", zap.Error(err))
	return pcommon.NewResource(), "", nil
}
```

Contributor

github-actions bot commented Jan 3, 2025

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 3, 2025
@moritz31

We are facing the same issue in our EKS cluster. Specifically, when we update our nodes, the collector fails to get the EC2 metadata, and therefore a lot of metadata is missing from our traces. Is there already a solution in progress for this issue?

@github-actions github-actions bot removed the Stale label Jan 11, 2025
@atoulme
Contributor Author

atoulme commented Jan 14, 2025

No, this issue is still open. Contributions are welcome.

@atoulme
Contributor Author

atoulme commented Jan 23, 2025

I tried to address this issue with #37426, but that does not appear to be a valid approach, given that we already have ample means to configure the AWS SDK Go client to behave better.

The AWS SDK Go client retries on connection issues, by default up to 3 times with a 20s timeout, which might be too low for some environments.

The workaround is therefore to allow configuring the AWS SDK Go client used by the ec2 detector so it can be more lenient. I will look into this.

Separately, we should make the behavior of the ec2 detector consistent. It should not be allowed to silently fail and return an empty resource; instead, the processor should decide, based on configuration, whether to fail to start or to continue starting when the detector cannot complete its detection.

@atoulme
Contributor Author

atoulme commented Jan 23, 2025

I have opened #37451 to allow configuring the retry behavior of the AWS SDK metadata client used by the ec2 detector. The change adds two config entries to the ec2 detector whose defaults match the defaults offered by the AWS SDK.

@atoulme
Contributor Author

atoulme commented Jan 23, 2025

I have opened a second PR to tackle what I see as the inconsistent behavior of the EC2 detector: it should fail to start if it cannot perform its detection work. However, changing that behavior outright would be a breaking change, so the PR instead adds a small, opt-in change that lets the ec2 detector fail on start when metadata detection fails.

See #37453.
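
Based on the #37453 commit message quoted further down in this thread, opting in to the stricter behavior would look roughly like this (a sketch; I am assuming the option sits under the ec2 detector's config block alongside the retry settings):

```yaml
processors:
  resourcedetection/ec2:
    detectors: ["ec2"]
    ec2:
      # Opt in: fail collector startup if EC2 metadata cannot be fetched.
      fail_on_missing_metadata: true
```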

@atoulme
Contributor Author

atoulme commented Jan 29, 2025

@dashpole @Aneurysm9 would you please review the 2 PRs above?

andrzej-stencel pushed a commit that referenced this issue Jan 30, 2025
#### Description
In some cases, you might need to change the behavior of the AWS metadata client from the [standard retryer](https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/configure-retries-timeouts.html).

By default, the client retries 3 times with a max backoff delay of 20s.

We offer a limited set of options to override those defaults, so that you can, for example, set the client to retry 10 times with a backoff of up to 5 minutes:
```yaml
processors:
  resourcedetection/ec2:
    detectors: ["ec2"]
    ec2:
      max_attempts: 10
      max_backoff: 5m
```

#### Link to tracking issue
Relates to #35936

#### Testing
No testing was performed.
songy23 pushed a commit that referenced this issue Feb 3, 2025
…ector (#37453)

#### Description
Add `fail_on_missing_metadata` option on EC2 detector

If the EC2 metadata endpoint is unavailable, the EC2 detector ignores the error by default.
Setting `fail_on_missing_metadata` to true on the detector now raises the error explicitly, which stops the collector from starting.

#### Link to tracking issue
Relates to #35936
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this issue Feb 8, 2025
…lemetry#37451)

chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this issue Feb 8, 2025
…ector (open-telemetry#37453)
