[eks] [bug]: getting alerts for v1.metrics.eks.amazonaws.com/default #2479

Open
ethangeralt opened this issue Nov 27, 2024 · 4 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@ethangeralt

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
What do you want us to build?

We are getting constant alerts for v1.metrics.eks.amazonaws.com. They arrive in the thousands, which creates a lot of confusion.

For example: "Kubernetes aggregated API v1.metrics.eks.amazonaws.com/default has reported errors. It has appeared unavailable 12.28k times averaged over the past 10m." The alert fires seemingly at random, both during upgrades and when there is no activity on the cluster.

Which service(s) is this request for?
This could be Fargate, ECS, EKS, ECR

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

Are you currently working around this issue?
How are you currently solving this problem?

Since this does not originate from the application side, is there any way to suppress this alarm?

Additional context
Anything else we should know?

This makes it hard to tell whether there is a problem in the control plane or in the customer application. Will it impact anything?

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

@ethangeralt ethangeralt added the Proposed Community submitted issue label Nov 27, 2024
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Nov 27, 2024
@islishude

I have had the same alert since I upgraded EKS from 1.30 to 1.31.

By the way, I have installed the latest Prometheus components.

@bbednarek

We are facing the same issue, but with EKS 1.29.

@guillaumebernard84

Hello, we have the same alerts with EKS 1.28

@haoranleo

haoranleo commented Dec 13, 2024

Hi folks! This is a known issue and EKS is currently working on the fix. The fix is already in effect for new clusters, which means new clusters should no longer see this error during cluster upgrades (or during the instance refreshes that EKS performs regularly, which are invisible to you). For existing clusters, the estimated completion date (ECD) for receiving the fix is 01/2025.


More details on the issue:

EKS recently launched a new feature to fetch additional control plane metrics in Prometheus-compatible format from Kubernetes controllers like kube-controller-manager and kube-scheduler. As part of this feature, EKS introduced a new APIService object, v1.metrics.eks.amazonaws.com, on EKS clusters. This APIService is used to retrieve metrics from these controllers. For example, you could scrape the metrics from kube-controller-manager directly using:

kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics
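
A minimal Python sketch of the same scrape, for readers who want to inspect the output programmatically. It assumes `kubectl` is on the PATH and already configured for the cluster. The helper names (`fetch_kcm_metrics`, `count_samples`) are hypothetical, and the parsing is a deliberately simple reading of the Prometheus text format rather than a full parser.

```python
import re
import subprocess
from collections import Counter

# Path taken from the kubectl command above.
KCM_METRICS_PATH = "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"

# A metric name at the start of a sample line (Prometheus text format).
_METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*")


def fetch_kcm_metrics(path: str = KCM_METRICS_PATH) -> str:
    """Run `kubectl get --raw=<path>` and return the raw metrics text."""
    result = subprocess.run(
        ["kubectl", "get", f"--raw={path}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def count_samples(metrics_text: str) -> Counter:
    """Count sample lines per metric name, skipping comments and blank lines."""
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _METRIC_NAME.match(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

For instance, `count_samples(fetch_kcm_metrics()).most_common(10)` would list the ten metric names with the most samples, provided the scrape succeeds.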

EKS clusters can occasionally see these error messages in kube-apiserver log related to the unavailability of the APIService v1.metrics.eks.amazonaws.com:

available_controller.go:406] "changing APIService availability" name="v1.metrics.eks.amazonaws.com" oldStatus="True" newStatus="False" message="failing or missing response from xxx: connect: connection refused" reason="FailedDiscoveryCheck" 

controller.go:146] Error updating APIService "v1.metrics.eks.amazonaws.com" with err: failed to download v1.metrics.eks.amazonaws.com: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
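
If these lines clutter log searches, one option is to tag them before triage. The sketch below is a simple filter written under assumptions, not an EKS tool: the function name is hypothetical, and the match strings come only from the two sample messages above, so other wordings would not be caught.

```python
APISERVICE_NAME = "v1.metrics.eks.amazonaws.com"

# Substrings copied from the two sample log messages above.
KNOWN_MARKERS = (
    'reason="FailedDiscoveryCheck"',
    "failed to retrieve openAPI spec",
)


def is_metrics_apiservice_noise(log_line: str) -> bool:
    """Return True if the line matches one of the two sample messages for
    the v1.metrics.eks.amazonaws.com APIService."""
    if APISERVICE_NAME not in log_line:
        return False
    return any(marker in log_line for marker in KNOWN_MARKERS)
```

A match only identifies the message text. It does not tell you whether a control plane update was in progress at the time, so persistent matches are still worth a closer look.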

These log messages are false positives generated during an EKS Kubernetes control plane update. As the cluster update process creates new control plane instances, the API server regularly checks whether this component is available. While the metrics server component is not yet ready, the API server generates these log messages.

No action is needed on your end. This should not have any functional impact, and you can safely ignore these error messages. If you notice any availability drop for requests scraping the control plane metrics, please don't hesitate to reach out to EKS support.

Action items EKS is taking to avoid confusion:
While the unavailability error does not impact anything, the EKS team is working to suppress the unavailability errors from kube-apiserver to avoid confusion. Once the change is deployed, you should not see any unavailability errors from newly launched instances or instances being terminated. ETA: 01/30/2025.


More context on APIService availability check:
The kube-apiserver periodically checks the status of every registered APIService and updates its availability status. In the EKS setup, the kube-apiserver on each control plane instance checks the availability of the APIService v1.metrics.eks.amazonaws.com by reaching the backend component running on the same control plane instance. The backend component is configured to start after kube-apiserver is up and to stop before kube-apiserver is shut down. So during control plane instance launch/termination events, there is a window when the backend component is unavailable while kube-apiserver is running.

In addition, every kube-apiserver writes the availability of the APIService to a single Kubernetes object that is shared among all kube-apiserver instances. As a result, the APIService is marked unavailable if any kube-apiserver marks it unavailable. The functionality of the APIService is not impacted because newly launched instances and instances being terminated are not behind the cluster load balancer, so they won't actually take any traffic from your requests.
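
The gap between the reported status and the actual serving path can be illustrated with a small Python model. This is a simplified sketch of the behavior described above, not kube-apiserver code; the function names and the instance names (`cp-1`, `cp-2`, `cp-3`) are hypothetical.

```python
from __future__ import annotations


def shared_apiservice_available(local_checks: dict[str, bool]) -> bool:
    """Simplified model: the shared status is available only if every
    kube-apiserver's local check succeeded."""
    return all(local_checks.values())


def serving_checks(local_checks: dict[str, bool], behind_lb: set[str]) -> dict[str, bool]:
    """Keep only the instances that are behind the cluster load balancer."""
    return {name: ok for name, ok in local_checks.items() if name in behind_lb}


# Hypothetical instance names: cp-3 has just launched, so its backend
# component is not up yet and it is not behind the load balancer.
checks = {"cp-1": True, "cp-2": True, "cp-3": False}
behind_lb = {"cp-1", "cp-2"}

reported = shared_apiservice_available(checks)
serving_ok = all(serving_checks(checks, behind_lb).values())
print(f"reported available: {reported}, serving instances healthy: {serving_ok}")
# prints "reported available: False, serving instances healthy: True"
```

In this model a single launching or terminating instance is enough to flip the shared status, even though every instance that actually receives requests is healthy.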
