Prometheus errors in EKS with default configuration #401

Closed · dmitryax opened this issue Mar 8, 2022 · 10 comments
Labels: bug (Something isn't working)

Comments

dmitryax (Contributor) commented Mar 8, 2022

A recent version of the Helm chart installed with the default configuration in EKS throws the following errors:

2022-03-08T04:28:31.722Z	error	prometheusexporter/prometheus.go:141	Could not get prometheus metrics	{"kind": "receiver", "name": "receiver_creator", "monitorType": "kubernetes-proxy", "error": "Get \"http://192.168.69.78:10249/metrics\": dial tcp 192.168.69.78:10249: connect: connection refused"}

k8s version: v1.21.5-eks-bc4871b
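For context, the "connection refused" is consistent with kube-proxy binding its metrics endpoint to localhost only, which makes port 10249 unreachable from the collector pod at the pod IP. An illustrative sketch of the relevant kube-proxy configuration (the ConfigMap name and layout here are assumptions for illustration, not taken from this cluster):

# Illustrative ConfigMap excerpt (assumed names): with metricsBindAddress
# pointing at 127.0.0.1, the /metrics endpoint on port 10249 refuses
# connections from other pods, matching the error above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy-config
  namespace: kube-system
data:
  config: |
    kind: KubeProxyConfiguration
    metricsBindAddress: 127.0.0.1:10249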

dmitryax (Contributor, Author) commented Mar 8, 2022

@jvoravong, can you please take a look? It looks like this is caused by the latest control plane changes. Maybe we need to disable kubernetes-proxy by default?

dmitryax added the bug label Mar 8, 2022
jvoravong (Contributor) commented:

@dmitryax Thanks for reporting this. I originally assumed that pod labels were unique enough, or that managed Kubernetes clusters didn't allow the k8s_observer to pick up on private control plane pods.

  • On EKS, this issue is present only with the proxy metrics.
  • On GKE, this issue is present only with the coredns metrics.
  • On AKS, this issue is not present.

It seems managed Kubernetes clusters are moving to expose more control plane metrics over time, which could cause more issues like this. The EKS team is working on exposing proxy metrics (aws/containers-roadmap#657).

Proposal:
We could use .Values.distribution and some helper methods to decide whether a control plane integration should be enabled by default. In the default values file we would leave the agent.controlPlaneMetrics subsection commented out and use the following logic (a sketch of such a helper follows the list below):

  • If agent.controlPlaneMetrics.{component}.enabled is undefined, we enable the control plane metrics based on the distribution in use. The openshift and plain kubernetes ("") distributions would have control plane metrics enabled by default, while all other distributions would not.
  • If agent.controlPlaneMetrics.{component}.enabled=true, the control plane component metrics are enabled.
  • If agent.controlPlaneMetrics.{component}.enabled=false, the control plane component metrics are disabled.
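A minimal sketch of such a helper (the helper name and exact value paths are assumptions for illustration, not the chart's actual API; it also assumes the agent section exists in values):

{{- define "controlPlaneMetricsEnabled" -}}
{{- /* Sketch: call with (dict "Values" .Values "component" "coredns") */ -}}
{{- $component := index (.Values.agent.controlPlaneMetrics | default dict) .component | default dict -}}
{{- if hasKey $component "enabled" -}}
{{- /* an explicit true/false always wins */ -}}
{{- $component.enabled -}}
{{- else -}}
{{- /* undefined: enable only for openshift and plain kubernetes ("") */ -}}
{{- or (eq .Values.distribution "openshift") (eq .Values.distribution "") -}}
{{- end -}}
{{- end -}}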

Thoughts?

dmitryax (Contributor, Author) commented Mar 8, 2022

I think we should keep enabled as is, without an undefined state. Otherwise it makes the logic hard to understand.

Why can't we figure out the pod label based on the distribution?

jvoravong (Contributor) commented Mar 8, 2022

We explicitly don't support EKS, GKE, and AKS control plane metrics at this time. Our control plane integrations fall back to a default discovery rule if the distribution is not openshift. The EKS proxy pods and GKE coredns pods happen to match our default discovery rules and are exposed enough for our k8s_observer to pick them up, but these pods don't actually serve metrics (at this time).

We could add discovery rules that never match a pod, specifically for the distributions that don't support control plane metrics, but I think this clutters the agent config file.
Example:
Before:

{{- if eq .Values.distribution "openshift" }}
rule: type == "pod" && namespace == "openshift-dns" && name contains "dns"
{{- else }}
rule: type == "pod" && labels["k8s-app"] == "kube-dns"
{{- end }}

After:

{{- if eq .Values.distribution "openshift" }}
rule: type == "pod" && namespace == "openshift-dns" && name contains "dns"
{{- else if eq .Values.distribution "" }}
rule: type == "pod" && labels["k8s-app"] == "kube-dns"
{{- else }}
rule: type == "pod" && labels["k8s-app"] == "kube-dns-is-not-supported"
{{- end }}

We could also just not include the control plane receivers when an unsupported distribution is used; that would probably be cleaner.

{{- if or (eq .Values.distribution "openshift") (eq .Values.distribution "") }}
# control plane receivers go here
{{- end }}

dmitryax (Contributor, Author) commented Mar 9, 2022

I would recommend just not setting up the control plane receivers for unsupported distributions, as you suggested in the last snippet.

jvoravong (Contributor) commented:

@lindhe This Helm chart does support collecting many metrics from AKS, but it specifically does not support collecting metrics from the AKS control plane, which is what _otel-agent.tpl#L79 is referring to. Managed Kubernetes services such as AKS do not allow the user to access the control plane for metric collection.

lindhe (Contributor) commented Mar 23, 2022

Alright, thanks for clarifying!

Is the first documentation I linked to out of date, then?

jvoravong (Contributor) commented:

Documentation was added for these changes; see advanced-configuration.md under the "Control plane metrics" section.

lindhe (Contributor) commented Mar 24, 2022

Hm... I'm sure it's just me missing the point here, but to me it looks like the documentation contradicts itself.

Here it says:

* Supported Distributions:
  * kubernetes 1.22 (kops created)
  * openshift v4.9
* Unsupported Distributions:
  * aks
  * eks
  * eks/fargate
  * gke
  * gke/autopilot

And here it says:

Use the `distribution` parameter to provide information about the underlying
Kubernetes deployment. This parameter allows the connector to automatically
scrape additional metadata. The supported options are:

- `aks` - Azure AKS
- `eks` - Amazon EKS
- `eks/fargate` - Amazon EKS with Fargate profiles
- `gke` - Google GKE / Standard mode
- `gke/autopilot` - Google GKE / Autopilot mode
- `openshift` - Red Hat OpenShift

Is "the distribution parameter" referring to something other than the distribution field in values.yaml?
