Input IDs are not unique at runtime when dynamic providers are used #1751

Closed
belimawr opened this issue Nov 17, 2022 · 11 comments · Fixed by #1866
Labels
blocker, bug, Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent-Data-Plane, v8.6.0

Comments

@belimawr
Contributor

Description

When dynamic providers are used in an integration, they end up creating 'multiple instances' of the input defined in the policy, one for each value the variable resolves to. This leads to duplicated input IDs at runtime, which causes the Elastic Agent to get stuck in an unhealthy state.

Example:

  1. The Kubernetes integration generates an input like this:
- id: filestream-container-logs-78ab239f-a490-4d13-8c46-ef0f14844f88
  name: kubernetes-1
  revision: 1
  type: filestream
  use_output: default
  meta:
    package:
      name: kubernetes
      version: 1.27.1
  data_stream:
    namespace: default
  package_policy_id: 78ab239f-a490-4d13-8c46-ef0f14844f88
  streams:
    - id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
      data_stream:
        dataset: kubernetes.container_logs
        type: logs
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: all
            format: auto
  2. For each kubernetes.container.id, one instance of this whole input will be rendered, looking like this (focus on the streams array):
---
data_stream:
  namespace: default
name: kubernetes-1
revision: 1
type: filestream
id: filestream-container-logs-78ab239f-a490-4d13-8c46-ef0f14844f88
meta:
  package:
    name: kubernetes
    version: 1.27.1
package_policy_id: 78ab239f-a490-4d13-8c46-ef0f14844f88
streams:
- data_stream:
    dataset: kubernetes.container_logs
    type: logs
  id: kubernetes-container-logs-kube-controller-manager-kind-control-plane-3574f560581101546c1b9448b9396e13f0abbd08e7e59f7ae711427247e11873
  parsers:
  - container:
      format: auto
      stream: all
  paths:
  - "/var/log/containers/*3574f560581101546c1b9448b9396e13f0abbd08e7e59f7ae711427247e11873.log"
  prospector.scanner.symlinks: true
  3. The input ID is then no longer unique because there are 'multiple instances' of this input; however, the streams[].id is unique (Filebeat requires that in order to use the Filestream input, as in the example).

  4. The problem arises because the Elastic Agent tries to ensure the input ID is unique at runtime:

    // The duplicate check is applied to the top-level input ID:
    if hasDuplicate(outputsMap, id) {
        return nil, fmt.Errorf("invalid 'inputs.%d.id', has a duplicate id %q (id is required to be unique)", idx, id)
    }

  5. The PR introducing this change mentions that it is required under V2.

  6. I'm not quite sure about the implications of either removing this check or changing it to ensure that inputs[].streams[].id is unique (see the sketch below).
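
For illustration only, here is a minimal sketch of what a per-stream uniqueness check could look like. The input/stream types and the checkStreamIDs function below are made up for this example and are not the actual Elastic Agent code:

package main

import "fmt"

// Hypothetical, simplified model of the rendered policy for this sketch.
type stream struct {
    ID string
}

type input struct {
    ID      string
    Streams []stream
}

// checkStreamIDs tolerates duplicated top-level input IDs (one per rendered
// dynamic-provider instance) but requires every inputs[].streams[].id to be unique.
func checkStreamIDs(inputs []input) error {
    seen := map[string]int{} // stream ID -> index of the input where it was first seen
    for idx, inp := range inputs {
        for _, st := range inp.Streams {
            if firstIdx, ok := seen[st.ID]; ok {
                return fmt.Errorf("invalid 'inputs.%d.streams': duplicate stream id %q (already used by inputs.%d)", idx, st.ID, firstIdx)
            }
            seen[st.ID] = idx
        }
    }
    return nil
}

func main() {
    inputs := []input{
        {
            ID:      "filestream-container-logs-78ab239f-a490-4d13-8c46-ef0f14844f88",
            Streams: []stream{{ID: "kubernetes-container-logs-pod-a-aaa"}},
        },
        {
            // Same input ID, rendered for a different container.
            ID:      "filestream-container-logs-78ab239f-a490-4d13-8c46-ef0f14844f88",
            Streams: []stream{{ID: "kubernetes-container-logs-pod-b-bbb"}},
        },
    }
    fmt.Println(checkStreamIDs(inputs)) // prints <nil>: stream IDs are unique
}

In this sketch, duplicated top-level input IDs coming from the same rendered input are accepted, and only the stream IDs are required to be unique.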

Steps to Reproduce

  1. Create a local k8s cluster with kind (kind create cluster --config kind-cluster.yaml) using this kind-cluster.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
  2. Bring up the Elastic Stack version 8.6.0-SNAPSHOT with elastic-package:

elastic-package stack up --version=8.6.0-SNAPSHOT -v -d

  3. Connect both Docker networks with the following script:
#!/bin/bash

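# Attach every running kind node container to the elastic-package stack's Docker network.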
for i in $(docker ps | grep kindest | awk '{ print $1 }'); do
    docker network connect elastic-package-stack_default   "$i"
done
  4. Create a policy with the Kubernetes integration and follow the UI instructions for running the agent on k8s.
  5. You will see agents not collecting anything and reporting the following error:
{"log.level":"error","@timestamp":"2022-11-17T16:23:29.316Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":544},"message":"failed to render components: invalid 'inputs.8.id', has a duplicate id \"filestream-container-logs-78ab239f-a490-4d13-8c46-ef0f14844f88\" (id is required to be unique)","ecs.version":"1.6.0"}

Related issues

@ChrsMark
Member

FYI, even if I disable the container_logs data_stream, the Agent is not shipping data and looks quite unstable.
Some logs I could collect:

{"log.level":"info","@timestamp":"2022-11-22T07:27:32.158Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":325},"message":"Existing component state changed","component":{"id":"kubernetes/metrics-default","state":"Healthy","message":"Healthy: communicating with pid '70'","inputs":[{"id":"kubernetes/metrics-default-kubernetes/metrics-kube-apiserver-e5504271-ede5-4717-a2cd-dc63983ac234","state":"Configuring","message":"found reloader for 'input'"},{"id":"kubernetes/metrics-default-kubernetes/metrics-kube-proxy-e5504271-ede5-4717-a2cd-dc63983ac234","state":"Healthy","message":"beat reloaded"},{"id":"kubernetes/metrics-default-kubernetes/metrics-events-e5504271-ede5-4717-a2cd-dc63983ac234","state":"Configuring","message":"found reloader for 'input'"},{"id":"kubernetes/metrics-default-kubernetes/metrics-kubelet-e5504271-ede5-4717-a2cd-dc63983ac234","state":"Configuring","message":"found reloader for 'input'"},{"id":"kubernetes/metrics-default-kubernetes/metrics-kube-state-metrics-e5504271-ede5-4717-a2cd-dc63983ac234","state":"Configuring","message":"found reloader for 'input'"}],"output":{"id":"kubernetes/metrics-default","state":"Healthy","message":"reloaded output component"}},"ecs.version":"1.6.0"}

Not sure if this indicates a problem or if it is related to the ID issue, but we also need to test with metrics data_streams to ensure we don't miss anything else here.

@belimawr
Contributor Author

Those logs don't seem to show anything wrong, but I'll test it as well to see if I can spot any problems.

@belimawr
Contributor Author

Actually, I've just seen that @blakerouse is assigned to this. I believe he is already on it.

@blakerouse
Contributor

I am actively working on a valid solution.

@ChrsMark
Member

ChrsMark commented Dec 5, 2022

Just to confirm: @blakerouse @belimawr did you folks also verify that #1751 (comment) is resolved by #1866? I cannot figure out if this was verified/tested too.

cc: @gizas @joshdover

@MichaelKatsoulis
Contributor

@blakerouse are those fixes tested against the Kubernetes integration? I tried with the latest snapshot of the agent and, although there is no error about input IDs, nothing works. The Agent collects neither logs nor metrics.

@belimawr
Contributor Author

belimawr commented Dec 5, 2022

Just to confirm: @blakerouse @belimawr did you folks also verify that #1751 (comment) is resolved by #1866? I cannot figure out if this was verified/tested too.

cc: @gizas @joshdover

@ChrsMark I don't see any issues in the log you posted on #1751 (comment).

In that case you'd have to remove the integration for it to stop affecting the Agent.

@blakerouse did you test the fix on Kubernetes?

@MichaelKatsoulis
Contributor

There is a constant error message about 'cluster UUID cannot be determined'

{"log.level":"error","@timestamp":"2022-12-05T14:32:35.109Z","message":"Error fetching data for metricset beat.stats: monitored beat is using Elasticsearch output but cluster UUID cannot be determined","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"service.name":"metricbeat","ecs.version":"1.6.0","log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"ecs.version":"1.6.0"}

That error appears even without the Kubernetes integration. With only kubernetes logs enabled, no extra error appears, but no logs are collected either. I cannot find the filebeat logs under /usr/share/elastic-agent/state/data/log.

With kubernetes metrics enabled, more errors are logged.

{"log.level":"info","@timestamp":"2022-12-05T14:46:41.677Z","message":"Exiting: could not start the HTTP server for the API: listen unix /tmp/elastic-agent/6966b744446645990875f2497eca086e560d90f9709852d694f866ad1ff16a39.sock: bind: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-ea51f190-7483-11ed-a2e6-cdeee1ef423a","type":"kubernetes/metrics"},"ecs.version":"1.6.0"}

and

{"log.level":"error","@timestamp":"2022-12-05T14:48:29.996Z","message":"Error fetching data for metricset beat.state: error making http request: Get \"http://unix/state\": dial unix /tmp/elastic-agent/6966b744446645990875f2497eca086e560d90f9709852d694f866ad1ff16a39.sock: connect: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

No metrics are collected and no metricbeat logs are written into a file.

@blakerouse
Contributor

Are you getting an error saying that the IDs are duplicated? If you are no longer getting that error, then this is fixed. That is what the bug was highlighting and that is what was fixed.

I did not test the Kubernetes integration; I tested whether IDs are duplicated when a dynamic variable is used. From my testing that was fixed by my PR, and unit tests for the code path also confirm and ensure the behavior doesn't break in the future.

I would be happy to look at your diagnostic information, which will include all the computed variables, the pre-configuration before variable substitution, the post-configuration after substitution, the computed expected runtime model, as well as the actual running model.

As for the 'Error fetching data for metricset beat.stats' message, that is a separate error reported by the monitoring component, which is a subprocess, and it is not causing any issues with any other running integration.

@cmacknz
Member

cmacknz commented Dec 5, 2022

There is a constant error message about 'cluster UUID cannot be determined'

This is tracked under #1860; it is preventing the Beat metrics from being collected.

I cannot find the filebeat logs under /usr/share/elastic-agent/state/data/log.

The Beat logs are now merged into the elastic-agent-* log files. That was done in #1702 as one of the last changes in V2. Bug fixes have distracted us from communicating this change.

With kubernetes metrics enabled, more errors are logged.

{"log.level":"info","@timestamp":"2022-12-05T14:46:41.677Z","message":"Exiting: could not start the HTTP server for the API: listen unix /tmp/elastic-agent/6966b744446645990875f2497eca086e560d90f9709852d694f866ad1ff16a39.sock: bind: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-ea51f190-7483-11ed-a2e6-cdeee1ef423a","type":"kubernetes/metrics"},"ecs.version":"1.6.0"}
and

{"log.level":"error","@timestamp":"2022-12-05T14:48:29.996Z","message":"Error fetching data for metricset beat.state: error making http request: Get \"http://unix/state\": dial unix /tmp/elastic-agent/6966b744446645990875f2497eca086e560d90f9709852d694f866ad1ff16a39.sock: connect: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

No metrics are collected and no metricbeat logs are written into a file.

This is likely a new bug; there have been some issues with paths in containers not being adjusted properly. I would file a new bug for this issue.

@MichaelKatsoulis
Contributor

@cmacknz I filed this new bug #1894
