
Failing to index some logs from Elastic-Agent when using a different monitoring cluster #2131

Closed
belimawr opened this issue Jan 18, 2023 · 16 comments · Fixed by elastic/beats#34599
Labels: 8.8-candidate, bug, Team:Elastic-Agent-Data-Plane


belimawr (Contributor) commented Jan 18, 2023

Whenever the Elastic-Agent is run with a monitoring cluster different from the cluster the data is sent to, some events from the Metricbeat logs are dropped due to mapping issues.

For confirmed bugs, please report:

  • Version: main
  • Operating System: Linux (confirmed on Arch Linux and Ubuntu 20.04)

Steps to Reproduce

  1. Get two different Elastic-Cloud clusters (or any other clusters)

  2. Deploy a standalone Elastic-Agent sending data to one cluster and the monitoring data to the other, with a policy like this:

elastic-agent.yml
id: standalone-policy-hand-crafted
outputs:
  default:
    type: elasticsearch
    password: bar
    username: elastic
    allow_older_versions: true
    hosts:
      - >-
        https://data-cluster.elastic-cloud.com:443
  monitoring-output:
    type: elasticsearch
    password: foo
    username: elastic
    allow_older_versions: true
    hosts:
      - 'https://monitoring-cluster.elastic-cloud.com:443'
agent:
  download:
    sourceURI: 'https://artifacts.elastic.co/downloads/'
  monitoring:
    enabled: true
    use_output: monitoring-output
    namespace: default
    logs: true
    metrics: true
inputs:
  - id: log-input-like-the-integration-input-id
    name: log-input
    type: logfile
    use_output: default
    data_stream:
      namespace: default
    streams:
      - id: log-input-like-the-integration-stream-id
        data_stream:
          dataset: generic
        paths:
          - /tmp/flog*
  3. Start the Elastic-Agent in standalone mode:
./elastic-agent -c elastic-agent.yml
  4. Wait for a while and you will see logs like these:
Events dropped logs (I have anonymised all IP/MAC addresses):
{"log.level":"warn","@timestamp":"2023-01-17T20:12:12.533Z","message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.January, 17, 20, 12, 9, 149000000, time.UTC), Meta:{\"input_id\":\"filestream-monitoring-agent\",\"raw_index\":\"logs-elastic_agent.metricbeat-default\",\"stream_id\":\"filestream-monitoring-agent\"}, Fields:{\"agent\":{\"ephemeral_id\":\"b12ea210-5273-46f0-a95e-6f1d64380813\",\"id\":\"c129f4df-ec30-4f10-b460-c25c35a63c4f\",\"name\":\"archlinux\",\"type\":\"filebeat\",\"version\":\"8.7.0\"},\"component\":{\"binary\":\"metricbeat\",\"dataset\":\"elastic_agent.metricbeat\",\"id\":\"beat/metrics-monitoring\",\"type\":\"beat/metrics\"},\"container\":{\"id\":\"elastic-agent-13136c\"},\"data_stream\":{\"dataset\":\"elastic_agent.metricbeat\",\"namespace\":\"default\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"c129f4df-ec30-4f10-b460-c25c35a63c4f\",\"snapshot\":true,\"version\":\"8.7.0\"},\"event\":{\"dataset\":\"elastic_agent.metricbeat\"},\"host\":{\"architecture\":\"x86_64\",\"containerized\":false,\"hostname\":\"archlinux\",\"id\":\"21282bcb80a74c08a0d14a047372256c\",\"ip\":[\"10.0.1.2\",\"aaaa::aaa:ffff:bbbb:cccc\",\"172.17.0.1\"],\"mac\":[\"BA-BA-BA-BA-BA-BA\",\"AA-AA-BB-BB-BB-BB\"],\"name\":\"archlinux\",\"os\":{\"build\":\"rolling\",\"family\":\"\",\"kernel\":\"6.1.5-arch2-1\",\"name\":\"Arch Linux\",\"platform\":\"arch\",\"type\":\"linux\",\"version\":\"\"}},\"input\":{\"type\":\"filestream\"},\"log\":{\"file\":{\"path\":\"/home/vagrant/elastic-agent-8.7.0-SNAPSHOT-linux-x86_64/data/elastic-agent-13136c/logs/elastic-agent-20230117.ndjson\"},\"offset\":38537},\"log.level\":\"info\",\"log.logger\":\"beat\",\"log.origin\":{\"file.line\":1111,\"file.name\":\"instance/beat.go\"},\"message\":\"Host info\",\"service.name\":\"metricbeat\",\"system_info\":{\"ecs.version\":\"1.6.0\",\"host\":{\"architecture\":\"x86_64\",\"boot_time\":\"2023-01-17T18:35:01Z\",\"containerized\":false,\"id\":\"21282bcb80a74c08a0d14a047372256c\",\"ip\":[\"127.0.0.1/8\",\"::1/128\",\"10.0.2.15/24\",\"ffff::ccc:aaaa:baba:abab/64\",\"172.17.0.1/16\"],\"kernel_version\":\"6.1.5-arch2-1\",\"mac\":[\"BB:BB:BB:CC:CC:CC\",\"AA:BB:CC:DD:EE:FF\"],\"name\":\"archlinux\",\"os\":{\"build\":\"rolling\",\"family\":\"\",\"major\":0,\"minor\":0,\"name\":\"Arch Linux\",\"patch\":0,\"platform\":\"arch\",\"type\":\"linux\",\"version\":\"\"},\"timezone\":\"UTC\",\"timezone_offset_sec\":0}}}, Private:(*input_logfile.updateOp)(0xc000f09cb0), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [system_info.host.ip] of type [ip] in document with id 'hARcwYUBOhEXiMJ_VjvY'. Preview of field's value: '127.0.0.1/8'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"'127.0.0.1/8' is not an IP string literal.\"}}, dropping event!","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":429,"file.name":"elasticsearch/client.go"},"service.name":"filebeat","ecs.version":"1.6.0"}

Other details

Pretty-printing the Fields field from the rejected event, we have:

Event (I have anonymised all IP/MAC addresses):
{
  "agent": {
    "ephemeral_id": "b12ea210-5273-46f0-a95e-6f1d64380813",
    "id": "c129f4df-ec30-4f10-b460-c25c35a63c4f",
    "name": "archlinux",
    "type": "filebeat",
    "version": "8.7.0"
  },
  "component": {
    "binary": "metricbeat",
    "dataset": "elastic_agent.metricbeat",
    "id": "beat/metrics-monitoring",
    "type": "beat/metrics"
  },
  "container": {
    "id": "elastic-agent-13136c"
  },
  "data_stream": {
    "dataset": "elastic_agent.metricbeat",
    "namespace": "default",
    "type": "logs"
  },
  "ecs": {
    "version": "8.0.0"
  },
  "elastic_agent": {
    "id": "c129f4df-ec30-4f10-b460-c25c35a63c4f",
    "snapshot": true,
    "version": "8.7.0"
  },
  "event": {
    "dataset": "elastic_agent.metricbeat"
  },
  "host": {
    "architecture": "x86_64",
    "containerized": false,
    "hostname": "archlinux",
    "id": "21282bcb80a74c08a0d14a047372256c",
    "ip": [
      "10.0.1.2",
      "aaaa::aaa:ffff:bbbb:cccc",
      "172.17.0.1"
    ],
    "mac": [
      "BA-BA-BA-BA-BA-BA",
      "AA-AA-BB-BB-BB-BB"
    ],
    "name": "archlinux",
    "os": {
      "build": "rolling",
      "family": "",
      "kernel": "6.1.5-arch2-1",
      "name": "Arch Linux",
      "platform": "arch",
      "type": "linux",
      "version": ""
    }
  },
  "input": {
    "type": "filestream"
  },
  "log": {
    "file": {
      "path": "/home/vagrant/elastic-agent-8.7.0-SNAPSHOT-linux-x86_64/data/elastic-agent-13136c/logs/elastic-agent-20230117.ndjson"
    },
    "offset": 38537
  },
  "log.level": "info",
  "log.logger": "beat",
  "log.origin": {
    "file.line": 1111,
    "file.name": "instance/beat.go"
  },
  "message": "Host info",
  "service.name": "metricbeat",
  "system_info": {
    "ecs.version": "1.6.0",
    "host": {
      "architecture": "x86_64",
      "boot_time": "2023-01-17T18:35:01Z",
      "containerized": false,
      "id": "21282bcb80a74c08a0d14a047372256c",
      "ip": [
        "127.0.0.1/8",
        "::1/128",
        "10.0.2.15/24",
        "ffff::ccc:aaaa:baba:abab/64",
        "172.17.0.1/16"
      ],
      "kernel_version": "6.1.5-arch2-1",
      "mac": [
        "BB:BB:BB:CC:CC:CC",
        "AA:BB:CC:DD:EE:FF"
      ],
      "name": "archlinux",
      "os": {
        "build": "rolling",
        "family": "",
        "major": 0,
        "minor": 0,
        "name": "Arch Linux",
        "patch": 0,
        "platform": "arch",
        "type": "linux",
        "version": ""
      },
      "timezone": "UTC",
      "timezone_offset_sec": 0
    }
  }
}

The problem is with the system_info.host.ip field, which contains an array of IPs with their subnet masks:

system_info.host.ip (I have anonymised all IP/MAC addresses):
  "system_info": {
    "host": {
      "ip": [
        "127.0.0.1/8",
        "::1/128",
        "10.0.2.15/24",
        "ffff::ccc:aaaa:baba:abab/64",
        "172.17.0.1/16"
      ]
    }
  }
belimawr added the bug and Team:Elastic-Agent-Data-Plane labels on Jan 18, 2023
alexsapran (Contributor) commented:

I saw the same error even when I had configured the logs to be shipped to the same output cluster.

monitoring:
    enabled: true
    use_output: default
    namespace: default
    logs: true
    metrics: true

I constantly got the Cannot index event publisher.Event error.

cmacknz (Member) commented Jan 19, 2023

While we look for a fix, could we work around this by adding a drop_fields processor that drops system_info.host.ip before it is sent to Elasticsearch?

@belimawr can you try this and see if it fixes the problem?
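
For reference, a drop_fields processor along these lines would remove the offending field before it is sent (this is only a sketch of the standard Beats processor syntax; as noted further down in this thread, it has to end up in the configuration of the monitoring Filebeat that collects the Elastic-Agent logs, not in the user's own inputs):

processors:
  - drop_fields:
      fields: ["system_info.host.ip"]
      ignore_missing: true  # don't fail on events that don't carry this field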

cmacknz (Member) commented Jan 19, 2023

I should also note that remote monitoring clusters aren't supported by Fleet today (elastic/kibana#104986) which is why this likely isn't tested very well.

belimawr (Contributor, Author) commented:

> I should also note that remote monitoring clusters aren't supported by Fleet today (elastic/kibana#104986) which is why this likely isn't tested very well.

I didn't know that, but even if it's not supported, the Fleet UI allows setting a different monitoring cluster. I've validated it in the latest 8.6.0 release.

Screenshot from Kibana/Fleet UI 8.6.0


I'll try the drop processor and see if it solves the problem.

belimawr (Contributor, Author) commented:

> While we look for a fix, could we work around this by adding a drop_fields processor that drops system_info.host.ip before it is sent to Elasticsearch?

> @belimawr can you try this and see if it fixes the problem?

Yes, it works, but it needs to be added to the Filebeat that collects the Elastic-Agent logs.

belimawr added a commit to belimawr/elastic-agent that referenced this issue Jan 24, 2023
This commit is a hacky fix for elastic#2131. It's intended to unblock some development rather than being a final fix.
belimawr (Contributor, Author) commented:

I created a hacky fix commit to unblock @alexsapran. In case someone else needs it, here it is: belimawr@cfd6040. You can just download and build my branch quick-fix-dropped-monitoring-event.

cmacknz (Member) commented Jan 24, 2023

Thanks for confirming. @pierrehilbert we can work around this with a custom build of agent, but we need a solution that works by default to unblock performance testing. I've added this to the next sprint for this reason.

pierrehilbert assigned rdner and unassigned belimawr on Jan 25, 2023
amitkanfer (Contributor) commented:

Any thoughts on how this can get tested going forward?
Is there something we can test at compile time (e.g. making sure the mapping file is aligned with what the actual code is going to produce)?

belimawr (Contributor, Author) commented:

I believe an e2e test would be the way to catch this kind of bug. It's a behaviour emerging from a specific setup rather than something that can be unit tested in a single component of the stack.

rdner (Member) commented Feb 15, 2023

I'm going to post some of my findings so far.

The error is the result of a Beat (it can be Metricbeat or Filebeat) ingesting and sending the logs of another Beat. For example, a monitoring Filebeat sending the logs of an ingesting Filebeat.

Along with the rest of its logs, every Beat writes the following message when starting:

{
  "log.level": "info",
  "@timestamp": "2023-02-15T17:37:43.440+0100",
  "log.logger": "beat",
  "log.origin": {
    "file.name": "instance/beat.go",
    "file.line": 1111
  },
  "message": "Host info",
  "service.name": "filebeat",
  "system_info": {
    "host": {
      "architecture": "arm64",
      "boot_time": "2023-02-15T09:41:32.755854+01:00",
      "name": "MacBook-Pro.localdomain",
      "ip": [
        "127.0.0.1/8",
        "::1/128",
        "fe80::1/64",
        "fe80::6ca6:cdff:fe6a:4f59/64",
        "fe80::6ca6:cdff:fe6a:4f5b/64",
        "fe80::6ca6:cdff:fe6a:4f5a/64",
        "fe80::f84d:89ff:fe67:b0b1/64",
        "fe80::482:d6be:4cde:4ccd/64",
        "192.168.1.101/24",
        "fe80::8017:14ff:fe08:e5e/64",
        "fe80::8017:14ff:fe08:e5e/64",
        "fe80::c30f:d2ef:351:a20d/64",
        "fe80::2e15:fa5c:f61c:fcc0/64",
        "fe80::ce81:b1c:bd2c:69e/64"
      ],
      "kernel_version": "22.3.0",
      "mac": [
        "<redacted>"
      ],
      "os": {
        "type": "macos",
        "family": "darwin",
        "platform": "darwin",
        "name": "macOS",
        "version": "13.2.1",
        "major": 13,
        "minor": 2,
        "patch": 1,
        "build": "22D68"
      },
      "timezone": "CET",
      "timezone_offset_sec": 3600,
      "id": "470010DB-B6F8-5334-976D-DCEA8564B4D6"
    },
    "ecs.version": "1.6.0"
  }
}

This message is coming from this code:

https://github.com/elastic/beats/blob/47fe3c350f3f5040c1bf351246ad8aaf798d830d/libbeat/cmd/instance/beat.go#L1110-L1112

Somehow a mapping gets created in Elasticsearch (or some type inference occurs, it's not clear) where system_info.host.ip is mapped to the IP field type, but a value such as "127.0.0.1/8" would technically fall under the IP range field type, hence the validation error.

I've been trying to figure out the source of this mapping. The error is reproducible on the cloud, on a stack created by elastic-package, and in our testing environment.

I believe it's easier to just alter the system_info.host.ip values, trimming the masks to satisfy the validation requirement. It seems like the mask suffixes should not have been there anyway.

rdner (Member) commented Feb 15, 2023

I found where the mapping is coming from. It's probably derived from this: https://github.com/elastic/elasticsearch/blob/b0ba832791a4c761753d2618eeb270ed3bfb4181/x-pack/plugin/core/src/main/resources/data-streams-mappings.json#L6

So, when a Beat indexes events, it creates a data stream like this:

(screenshot: the data stream as shown in Kibana)

That data stream is created out of the logs index template:

(screenshot: the logs index template as shown in Kibana)

The index template includes a component template, data-streams-mappings, which contains these definitions:

{
  "dynamic_templates": [
    {
      "match_ip": {
        "mapping": {
          "type": "ip"
        },
        "match_mapping_type": "string",
        "match": "ip"
      }
    },
    {
      "match_message": {
        "mapping": {
          "type": "match_only_text"
        },
        "match_mapping_type": "string",
        "match": "message"
      }
    },
    {
      "strings_as_keyword": {
        "mapping": {
          "ignore_above": 1024,
          "type": "keyword"
        },
        "match_mapping_type": "string"
      }
    }
  ],
  "date_detection": false,
  "properties": {
    "@timestamp": {
      "type": "date"
    },
    "ecs": {
      "properties": {
        "version": {
          "ignore_above": 1024,
          "type": "keyword"
        }
      }
    },
    "data_stream": {
      "properties": {
        "namespace": {
          "type": "constant_keyword"
        },
        "dataset": {
          "type": "constant_keyword"
        }
      }
    },
    "host": {
      "type": "object"
    }
  }
}

This part means that any field called ip will be mapped to the IP field type and, as a consequence, validated against that type:

        {
          "match_ip": {
            "match": "ip",
            "match_mapping_type": "string",
            "mapping": {
              "type": "ip"
            }
          }
        }

This is the root cause of the issue.

It seems like we have to ensure that any value we send in a field named ip is a valid IP address without a network mask.

alexsapran (Contributor) commented:

Fantastic 🐰 🕳️ @rdner, thanks for tracing this issue!

rdner (Member) commented Feb 16, 2023

Just to summarise the impact of this issue:

Any event that satisfies the following will be dropped:

  • Has a field named ip (at any level) that does not contain a valid IP address value. Valid examples are ::afff:4567:890a and 10.10.12.123 (note: no netmask and no port).
  • Is sent to any data stream in Elasticsearch

This behaviour is defined by Elasticsearch and has nothing to do with Beats.

It currently affects Elastic Agent for the following reasons:

  • Elastic Agent spawns a Filebeat that ingests logs of the main Filebeat (used for ingestion) for monitoring purposes
  • On startup, every Beat logs a Host info entry that contains "invalid" IP addresses: they include the netmask, e.g. fe80::8017:14ff:fe08:e5e/64 and 127.0.0.1/8
  • The monitoring Filebeat fails to send the Host info log entry to Elasticsearch due to the validation error, and the event gets dropped

The only information our customers would lose in this case is the host information of the machine where the monitored Beat is running, which is not critical.

belimawr (Contributor, Author) commented:

I make @alexsapran's words mine: "Fantastic rabbit hole @rdner, thanks for tracing this issue!" :D

Just one thing I didn't understand: why does it work when we send the monitoring data and the ingested data to the same cluster? Does the Elastic-Agent modify the mapping somehow?

rdner (Member) commented Feb 16, 2023

@belimawr Most likely it was just hard to catch this event, since it happens only when a monitored Beat restarts. I can reproduce it at a 100% rate if I read this log entry with a Filebeat and send it to a data stream (any data stream that matches logs-*-*).

The open question is: are we always using data streams for monitoring, or are there cases where a regular index is used? When it's just a regular index, there is no IP validation.
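
For anyone who wants to reproduce this outside of the Elastic-Agent, a minimal standalone Filebeat configuration along these lines should trigger the same mapper_parsing_exception (an untested sketch; the hosts, credentials, file path, and data stream name are placeholders):

filebeat.inputs:
  - type: filestream
    id: host-info-repro
    paths:
      - /tmp/host-info.ndjson   # a file containing the "Host info" log line shown above
    parsers:
      - ndjson:
          target: ""            # lift the JSON keys (including system_info) to the event root

output.elasticsearch:
  hosts: ["https://monitoring-cluster.elastic-cloud.com:443"]
  username: elastic
  password: changeme
  index: "logs-elastic_agent.metricbeat-default"   # any data stream matching logs-*-*

setup.template.enabled: false   # skip template setup since we write to an existing data stream
setup.ilm.enabled: false        # likewise for ILM setup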

cmacknz (Member) commented Feb 16, 2023

> are we always using data streams for monitoring, or are there cases where a regular index is used?

The agent always sends to data streams.
