Failing to index some logs from Elastic-Agent when using a different monitoring cluster #2131
I saw the same even when I had set the config to ship the logs into the same output cluster:

```yaml
monitoring:
  enabled: true
  use_output: default
  namespace: default
  logs: true
  metrics: true
```

I constantly got the same error. |
While we look for a fix, could we work around this by adding a drop_fields processor that drops the offending field? @belimawr can you try this and see if it fixes the problem? |
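For reference, a minimal sketch of what such a drop_fields entry could look like in the Filebeat configuration that collects the agent logs; the exact field path (`system_info` here) is an assumption, not a value confirmed in this thread:

```yaml
# Hypothetical workaround sketch: drop the block of the "Host info" log line
# that carries the offending IP values before it is shipped to the monitoring
# cluster. Adjust the field path to wherever the decoded JSON ends up.
processors:
  - drop_fields:
      fields: ["system_info"]
      ignore_missing: true
```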
I should also note that remote monitoring clusters aren't supported by Fleet today (elastic/kibana#104986), which is why this likely isn't tested very well. |
I didn't know that, but even if it's not supported, the Fleet UI allows setting a different monitoring cluster. I've validated this in the latest 8.6.0 release. I'll try the drop processor and see if it solves the problem. |
Yes, it works, but the processor needs to be added to the Filebeat instance that collects the Elastic-Agent logs. |
This commit is a hacky fix for elastic#2131. It's intended to unblock some development rather than being a final fix.
I created a hacky-fix commit to unblock @alexsapran. In case someone else needs it, here it is: belimawr@cfd6040; you can just download and build my branch. |
Thanks for confirming. @pierrehilbert we can work around this with a custom build of agent, but we need a solution that works by default to unblock performance testing. I've added this to the next sprint for this reason. |
Any thoughts on how this can get tested going forward? |
I believe an e2e test would be the way to catch this kind of bug. It's a behaviour emerging from a specific setup rather than a unit-testable component of the stack. |
I'm going to post some of my findings so far. The error is a result of one beat (it can be Metricbeat or Filebeat) ingesting and sending the logs of another beat; for example, a monitoring Filebeat sending the logs of an ingesting Filebeat. Along with the rest of the logs, every beat writes the following message when starting:

```json
{
"log.level": "info",
"@timestamp": "2023-02-15T17:37:43.440+0100",
"log.logger": "beat",
"log.origin": {
"file.name": "instance/beat.go",
"file.line": 1111
},
"message": "Host info",
"service.name": "filebeat",
"system_info": {
"host": {
"architecture": "arm64",
"boot_time": "2023-02-15T09:41:32.755854+01:00",
"name": "MacBook-Pro.localdomain",
"ip": [
"127.0.0.1/8",
"::1/128",
"fe80::1/64",
"fe80::6ca6:cdff:fe6a:4f59/64",
"fe80::6ca6:cdff:fe6a:4f5b/64",
"fe80::6ca6:cdff:fe6a:4f5a/64",
"fe80::f84d:89ff:fe67:b0b1/64",
"fe80::482:d6be:4cde:4ccd/64",
"192.168.1.101/24",
"fe80::8017:14ff:fe08:e5e/64",
"fe80::8017:14ff:fe08:e5e/64",
"fe80::c30f:d2ef:351:a20d/64",
"fe80::2e15:fa5c:f61c:fcc0/64",
"fe80::ce81:b1c:bd2c:69e/64"
],
"kernel_version": "22.3.0",
"mac": [
"<redacted>"
],
"os": {
"type": "macos",
"family": "darwin",
"platform": "darwin",
"name": "macOS",
"version": "13.2.1",
"major": 13,
"minor": 2,
"patch": 1,
"build": "22D68"
},
"timezone": "CET",
"timezone_offset_sec": 3600,
"id": "470010DB-B6F8-5334-976D-DCEA8564B4D6"
},
"ecs.version": "1.6.0"
}
}
```

This message is coming from this code:

Somehow a mapping in Elasticsearch gets created (or some type inference occurs, it's not clear) where

I've been trying to figure out the source of this mapping. The error is reproducible on the cloud, on a stack created by

I believe it's easier to just alter the |
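For anyone retracing this, a hedged Dev Tools sketch for locating the mapping; the data stream and template names below are stack defaults and may differ between versions:

```
# Which index template (and component templates) would apply to the
# monitoring data stream?
POST /_index_template/_simulate_index/logs-elastic_agent.filebeat-default

# Inspect the built-in component template that carries the dynamic templates
GET /_component_template/data-streams-mappings
```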
I found where the mapping is coming from. It's probably derived from this: https://github.com/elastic/elasticsearch/blob/b0ba832791a4c761753d2618eeb270ed3bfb4181/x-pack/plugin/core/src/main/resources/data-streams-mappings.json#L6

So, when a beat indexes events it creates a data stream, which is created out of an index template, which includes a component template with the following mappings:

```json
{
"dynamic_templates": [
{
"match_ip": {
"mapping": {
"type": "ip"
},
"match_mapping_type": "string",
"match": "ip"
}
},
{
"match_message": {
"mapping": {
"type": "match_only_text"
},
"match_mapping_type": "string",
"match": "message"
}
},
{
"strings_as_keyword": {
"mapping": {
"ignore_above": 1024,
"type": "keyword"
},
"match_mapping_type": "string"
}
}
],
"date_detection": false,
"properties": {
"@timestamp": {
"type": "date"
},
"ecs": {
"properties": {
"version": {
"ignore_above": 1024,
"type": "keyword"
}
}
},
"data_stream": {
"properties": {
"namespace": {
"type": "constant_keyword"
},
"dataset": {
"type": "constant_keyword"
}
}
},
"host": {
"type": "object"
}
}
}
```

This part means that any field called `ip` whose value arrives as a string gets mapped with the `ip` type:

```json
{
"match_ip": {
"match": "ip",
"match_mapping_type": "string",
"mapping": {
"type": "ip"
}
}
}
```

This is the root cause of the issue. It seems we have to ensure that any value we send in a field named `ip` is a valid IP address; the values in the "Host info" message carry a subnet mask suffix (e.g. `127.0.0.1/8`), which does not parse as the `ip` type. |
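A small Console sketch (index name illustrative) that reproduces the dynamic-template behaviour in isolation, assuming nothing beyond the `match_ip` template above:

```
# Scratch index that reuses only the match_ip dynamic template
PUT /ip-mapping-test
{
  "mappings": {
    "dynamic_templates": [
      {
        "match_ip": {
          "match": "ip",
          "match_mapping_type": "string",
          "mapping": { "type": "ip" }
        }
      }
    ]
  }
}

# A bare IP is accepted and dynamically mapped as type "ip"
POST /ip-mapping-test/_doc
{ "host": { "ip": "192.168.1.101" } }

# The same address with a subnet mask suffix is rejected with a
# mapper_parsing_exception, which is how the monitoring events get dropped
POST /ip-mapping-test/_doc
{ "host": { "ip": "192.168.1.101/24" } }
```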
Fantastic 🐰 🕳️ @rdner, thanks for tracing this issue! |
Just to summarise the impact of this issue, any event that satisfies the following will be dropped:

- it contains a field named `ip` whose value cannot be parsed as an IP address (for example an address with a subnet mask suffix, like `127.0.0.1/8`), and
- it is sent to a data stream whose mappings include the `match_ip` dynamic template shown above.
This behaviour is defined by Elasticsearch and has nothing to do with Beats. It currently affects Elastic Agent for the following reasons:

- every beat logs a "Host info" message on startup, and its `system_info.host.ip` field contains IP addresses with their subnet masks, and
- the monitoring Filebeat ships these log lines to a monitoring data stream whose mappings include the dynamic template above, which rejects such values.
The only information our customers would lose in this case is the host information of the machine where the monitored Beat is running, which is not critical. |
I make @alexsapran's words mine: "Fantastic rabbit hole @rdner, thanks for tracing this issue!" :D Just one thing I didn't understand: why does it work when we send the monitoring and the data to the same cluster? Does the Elastic-Agent modify the mapping somehow? |
@belimawr Most likely it was just hard to catch this event, since it happens only when a monitored beat restarts. I can reproduce it with a 100% rate if I read this log entry with a Filebeat and send it to a data stream (any data stream that matches the mappings above). The open question is: are we always using data streams for monitoring, or are there cases when a regular index is used? When it's just a regular index, there is no IP validation. |
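A hedged Console variant of that reproduction, assuming a default stack where the built-in `logs-*-*` index template (and therefore the component template above) is installed; the data stream name and document below are illustrative, not the exact event from the agent:

```
# Indexing a "Host info"-style payload into a logs-*-* data stream is expected
# to fail on affected versions: the leaf field named "ip" matches the dynamic
# template, and its values carry subnet masks, which the "ip" type rejects.
POST /logs-repro-default/_doc
{
  "@timestamp": "2023-02-15T16:37:43.440Z",
  "message": "Host info",
  "system_info": {
    "host": {
      "ip": ["127.0.0.1/8", "::1/128", "192.168.1.101/24"]
    }
  }
}
```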
The agent always sends to data streams. |
Whenever the Elastic-Agent is run with a monitoring cluster different from the cluster the data is sent to, some events from Metricbeat logs are dropped due to mapping issues.
For confirmed bugs, please report:

Version: main
Steps to Reproduce

1. Get two different Elastic-Cloud clusters (or any other clusters).
2. Deploy a standalone Elastic-Agent sending its data to one cluster and its monitoring data to the other, with a policy like the one below:

elastic-agent.yml
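The reporter's attachment is collapsed above; a minimal sketch of such a standalone policy, with hosts, credentials, and the single input as placeholders rather than the actual configuration:

```yaml
# Illustrative standalone elastic-agent.yml: data goes to one cluster,
# self-monitoring (logs + metrics) goes to a second one.
outputs:
  default:
    type: elasticsearch
    hosts: ["https://data-cluster.example.com:443"]   # placeholder
    api_key: "<redacted>"
  monitoring:
    type: elasticsearch
    hosts: ["https://monitoring-cluster.example.com:443"]   # placeholder
    api_key: "<redacted>"

agent.monitoring:
  enabled: true
  logs: true
  metrics: true
  use_output: monitoring   # ship agent self-monitoring to the second cluster
  namespace: default

inputs:
  - type: system/metrics   # placeholder input so the agent has data to send
    id: system-metrics
    use_output: default
    data_stream.namespace: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
```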
Events dropped logs

I have anonymised all IP/MAC addresses.

Other details

Pretty printing the `fields` field from the rejected event we have:

Event

I have anonymised all IP/MAC addresses.

The problem is with the `system_info.host.ip` field, which contains an array of IPs together with their subnet masks:

system_info.host.ip

I have anonymised all IP/MAC addresses.