Failing to index some logs from Elastic-Agent when using a different monitoring cluster #2131
I saw the same even when I had set the config to ship the logs into the same output cluster:

```yaml
monitoring:
  enabled: true
  use_output: default
  namespace: default
  logs: true
  metrics: true
```

I constantly got the same error. |
While we look for a fix, could we work around this by adding a drop_fields processor that drops the offending field? @belimawr can you try this and see if it fixes the problem? |
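For reference, a minimal sketch of what such a drop_fields entry could look like in the Filebeat configuration that collects the agent logs; the exact field path (`system_info` here) is an assumption, not a value confirmed in this thread:

```yaml
# Hypothetical workaround sketch: drop the block of the "Host info" log line
# that carries the offending IP values before it is shipped to the monitoring
# cluster. Adjust the field path to wherever the decoded JSON ends up.
processors:
  - drop_fields:
      fields: ["system_info"]
      ignore_missing: true
```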
I should also note that remote monitoring clusters aren't supported by Fleet today (elastic/kibana#104986), which is why this likely isn't tested very well. |
I didn't know that, but even if it's not supported, the Fleet UI allows setting a different monitoring cluster. I've validated this in the latest 8.6.0 release. I'll try the drop processor and see if it solves the problem. |
Yes, it works, but the processor needs to be added to the Filebeat instance that collects the Elastic-Agent logs. |
This commit is a hacky fix for elastic#2131. It's intended to unblock some development rather than being a final fix.
I created a hacky-fix commit to unblock @alexsapran. In case someone else needs it, here it is: belimawr@cfd6040; you can just download and build my branch. |
Thanks for confirming. @pierrehilbert we can work around this with a custom build of agent, but we need a solution that works by default to unblock performance testing. I've added this to the next sprint for this reason. |
Any thoughts on how this can get tested going forward? |
I believe an e2e test would be the way to catch this kind of bug. It's a behaviour emerging from a specific setup rather than a unit-testable component of the stack. |
I'm going to post some of my findings so far. The error is a result of one beat (it can be Metricbeat or Filebeat) ingesting and sending the logs of another beat; for example, a monitoring Filebeat sending the logs of an ingesting Filebeat. Along with the rest of the logs, every beat writes the following message when starting:

```json
{
"log.level": "info",
"@timestamp": "2023-02-15T17:37:43.440+0100",
"log.logger": "beat",
"log.origin": {
"file.name": "instance/beat.go",
"file.line": 1111
},
"message": "Host info",
"service.name": "filebeat",
"system_info": {
"host": {
"architecture": "arm64",
"boot_time": "2023-02-15T09:41:32.755854+01:00",
"name": "MacBook-Pro.localdomain",
"ip": [
"127.0.0.1/8",
"::1/128",
"fe80::1/64",
"fe80::6ca6:cdff:fe6a:4f59/64",
"fe80::6ca6:cdff:fe6a:4f5b/64",
"fe80::6ca6:cdff:fe6a:4f5a/64",
"fe80::f84d:89ff:fe67:b0b1/64",
"fe80::482:d6be:4cde:4ccd/64",
"192.168.1.101/24",
"fe80::8017:14ff:fe08:e5e/64",
"fe80::8017:14ff:fe08:e5e/64",
"fe80::c30f:d2ef:351:a20d/64",
"fe80::2e15:fa5c:f61c:fcc0/64",
"fe80::ce81:b1c:bd2c:69e/64"
],
"kernel_version": "22.3.0",
"mac": [
"<redacted>"
],
"os": {
"type": "macos",
"family": "darwin",
"platform": "darwin",
"name": "macOS",
"version": "13.2.1",
"major": 13,
"minor": 2,
"patch": 1,
"build": "22D68"
},
"timezone": "CET",
"timezone_offset_sec": 3600,
"id": "470010DB-B6F8-5334-976D-DCEA8564B4D6"
},
"ecs.version": "1.6.0"
}
}
```

This message is coming from this code:

Somehow a mapping in Elasticsearch gets created (or some type inference occurs, it's not clear) where

I've been trying to figure out the source of this mapping. The error is reproducible on the cloud, on a stack created by

I believe it's easier to just alter the |
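For anyone retracing this, a hedged Dev Tools sketch for locating the mapping; the data stream and template names below are stack defaults and may differ between versions:

```
# Which index template (and component templates) would apply to the
# monitoring data stream?
POST /_index_template/_simulate_index/logs-elastic_agent.filebeat-default

# Inspect the built-in component template that carries the dynamic templates
GET /_component_template/data-streams-mappings
```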
I found where the mapping is coming from. It's probably derived from this: https://github.com/elastic/elasticsearch/blob/b0ba832791a4c761753d2618eeb270ed3bfb4181/x-pack/plugin/core/src/main/resources/data-streams-mappings.json#L6

So, when a beat indexes events it creates a data stream, which is created out of an index template, which includes a component template with the following mappings:

```json
{
"dynamic_templates": [
{
"match_ip": {
"mapping": {
"type": "ip"
},
"match_mapping_type": "string",
"match": "ip"
}
},
{
"match_message": {
"mapping": {
"type": "match_only_text"
},
"match_mapping_type": "string",
"match": "message"
}
},
{
"strings_as_keyword": {
"mapping": {
"ignore_above": 1024,
"type": "keyword"
},
"match_mapping_type": "string"
}
}
],
"date_detection": false,
"properties": {
"@timestamp": {
"type": "date"
},
"ecs": {
"properties": {
"version": {
"ignore_above": 1024,
"type": "keyword"
}
}
},
"data_stream": {
"properties": {
"namespace": {
"type": "constant_keyword"
},
"dataset": {
"type": "constant_keyword"
}
}
},
"host": {
"type": "object"
}
}
}
```

This part means that any field called `ip` whose value arrives as a string gets mapped with the `ip` type:

```json
{
"match_ip": {
"match": "ip",
"match_mapping_type": "string",
"mapping": {
"type": "ip"
}
}
}
```

This is the root cause of the issue. It seems we have to ensure that any value we send in a field named `ip` is a valid IP address; the values in the "Host info" message carry a subnet mask suffix (e.g. `127.0.0.1/8`), which does not parse as the `ip` type. |
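A small Console sketch (index name illustrative) that reproduces the dynamic-template behaviour in isolation, assuming nothing beyond the `match_ip` template above:

```
# Scratch index that reuses only the match_ip dynamic template
PUT /ip-mapping-test
{
  "mappings": {
    "dynamic_templates": [
      {
        "match_ip": {
          "match": "ip",
          "match_mapping_type": "string",
          "mapping": { "type": "ip" }
        }
      }
    ]
  }
}

# A bare IP is accepted and dynamically mapped as type "ip"
POST /ip-mapping-test/_doc
{ "host": { "ip": "192.168.1.101" } }

# The same address with a subnet mask suffix is rejected with a
# mapper_parsing_exception, which is how the monitoring events get dropped
POST /ip-mapping-test/_doc
{ "host": { "ip": "192.168.1.101/24" } }
```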
Fantastic 🐰 🕳️ @rdner, thanks for tracing this issue! |
Just to summarise the impact of this issue, any event that satisfies the following will be dropped:

- it contains a field named `ip` whose value cannot be parsed as an IP address (for example an address with a subnet mask suffix, like `127.0.0.1/8`), and
- it is sent to a data stream whose mappings include the `match_ip` dynamic template shown above.
This behaviour is defined by Elasticsearch and has nothing to do with Beats. It currently affects Elastic Agent for the following reasons:

- every beat logs a "Host info" message on startup, and its `system_info.host.ip` field contains IP addresses with their subnet masks, and
- the monitoring Filebeat ships these log lines to a monitoring data stream whose mappings include the dynamic template above, which rejects such values.
The only information our customers would lose in this case is the host information of the machine where the monitored Beat is running, which is not critical. |
I make @alexsapran's words mine: "Fantastic rabbit hole @rdner, thanks for tracing this issue!" :D Just one thing I didn't understand: why does it work when we send the monitoring and the data to the same cluster? Does the Elastic-Agent modify the mapping somehow? |
@belimawr Most likely it was just hard to catch this event, since it happens only when a monitored beat restarts. I can reproduce it with a 100% rate if I read this log entry with a Filebeat and send it to a data stream (any data stream that matches the mappings above). The open question is: are we always using data streams for monitoring, or are there cases when a regular index is used? When it's just a regular index, there is no IP validation. |
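A hedged Console variant of that reproduction, assuming a default stack where the built-in `logs-*-*` index template (and therefore the component template above) is installed; the data stream name and document below are illustrative, not the exact event from the agent:

```
# Indexing a "Host info"-style payload into a logs-*-* data stream is expected
# to fail on affected versions: the leaf field named "ip" matches the dynamic
# template, and its values carry subnet masks, which the "ip" type rejects.
POST /logs-repro-default/_doc
{
  "@timestamp": "2023-02-15T16:37:43.440Z",
  "message": "Host info",
  "system_info": {
    "host": {
      "ip": ["127.0.0.1/8", "::1/128", "192.168.1.101/24"]
    }
  }
}
```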
The agent always sends to data streams. |
Whenever the Elastic-Agent is run with a monitoring cluster different from the cluster the data is sent to, some events from Metricbeat logs are dropped due to mapping issues.
For confirmed bugs, please report:

Version: main
Steps to Reproduce

1. Get two different Elastic-Cloud clusters (or any other clusters).
2. Deploy a standalone Elastic-Agent sending its data to one cluster and its monitoring data to the other, with a policy like the one below:

elastic-agent.yml
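The reporter's attachment is collapsed above; a minimal sketch of such a standalone policy, with hosts, credentials, and the single input as placeholders rather than the actual configuration:

```yaml
# Illustrative standalone elastic-agent.yml: data goes to one cluster,
# self-monitoring (logs + metrics) goes to a second one.
outputs:
  default:
    type: elasticsearch
    hosts: ["https://data-cluster.example.com:443"]   # placeholder
    api_key: "<redacted>"
  monitoring:
    type: elasticsearch
    hosts: ["https://monitoring-cluster.example.com:443"]   # placeholder
    api_key: "<redacted>"

agent.monitoring:
  enabled: true
  logs: true
  metrics: true
  use_output: monitoring   # ship agent self-monitoring to the second cluster
  namespace: default

inputs:
  - type: system/metrics   # placeholder input so the agent has data to send
    id: system-metrics
    use_output: default
    data_stream.namespace: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
```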
Events dropped logs

I have anonymised all IP/MAC addresses.

Other details

Pretty printing the `fields` field from the rejected event we have:

Event

I have anonymised all IP/MAC addresses.

The problem is with the `system_info.host.ip` field, which contains an array of IPs together with their subnet masks:

system_info.host.ip

I have anonymised all IP/MAC addresses.