Log any FQDN lookup errors and fallback to OS-reported hostname #34946
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
if err != nil {
	// FQDN lookup is "best effort". We log the error, fallback to
	// the OS-reported hostname, and move on.
	p.logger.Warnf("unable to lookup FQDN: %s, using hostname = %s as FQDN", err.Error(), hostname)
How often would this be logged? We may want to lower this to Debug to avoid unnecessarily spamming the logs when this fails.
It feels like we would be better off with a metrics counter for how often this resolution has failed.
That would tell us it's failing without worrying about log spam, and we could then turn on debug logging to see exactly what is happening.
> How often would this be logged? We may want to lower this to Debug to avoid unnecessarily spamming the logs when this fails.

Whenever the `add_host_metadata` cache expires. By default, that's every 5 minutes. That's quite frequent, so agreed on lowering the log level here to debug.
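To make the frequency concrete, here is a minimal sketch of a TTL-based cache, assuming the lookup runs on every reload. The type `expiringCache` and the helper `simulateReloads` are hypothetical names for illustration and are not taken from the processor's code.

```go
package main

import (
	"fmt"
	"time"
)

// expiringCache is a hypothetical stand-in for the processor's host-data
// cache: it reloads its value once the TTL has elapsed.
type expiringCache struct {
	ttl      time.Duration
	now      func() time.Time // injectable clock, so the example never sleeps
	load     func() string    // called on every (re)load; a lookup would run here
	value    string
	loadedAt time.Time
	loads    int
}

// get returns the cached value, reloading it first if it is missing or stale.
func (c *expiringCache) get() string {
	now := c.now()
	if c.loads == 0 || now.Sub(c.loadedAt) >= c.ttl {
		c.value = c.load()
		c.loadedAt = now
		c.loads++
	}
	return c.value
}

// simulateReloads pushes one event per minute through a cache with the
// given TTL and returns how many times the value was (re)loaded.
func simulateReloads(minutes, ttlMinutes int) int {
	clock := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)
	c := &expiringCache{
		ttl:  time.Duration(ttlMinutes) * time.Minute,
		now:  func() time.Time { return clock },
		load: func() string { return "cowchicken" },
	}
	for i := 0; i < minutes; i++ {
		c.get()
		clock = clock.Add(time.Minute)
	}
	return c.loads
}

func main() {
	// With a 5-minute TTL, one hour of traffic triggers 12 reloads.
	fmt.Println("reloads in one hour:", simulateReloads(60, 5))
}
```

Under these assumptions a failing lookup would produce 12 log lines per hour per processor instance, which is why a warning-level message was considered too noisy.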
> It feels like we would be better off with a metrics counter for how often this resolution has failed.
> That would tell us it's failing without worrying about log spam, and we could then turn on debug logging to see exactly what is happening.

Makes sense. Let me look into implementing a metrics counter.
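The counter-plus-debug-log approach suggested here could be sketched as follows. This uses only the standard library; `lookupFailures`, `recordLookupFailure`, `failureCount`, and the `debugEnabled` switch are hypothetical names, and the actual change presumably exposes its metric through the Beats monitoring registry rather than a bare atomic.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// lookupFailures counts failed FQDN lookups (hypothetical name).
var lookupFailures int64

// debugEnabled stands in for a debug log level that is off by default.
var debugEnabled = false

// debugf prints only when debug logging has been switched on.
func debugf(format string, args ...interface{}) {
	if debugEnabled {
		fmt.Printf("DEBUG "+format+"\n", args...)
	}
}

// recordLookupFailure always bumps the counter, but emits the details
// only at debug level, so routine failures do not flood the logs.
func recordLookupFailure(err error, hostname string) {
	atomic.AddInt64(&lookupFailures, 1)
	debugf("unable to lookup FQDN: %s, using hostname = %s as FQDN", err, hostname)
}

// failureCount returns the current value of the counter.
func failureCount() int64 {
	return atomic.LoadInt64(&lookupFailures)
}

func main() {
	err := errors.New("no such host")
	for i := 0; i < 3; i++ {
		recordLookupFailure(err, "cowchicken")
	}
	fmt.Println("fqdn lookup failures:", failureCount())
}
```

The counter answers "is it failing, and how often?" at negligible cost; setting `debugEnabled` to true (the analogue of turning on debug logging) then shows the individual errors.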
I assume writing a regression test for this is challenging because it requires us to modify the hostname of the system running the tests to one that doesn't resolve via DNS?
Also your test steps are excellent, thank you! I followed them exactly and the lack of FQDN logs caught that my attempt to switch branches failed because I had local file modifications to discard first :p
Yes. Perhaps we can do something with a Docker testcontainer, though. Let me look into that.
Thanks to @leehinman's suggestion about https://github.com/foxcpp/go-mockdns, I was able to implement a couple of unit test cases: one where the FQDN lookup will succeed and one where it will fail. I've also implemented the metrics counter and dropped the log level for the FQDN lookup failure. This PR is ready for review again.
// New constructs a new add_host_metadata processor.
func New(cfg *config.C) (processors.Processor, error) {
	c := defaultConfig()
	if err := cfg.Unpack(&c); err != nil {
		return nil, fmt.Errorf("fail to unpack the %v configuration: %w", processorName, err)
	}

	// Logging and metrics (each processor instance has a unique ID).
	var (
		id = int(instanceID.Inc())
First time I've seen this way of getting around the namespace collision issue. CC @fearful-symmetry, this could help the version of this problem you have in the shipper.
huh, interesting!
Looks good. My comment about the processors stats being per instance of the processor is more of a nit.
Thanks for finding a way to test this!
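For readers unfamiliar with the per-instance ID trick discussed in this thread, the following is a rough sketch of the idea, assuming a registry that rejects duplicate names. The `registry` type and `newInstanceName` helper are hypothetical and are not the libbeat monitoring API; only the atomic-counter pattern mirrors the `instanceID.Inc()` line in the diff.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// instanceID hands out a unique ID to each processor instance.
var instanceID int64

// registry is a hypothetical stand-in for a monitoring registry that
// refuses to register the same name twice.
type registry struct {
	mu    sync.Mutex
	names map[string]bool
}

// register records name, or returns an error if it is already taken.
func (r *registry) register(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names[name] {
		return fmt.Errorf("name %q already registered", name)
	}
	r.names[name] = true
	return nil
}

// newInstanceName returns a name that is unique within this process by
// appending an atomically incremented counter to the prefix.
func newInstanceName(prefix string) string {
	id := atomic.AddInt64(&instanceID, 1)
	return fmt.Sprintf("%s.%d", prefix, id)
}

func main() {
	reg := &registry{names: map[string]bool{}}
	for i := 0; i < 3; i++ {
		name := newInstanceName("processor.add_host_metadata")
		if err := reg.register(name); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("registered", name)
	}
}
```

Per-instance names avoid collisions but split the statistics across instances. The commit list for this PR indicates it later moved to a single namespace shared by all `add_host_metadata` instances, created only once.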
/test
* Log any FQDN lookup errors, fallback to OS hostname, and move on.
* Implement fallback logic
* Bumping up go-sysinfo dependency version
* Update call to host.ReportInfo
* Better log message
* Bumping up version on elastic-agent-system-metrics dependency
* Updating NOTICE.txt
* Log FQDN lookup failure as warning
* Move FQDN initialization to after logging has been configured
* Better log message
* Add metric + test
* Update NOTICE.txt
* Use a single namespace for all add_host_metadata processor instances' monitoring
* Fixing imports
* Create monitoring registry only once
* Check errors

(cherry picked from commit 895505c)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
#	libbeat/cmd/instance/beat.go
#	libbeat/processors/add_host_metadata/add_host_metadata.go
#	libbeat/processors/add_host_metadata/add_host_metadata_test.go
…reported hostname (#34971)

* Log any FQDN lookup errors and fallback to OS-reported hostname (#34946)
* Log any FQDN lookup errors, fallback to OS hostname, and move on.
* Implement fallback logic
* Bumping up go-sysinfo dependency version
* Update call to host.ReportInfo
* Better log message
* Bumping up version on elastic-agent-system-metrics dependency
* Updating NOTICE.txt
* Log FQDN lookup failure as warning
* Move FQDN initialization to after logging has been configured
* Better log message
* Add metric + test
* Update NOTICE.txt
* Use a single namespace for all add_host_metadata processor instances' monitoring
* Fixing imports
* Create monitoring registry only once
* Check errors

(cherry picked from commit 895505c)

* Making changes lost in rebase
* Fixing conflicts
* Reordering imports

---------

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
What does this PR do?
This PR fixes a bug wherein a Beat would fail to start if the FQDN lookup failed.
Why is it important?
FQDN lookup is "best effort". As such, if the lookup fails for some reason, we should not fail to start a Beat or otherwise halt execution. Instead, we should fall back to the OS-provided hostname and continue execution.
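The fallback behaviour described above can be illustrated with a short sketch. It is not the code from this PR: `lookupFunc`, `fqdnOrHostname`, and the two stub resolvers are hypothetical names, and the resolver is injected so the example runs without DNS. In a real program the standard library's `net.LookupCNAME` has a matching signature and could be passed in instead.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lookupFunc resolves a hostname to its canonical name (hypothetical type).
type lookupFunc func(hostname string) (string, error)

// fqdnOrHostname never fails outright: on a lookup error it returns the
// plain hostname, plus an error intended only for logging or metrics.
func fqdnOrHostname(hostname string, lookup lookupFunc) (string, error) {
	cname, err := lookup(hostname)
	if err != nil {
		return hostname, fmt.Errorf("could not get FQDN, using hostname %q: %w", hostname, err)
	}
	// Canonical names returned by DNS usually carry a trailing dot.
	return strings.TrimSuffix(cname, "."), nil
}

// stubResolves simulates a hostname that resolves via DNS.
func stubResolves(string) (string, error) {
	return "cowchicken.example.com.", nil
}

// stubFails simulates a hostname that does not resolve.
func stubFails(string) (string, error) {
	return "", errors.New("no such host")
}

func main() {
	name, err := fqdnOrHostname("cowchicken", stubResolves)
	fmt.Println(name, err)

	name, err = fqdnOrHostname("cowchicken", stubFails)
	fmt.Println(name, err)
}
```

The caller always gets a usable name, so startup can proceed; the returned error is informational and should be logged or counted rather than propagated.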
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

How to test this PR locally

1. Build Filebeat (or any other Beat) with this PR.

       cd beats/filebeat
       mage clean build

2. Set the hostname to something that wouldn't resolve via DNS. On macOS this can be done via:

       orig_hostname=$(hostname -f)
       sudo scutil --set HostName cowchicken

3. Run `filebeat version`. This command should report the version as usual and NOT fail with an FQDN-related error (as reported in Cannot start Beats - fails with error: could not get FQDN #34910).
4. Create a configuration file for the Beat that enables the `add_host_metadata` processor and also enables the FQDN feature flag.
5. Start the Beat with the above configuration and look for log entries mentioning `FQDN`. You should see two log entries in all, one from Beat initialization and one from the `add_host_metadata` processor.
6. Reset the hostname at the end of the test. On macOS this can be done via:

       sudo scutil --set HostName $orig_hostname